List: linux-smp
Subject: Re: Nmi_watchdog and x86_64 lockups
From: "Nielsen, Eric" <eric.nielsen () thomson ! com>
Date: 2005-03-11 0:20:12
Message-ID: BD964DF2053D1A498C2B182A9B2945B8031EF9A9 () eg-msgmbx-b07 ! int ! westgroup ! com
This is fine. No crown jewels here. Let it fly.
--------------------------
Sent from my BlackBerry Wireless Handheld
-----Original Message-----
From: Ulstad, Jeremy (TLR Corp) <jeremy.ulstad@thomson.com>
To: linux-smp@vger.kernel.org <linux-smp@vger.kernel.org>
CC: Bluhm, Mark (TLR Corp) <mark.bluhm@thomson.com>
Sent: Thu Mar 10 18:18:18 2005
Subject: Nmi_watchdog and x86_64 lockups
Having narrowly skirted death by allowing photographers near the lab today...
I would like to enlist the open source community in debugging our Oracle
problem. Online kernel docs recommend reporting issues with NMI (related
to our lockup/dump issue) to the linux-smp list.
I have composed the following email, but want to make sure you are
comfortable with me pursuing this. I do not mention any application
details, but it is not possible to omit fairly detailed descriptions of the
hardware when submitting to the kernel list. Not sure if that is kosher or
not.
Please let me know how I should proceed with this.
Domo Arigato.
Jeremy
Hypothetical email:
-------------------
I am looking for assistance with x86_64 SMP systems locking up. Under a
heavy application workload, the system freezes and I am unable to send an
alt-sysrq-d to trigger a dump. The systems are booting with nmi_watchdog=1
set, but the watchdog is not working. No oops events are registered in
messages and I have observed nothing on the console (direct attached KVM -
working on setting up a term server and logging serial console).
According to nmi_watchdog.txt, I should see non-zero counters in
/proc/interrupts with this enabled or "you probably have a processor that
needs to be added to the nmi code".
The lockups are occurring in two separate configurations (details below),
both of which show all zeros for NMI in /proc/interrupts. Any advice on
whether these configurations are supported by the NMI code, or suggestions
for how to successfully get a dump, would be most appreciated.
Thanks in advance,
Jeremy Ulstad
Config 1: 2 x AMD Opteron 240 (8 GB RAM)
SLES 9
Linux number6 2.6.5-7.111.19-smp #1 SMP Fri Dec 10 15:10:58 UTC 2004 x86_64 x86_64 x86_64 GNU/Linux
number6:~ # cat /proc/interrupts
           CPU0       CPU1
  0:     383170   23276745    IO-APIC-edge  timer
  1:          9        227    IO-APIC-edge  i8042
  2:          0          0          XT-PIC  cascade
  8:          0          0    IO-APIC-edge  rtc
  9:          0          0   IO-APIC-level  acpi
 12:        207          0    IO-APIC-edge  i8042
 14:       4900      57432    IO-APIC-edge  ide0
 15:         54          0    IO-APIC-edge  ide1
 19:          0          0   IO-APIC-level  ohci_hcd, ohci_hcd
 27:  327047839          0   IO-APIC-level  eth0, eth1
NMI:          0          0
LOC:   23656684   23657709
ERR:          0
MIS:          0
Config 2: 4 x AMD Opteron 850 (8 GB RAM)
SLES 9
Linux riddick 2.6.5-7.145-smp #1 SMP Thu Jan 27 09:19:29 UTC 2005 x86_64 x86_64 x86_64 GNU/Linux
riddick:~ # cat /proc/interrupts
           CPU0       CPU1       CPU2       CPU3
  0:   20317266   25048606   25048495   25048500    IO-APIC-edge  timer
  1:          9          0          0          0    IO-APIC-edge  i8042
  2:          0          0          0          0          XT-PIC  cascade
  4:        652         92          0          0    IO-APIC-edge  serial
  8:          0          0          0          0    IO-APIC-edge  rtc
  9:          0          0          0          0   IO-APIC-level  acpi
 12:         59          0          0          0    IO-APIC-edge  i8042
 15:         63          4          0          0    IO-APIC-edge  ide1
 19:          0          0          0          0   IO-APIC-level  ohci_hcd, ohci_hcd
 25:   93875682          0          1         81   IO-APIC-level  eth0
 27:          0     275078      99550       4603   IO-APIC-level  ioc0
NMI:          0          0          0          0
LOC:   95441672   95441724   95441724   95441606
ERR:          0
MIS:          0
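For anyone who wants to repeat the NMI check described above on their own machine, here is a minimal Python sketch of it. It is only an illustration: the function names and the trimmed SAMPLE text are mine, and treating "every CPU has a non-zero NMI count" as the sign of a working watchdog is my assumption about how to read the rule quoted from nmi_watchdog.txt.

```python
def nmi_counts(interrupts_text):
    """Return the per-CPU NMI counts from /proc/interrupts-style text."""
    for line in interrupts_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "NMI:":
            counts = []
            for token in fields[1:]:
                # Stop at the first non-numeric field; some kernels append
                # a text description after the counters.
                if not token.isdigit():
                    break
                counts.append(int(token))
            return counts
    return []  # no NMI row found


def nmi_watchdog_ticking(interrupts_text):
    """True only if an NMI row exists and every CPU count is non-zero.

    Assumption: requiring all CPUs to be non-zero is a choice made for
    this sketch, not something the kernel documentation mandates.
    """
    counts = nmi_counts(interrupts_text)
    return bool(counts) and all(count > 0 for count in counts)


# Trimmed sample modeled on the Config 1 output above.
SAMPLE = """\
           CPU0       CPU1
  0:     383170   23276745    IO-APIC-edge  timer
NMI:          0          0
LOC:   23656684   23657709
"""

if __name__ == "__main__":
    try:
        with open("/proc/interrupts") as handle:
            text = handle.read()
    except OSError:
        text = SAMPLE  # fall back to the sample when /proc is unavailable
    print("NMI counts:", nmi_counts(text))
    print("watchdog ticking:", nmi_watchdog_ticking(text))
```

Running it twice a few seconds apart and comparing the counts gives a stronger signal than a single reading, since a working watchdog should keep incrementing them.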
I should also note that all the config 1 systems are being limited to 3.8 GB
of memory with "mem=3800m" to work around an lkcd bug that causes dumps
(triggered manually while the system is up) to fail with >= 4 GB of RAM.
-
To unsubscribe from this list: send the line "unsubscribe linux-smp" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html