
List:       linux-smp
Subject:    Re: Nmi_watchdog and x86_64 lockups
From:       "Nielsen, Eric" <eric.nielsen () thomson ! com>
Date:       2005-03-11 0:20:12
Message-ID: BD964DF2053D1A498C2B182A9B2945B8031EF9A9 () eg-msgmbx-b07 ! int ! westgroup ! com

This is fine.  No crown jewels here.  Let it fly.

--------------------------
Sent from my BlackBerry Wireless Handheld


-----Original Message-----
From: Ulstad, Jeremy (TLR Corp) <jeremy.ulstad@thomson.com>
To: linux-smp@vger.kernel.org <linux-smp@vger.kernel.org>
CC: Bluhm, Mark (TLR Corp) <mark.bluhm@thomson.com>
Sent: Thu Mar 10 18:18:18 2005
Subject: Nmi_watchdog and x86_64 lockups

Having narrowly skirted death by allowing photographers near the lab today . . .

I would like to enlist the open source community in debugging our Oracle
problem.  The online kernel docs recommend reporting NMI issues (related
to our lockup/dump issue) to the linux-smp list.

I have composed the following email, but want to make sure you are
comfortable with me pursuing this.   I do not mention any application
details, but it is not possible to omit fairly detailed descriptions of the
hardware when submitting to the kernel list.   Not sure if that is kosher or
not.

Please let me know how I should proceed with this.

Domo Arigato.

Jeremy


Hypothetical email:
-------------------
I am looking for assistance with x86_64 SMP systems locking up.  Under a
heavy application workload, the system freezes and I am unable to send an
alt-sysrq-d to trigger a dump.  The systems boot with nmi_watchdog=1 set,
but the watchdog does not fire.  No oops is logged in messages and I have
observed nothing on the console (a directly attached KVM; I am working on
setting up a terminal server to log the serial console).
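
As a sanity check that the option actually reached the kernel, the boot
command line can be read back from /proc/cmdline.  Below is a minimal Python
sketch of that check.  The helper names (parse_cmdline, read_cmdline) and the
sample string are illustrative assumptions, not output from these systems.

```python
def parse_cmdline(text):
    """Split a kernel command line into a dict of option -> value.

    Options without '=' (for example 'quiet') map to None.
    """
    options = {}
    for token in text.split():
        key, sep, value = token.partition("=")
        options[key] = value if sep else None
    return options


def read_cmdline(path="/proc/cmdline"):
    """Read and parse the running kernel's command line."""
    with open(path) as handle:
        return parse_cmdline(handle.read())


if __name__ == "__main__":
    # Illustrative sample only; on a live system call read_cmdline() instead.
    sample = "root=/dev/sda1 nmi_watchdog=1 mem=3800m quiet"
    opts = parse_cmdline(sample)
    print("nmi_watchdog =", opts.get("nmi_watchdog"))
    print("mem =", opts.get("mem"))
```

If nmi_watchdog is missing from the parsed options, the setting never made it
from the boot loader to the kernel, which would be a different problem from an
unsupported processor.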

According to nmi_watchdog.txt, I should see non-zero counters in
/proc/interrupts with this enabled, or "you probably have a processor that
needs to be added to the nmi code".

The lockups are occurring in two separate configurations (details below),
both of which show all zeros for NMI in /proc/interrupts.  Any advice on
whether these configurations are supported by the NMI code, or suggestions
for how to get a dump successfully, would be most appreciated.

Thanks in advance,

Jeremy Ulstad

Config 1:  2 x AMD Opteron 240 (8 GB RAM)
SLES 9
Linux number6 2.6.5-7.111.19-smp #1 SMP Fri Dec 10 15:10:58 UTC 2004 x86_64 x86_64 x86_64 GNU/Linux

number6:~ # cat /proc/interrupts 
           CPU0       CPU1       
  0:     383170   23276745    IO-APIC-edge  timer
  1:          9        227    IO-APIC-edge  i8042
  2:          0          0          XT-PIC  cascade
  8:          0          0    IO-APIC-edge  rtc
  9:          0          0   IO-APIC-level  acpi
 12:        207          0    IO-APIC-edge  i8042
 14:       4900      57432    IO-APIC-edge  ide0
 15:         54          0    IO-APIC-edge  ide1
 19:          0          0   IO-APIC-level  ohci_hcd, ohci_hcd
 27:  327047839          0   IO-APIC-level  eth0, eth1
NMI:          0          0 
LOC:   23656684   23657709 
ERR:          0
MIS:          0

Config 2: 4 x AMD Opteron 850 (8 GB RAM)
SLES 9
Linux riddick 2.6.5-7.145-smp #1 SMP Thu Jan 27 09:19:29 UTC 2005 x86_64 x86_64 x86_64 GNU/Linux

riddick:~ # cat /proc/interrupts 
           CPU0       CPU1       CPU2       CPU3       
  0:   20317266   25048606   25048495   25048500    IO-APIC-edge  timer
  1:          9          0          0          0    IO-APIC-edge  i8042
  2:          0          0          0          0          XT-PIC  cascade
  4:        652         92          0          0    IO-APIC-edge  serial
  8:          0          0          0          0    IO-APIC-edge  rtc
  9:          0          0          0          0   IO-APIC-level  acpi
 12:         59          0          0          0    IO-APIC-edge  i8042
 15:         63          4          0          0    IO-APIC-edge  ide1
 19:          0          0          0          0   IO-APIC-level  ohci_hcd, ohci_hcd
 25:   93875682          0          1         81   IO-APIC-level  eth0
 27:          0     275078      99550       4603   IO-APIC-level  ioc0
NMI:          0          0          0          0 
LOC:   95441672   95441724   95441724   95441606 
ERR:          0
MIS:          0
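
For anyone who wants to run the same check on their own hardware, the NMI row
can be pulled out of /proc/interrupts mechanically.  This is a minimal Python
sketch, not anything from the kernel tree.  The function names (nmi_counts,
watchdog_ticking) are hypothetical, and the sample rows are copied from the
config 1 output above.

```python
def nmi_counts(interrupts_text):
    """Return the per-CPU NMI counters from /proc/interrupts text.

    Returns an empty list if no NMI row is present.
    """
    for line in interrupts_text.splitlines():
        label, sep, rest = line.partition(":")
        if sep and label.strip() == "NMI":
            # Keep only the leading numeric fields, in case a text
            # description follows the counters on the same row.
            counts = []
            for field in rest.split():
                if not field.isdigit():
                    break
                counts.append(int(field))
            return counts
    return []


def watchdog_ticking(interrupts_text):
    """True if at least one CPU shows a non-zero NMI count."""
    return any(count > 0 for count in nmi_counts(interrupts_text))


if __name__ == "__main__":
    # Rows copied from the config 1 output; on a live system, read
    # /proc/interrupts instead.
    sample = (
        "           CPU0       CPU1\n"
        "  0:     383170   23276745    IO-APIC-edge  timer\n"
        "NMI:          0          0 \n"
        "LOC:   23656684   23657709 \n"
    )
    print(nmi_counts(sample))
    print(watchdog_ticking(sample))
```

Taking two readings a few seconds apart and comparing them would show whether
the counters are advancing, which is a stronger test than a single non-zero
value.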

I should also note that all the config 1 systems are being limited to 3.8 GB
of memory with "mem=3800m" to work around an lkcd bug that causes dumps
(triggered manually while the system is up) to fail with >= 4 GB of RAM.
-
To unsubscribe from this list: send the line "unsubscribe linux-smp" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html