List:       linux-smp
Subject:    FC4 crashes repeatedly on Supermicro AS1020A-T dual-core Opterons, SMP
From:       Michal Szymanski <msz () astrouw ! edu ! pl>
Date:       2006-04-18 19:11:02
Message-ID: 20060418191102.GA15132 () astrouw ! edu ! pl

Hi all,

I have recently purchased three Supermicro AS1020A-T servers, each
equipped with two dual-core Opteron 280s, H8DAR-T motherboards, and 8 or
12 GB RAM. The systems run FC4 x86_64 with a proprietary driver (made by
Adaptec) for the onboard Marvell 88SX6041 SATA controller. They use the
original (install) kernel, 2.6.11-1.1369_FC4smp, which unfortunately
cannot be upgraded because the SATA driver is not available for other
kernel versions.

All systems crash (either hang with some "machine check exception"
kernel messages or reset) when loaded with repeated runs of a 1.3 GB,
CPU-intensive job with some I/O. I run 2 or 4 jobs simultaneously and
they have never survived more than a few hours.

Suspecting a SATA driver problem, I mounted /tmp as "tmpfs" and repeated
the tests entirely in /tmp (with plenty of RAM this means (IMHO) doing
I/O in memory). No success.

It is somewhat better when I run no-I/O jobs of similar size, but these
also crash, although less frequently.

I tried installing the i386 version; it also crashes. Same (or even
worse) with FC3.

Memtest does not show any RAM errors. 

Finally I did two tests which seem to have excluded SATA
controller/driver as the reason for crashes:

1. I installed an additional IDE hard disk and put an FC4/x86_64 system
on it (without the Adaptec driver, so the system does not even see the
SATA disks), then updated the kernel to the latest (2.6.16). It also
crashed.

2. I ran the non-SMP 2.6.11 kernel (with the Adaptec driver) on another
machine. Two test jobs, each repeating the 1.3 GB run and each getting
50% of the single CPU used by the system, have been running on it for
over 50 hours now with no crashes. Also, a single test job running on
the SMP kernel gave no crashes in 24 hours.
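
For anyone who wants to reproduce this kind of load, here is a rough
sketch of a parallel memory-touching job. It is a hypothetical stand-in,
not the actual test jobs: the names burn and run_jobs, the one-byte-per-
4096-bytes stride and the checksum are all assumptions made for
illustration. Only the job count and the 1.3 GB size come from the
description above.

```python
import multiprocessing as mp


def burn(size_bytes, passes):
    """Allocate a buffer and repeatedly modify one byte per 4096-byte
    stride, keeping the CPU busy while touching all of the memory.
    Returns a simple checksum of the values written."""
    buf = bytearray(size_bytes)
    total = 0
    for p in range(passes):
        for i in range(0, size_bytes, 4096):
            buf[i] = (buf[i] + p + 1) & 0xFF
            total += buf[i]
    return total


def run_jobs(n_jobs, size_bytes, passes):
    """Run n_jobs copies of burn() in parallel, one worker process each,
    and return the list of checksums."""
    with mp.Pool(n_jobs) as pool:
        return pool.starmap(burn, [(size_bytes, passes)] * n_jobs)
```

Calling run_jobs(4, 1300 * 2**20, 1000) would approximate four
simultaneous 1.3 GB jobs; run_jobs(1, ...) corresponds to the single-job
case. On platforms that start worker processes by spawning, the call
should sit under an "if __name__ == '__main__':" guard.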

It seems there is a problem with the SMP kernel and dual-core Opterons,
at least on this hardware. I am stuck with three high-end machines which
can work at only 25% of their nominal CPU power. Any hints would be
appreciated.

regards, Michal.

-- 
  Michal Szymanski (msz at astrouw dot edu dot pl)
  Warsaw University Observatory, Warszawa, POLAND