'RE: Disconcerting journal commit I/O error RHEL+2.6.9-34.0.2'


List:       linux-poweredge
Subject:    RE: Disconcerting journal commit I/O error RHEL+2.6.9-34.0.2
From:       "Jason Wozniak" <jwozniak () henkels ! com>
Date:       2006-07-28 14:13:12
Message-ID: E4D849A083AC424697580C93F731F470027E099E () pabbhex1 ! henkels ! com

I've experienced lockups on my 2850s, but I'm not sure it was the
exact problem you describe.  My 6850 locked up without logging any
errors; my 2850, however, logged errors for the virtual cdrom, and I had
seen the same errors on my 6850 before.  I disabled the virtual cdrom in
the DRAC and haven't had a problem since.  I've got my fingers crossed,
as it's only been about a month since the last occurrence.

I had also experienced lockups on another 6850 and a 2850 when running
diagnostics on the DRAC under Server Administrator 4.5.  I reproduced
this with Dell on site to investigate the problem (after we made lots
of noise).  The point is I've had nothing but problems with the Dell
DRACs over the past year.  "If you're in doubt and don't need it,
disable it" is my motto.

I'm still running Server Administrator 5.0, with the DRACs in all my
servers disabled, on Red Hat 2.6.9-34.0.2.ELsmp/largesmp.  That setup
has been in production for the past month with no further issues.  I
also run on-disk Oracle backups of half a terabyte, which is a lot of
writes.  I've seen the journal commit I/O error in the past on a 6650,
generally when I experienced problems with the RAID controller and saw
errors on the SCSI bus.  I don't recall seeing this error recently.

I saw one of the Dell techs post on this list that the Dell DRAC can
reset for various reasons, which looks like a hot plug event on the PCI
Express bus, and that the workaround is to set hdf=ide-scsi in the
kernel boot options.  However, Linus Torvalds describes this module as
cr*p, to put it bluntly, so I find disabling the DRAC's virtual cdrom to
be a much more satisfying way of making sure it doesn't screw with the
other SCSI devices in the system.

Our problems could be completely unrelated, but since you seem to be
grasping at straws I thought I'd give you my 2 cents on which one to
grasp.

-----Original Message-----
From: linux-poweredge-bounces@dell.com
[mailto:linux-poweredge-bounces@dell.com] On Behalf Of Jason Young
Sent: Friday, July 28, 2006 9:12 AM
To: linux-poweredge@dell.com
Subject: Disconcerting journal commit I/O error RHEL+2.6.9-34.0.2

Hi all,

Two weeks ago, right after the Red Hat Kernel update to  
2.6.9-34.0.2.ELsmp (RHEL4) I started getting a journal commit I/O  
error on two of my servers with the srvadmin-all rpm's installed  
(version 5).

One was a 2800 running WS the other a 2850 running AS - both with the  
OEM PERC controllers that came with the servers.     All firmware/ 
bios updates are up to the latest release versions available.

The error came after a moderate amount of writes (either installing  
ruby on the freshly reinstalled 2850 or processing some webstats with  
awstats on the 2800) - and when the journal commit error occurs -  
every mounted volume goes read only - which obviously wreaks havoc on  
the running operating system.   The problem occurred twice on the  
2800, and once on the 2850.    It was not (yet) occurring on my other
2850's and 1850's - running RHELv4, WS and AS both - also with the
version 5 srvadmin-all rpm's (and the srvadmin-rac4 RPM's where
appropriate).  Those were/are still running 2.6.9-34.0.1.

My filesystem is a normal primary ext3 /boot, and the rest of the  
RAID (either all RAID5 or a two disk RAID1 and 3 disk RAID5  on the  
six-drive 2850) is a PV with various sized LVM2 logical volumes for  
slash, /var, /home, etc.

The problem freaked me out more than a little.  The two servers it was
happening on are not-yet-production, and obviously the last thing I
needed was for the problem to spread to production systems.  There are
no logs, obviously, because /var goes read-only like everything else.

Grasping at "what changed" straws - I froze going to kernel  
2.6.9-34.0.2 everywhere else - and proceeded to pull the Dell  
srvadmin RPM's everywhere (I know that openipmi is a kernel module,  
and didn't want it to be a question mark).

- No problems on the 2.6.9-34.0.2 boxes since I pulled openipmi and  
the other rpm's.
- Still no problems with the 2.6.9-34.0.1 boxes.   I have a few  
vmware (esx) VM's that have gone to 2.6.9-34.0.2 without problem, but  
no other physical servers.

I'd like to put the Dell software back, because I like it, and I'm
not sure it's the culprit at all.   But I'm a bit gun-shy at the
moment, and like the fact that the filesystems aren't "locking up" on
me anymore.   I'm also at a bit of a loss as to how to troubleshoot the
problem, since nothing can get logged when it happens.    Logs up
until it happens didn't give me any indication of a pending problem.
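
Since nothing can be written locally once the volumes flip, one way to
at least confirm the read-only state is to read it from /proc, which
does not depend on /var being writable.  What follows is only a minimal
sketch, not something I have run on these servers: the function names
read_only_mounts and check_host are made up for illustration, and it
assumes the usual /proc/mounts layout (device, mount point, filesystem
type, options).

```python
def read_only_mounts(mounts_text, fstypes=("ext3",)):
    """Return mount points of the given filesystem types mounted read-only.

    mounts_text is the content of /proc/mounts: one mount per line, with
    whitespace-separated fields (device, mount point, fstype, options, ...).
    """
    hits = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        mount_point, fstype, options = fields[1], fields[2], fields[3]
        # Options are comma-separated; "ro" marks a read-only mount.
        if fstype in fstypes and "ro" in options.split(","):
            hits.append(mount_point)
    return hits


def check_host(path="/proc/mounts"):
    """Hypothetical helper: print every ext3 mount that is currently read-only."""
    with open(path) as f:
        for mount_point in read_only_mounts(f.read()):
            print("read-only:", mount_point)
```

Calling check_host() only reads /proc and prints to stdout, so the
output could be captured from another machine (over ssh, for example,
assuming the box still accepts logins) rather than written to the
affected disks.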

Thoughts?  ideas?

Jason
--
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Jason Young --  Systems Manager, eXtension
  http://about.extension.org/wiki/Jason_Young
______________________________________

_______________________________________________
Linux-PowerEdge mailing list
Linux-PowerEdge@dell.com
http://lists.us.dell.com/mailman/listinfo/linux-poweredge
Please read the FAQ at http://lists.us.dell.com/faq
