'EATA mbox in use errors on a DPT 3334UW'

List:       linux-scsi
Subject:    EATA mbox in use errors on a DPT 3334UW
From:       Michael Brennen <mbrennen () fni ! com>
Date:       2001-11-15 18:40:47

I am not subscribed to the list; please cc: me directly on all
replies.  TIA...

I'm having ongoing sporadic problems with a DPT 3334UW controller
giving mbox errors.  Strange as it sounds, I suspect that this may
somehow be network related, for reasons explained below.  This is my
main web server, so it is causing a few problems.  If anyone can
give me any insight on this I would certainly appreciate it.

The dmesg information from bootup is below. There are eight IBM 4 GB
drives in a RAID5 and two WD 4 GB drives in a RAID1.

dmesg information:
---------------------------------------------------------------------
EATA0: IRQ 12 mapped to IO-APIC IRQ 18.
EATA/DMA 2.0x: Copyright (C) 1994-1999 Dario Ballabio.
EATA config options -> tc:n, lc:n, mq:2, eh:y, rs:y, et:n.
EATA0: 2.0C, PCI 0xe810, IRQ 18, BMST, SG 122, MB 64.
EATA0: wide SCSI support enabled, max_id 16, max_lun 8.
EATA0: SCSI channel 0 enabled, host target ID 7.
EATA0: SCSI channel 1 enabled, host target ID 7.
scsi0 : EATA/DMA 2.0x rev. 5.11.01
scsi : 1 host.
  Vendor: DPT       Model: RAID-1            Rev: 07M0
  Type:   Direct-Access                      ANSI SCSI revision: 02
Detected scsi disk sda at scsi0, channel 0, id 1, lun 0
  Vendor: DPT       Model: RAID-5            Rev: 07M0
  Type:   Direct-Access                      ANSI SCSI revision: 02
Detected scsi disk sdb at scsi0, channel 0, id 2, lun 0
EATA0: scsi0, channel 0, id 1, lun 0, cmds/lun 2, unsorted, untagged.
EATA0: scsi0, channel 0, id 2, lun 0, cmds/lun 2, unsorted, untagged.
SCSI device sda: hdwr sector= 512 bytes. Sectors= 8387802 [4095 MB] [4.1 GB]
 sda: sda1 < sda5 sda6 sda7 >
SCSI device sdb: hdwr sector= 512 bytes. Sectors= 53328000 [26039 MB] [26.0 GB]
 sdb: sdb1
---------------------------------------------------------------------

A typical crash situation is below.  These (or similar) messages
come up on the main console, and then the server is frozen. The
network is still up and responding to pings, but other than that the
box is not responsive and must be reset.

---------------------------------------------------------------------
EATA0, abort, mbox 2, target 0.1:0, pid 13688476
EATA0, abort, mbox 2 is in use
EATA0, abort, mbox 2, eh_state timeout, pid 13688476
EATA0, abort, mbox 3, target 0.1:0, pid 13688478
EATA0, abort, mbox 3 is in use
EATA0, abort, mbox 3, eh_state timeout, pid 13688478
EATA0, ihdlr, mbox 2 is free, count 13688646
EATA0, ihdlr, mbox 3 is free, count 13688647
---------------------------------------------------------------------
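
For anyone comparing several of these crashes, here is a minimal sketch
that tabulates the console messages per mailbox.  It is only an
illustration: the regular expressions are inferred from the eight lines
quoted above, not taken from the driver source, and the function name
summarize() is made up.  Reading "target 0.1:0" as channel 0, id 1,
lun 0 is an assumption; if it is right, both aborted commands were
addressed to the device that dmesg reports as sda (the RAID-1).

```python
import re

# Console messages copied verbatim from the crash report above.
LOG = """\
EATA0, abort, mbox 2, target 0.1:0, pid 13688476
EATA0, abort, mbox 2 is in use
EATA0, abort, mbox 2, eh_state timeout, pid 13688476
EATA0, abort, mbox 3, target 0.1:0, pid 13688478
EATA0, abort, mbox 3 is in use
EATA0, abort, mbox 3, eh_state timeout, pid 13688478
EATA0, ihdlr, mbox 2 is free, count 13688646
EATA0, ihdlr, mbox 3 is free, count 13688647
"""

# Patterns inferred from the quoted lines only (an assumption, not the
# driver's documented format).
ABORT_RE = re.compile(
    r"^(?P<host>EATA\d+), abort, mbox (?P<mbox>\d+), "
    r"target (?P<channel>\d+)\.(?P<id>\d+):(?P<lun>\d+), pid (?P<pid>\d+)$"
)
TIMEOUT_RE = re.compile(
    r"^(?P<host>EATA\d+), abort, mbox (?P<mbox>\d+), "
    r"eh_state timeout, pid (?P<pid>\d+)$"
)
FREE_RE = re.compile(
    r"^(?P<host>EATA\d+), ihdlr, mbox (?P<mbox>\d+) is free, "
    r"count (?P<count>\d+)$"
)
PATTERNS = (("abort", ABORT_RE), ("timeout", TIMEOUT_RE), ("free", FREE_RE))


def summarize(text):
    """Return {(host, mbox): record} built from abort/ihdlr console lines."""
    records = {}
    for raw in text.splitlines():
        line = raw.strip()
        for kind, pattern in PATTERNS:
            m = pattern.match(line)
            if not m:
                continue
            key = (m.group("host"), int(m.group("mbox")))
            rec = records.setdefault(
                key, {"timed_out": False, "freed_count": None}
            )
            if kind == "abort":
                # Assumed meaning of "0.1:0": channel.id:lun
                rec["target"] = tuple(
                    int(m.group(g)) for g in ("channel", "id", "lun")
                )
                rec["pid"] = int(m.group("pid"))
            elif kind == "timeout":
                rec["timed_out"] = True
            else:
                rec["freed_count"] = int(m.group("count"))
            break
    return records


if __name__ == "__main__":
    for (host, mbox), rec in sorted(summarize(LOG).items()):
        print(host, "mbox", mbox, rec)
```

Lines that match none of the three patterns (such as "mbox 2 is in
use") are simply skipped, so the same function can be fed a longer
console capture.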

In my experience the DPT controller is good at detecting bad drives
and removing them from the array.  I don't think that has happened,
as all drives are active.

The hardware is a dual P-III 750 with a gigabyte of RAM; it is very
stable, with no problems compiling the kernel for months (i.e. good
RAM).  This is the latest 2.2.20 kernel source, with the latest
ReiserFS 3.5.34 patches applied.  The RAID1 is an ext2 file system;
the RAID5 is a Reiser file system.  The base installation is
Mandrake 7.1, with glibc 2.1.3.  The web server is running Apache
1.3.22 that I've built myself with various needed modules.

What makes me think that this is network related is that there is
almost always an exact correlation between an IIS server stopping on
a W2K box and the crash on the linux apache server.  The W2K box is
patched up to current specs against all the worms running around.
The W2K box does not crash, but IIS stops.  Issuing the command
'iisreset /restart' from a command prompt fixes the problem.
Strangely, other IIS servers on the same network are unaffected.

Tonight the linux box crashed again, and for the first time the IIS
server was unaffected.  ???

I do sometimes see the kernel message 'suspect tcp fragment' on the
linux console, but I've not been able to correlate these with any
crashes.

I am running the latest snort.  I am running the snortsnarf report,
but it is not showing anything unusual other than the typical
cmd.exe type attempts.  I've not had time yet to write another snort
logger that summarizes traffic over time to see what might be logged
around the time of the reboots.
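
A rough sketch of the kind of summary meant here: count alerts per
fixed time window so that a burst just before a crash stands out.  It
assumes the alerts are in snort's one-line "fast" format, where each
line starts with a MM/DD-HH:MM:SS timestamp that carries no year.  The
sample lines, the addresses in them, the function name bucket_alerts()
and the default year are all made up for illustration.

```python
import re
from collections import Counter
from datetime import datetime

# Leading timestamp of a fast-format alert line, e.g. "11/15-18:31:02.104533".
TS_RE = re.compile(r"^(\d{2})/(\d{2})-(\d{2}):(\d{2}):(\d{2})")

# Hypothetical sample input; not taken from any real log.
SAMPLE = [
    "11/15-18:31:02.104533  [**] [1:1002:2] WEB-IIS cmd.exe access [**] "
    "{TCP} 192.0.2.10:1042 -> 192.0.2.1:80",
    "11/15-18:38:47.000001  [**] [1:1002:2] WEB-IIS cmd.exe access [**] "
    "{TCP} 192.0.2.11:2210 -> 192.0.2.1:80",
    "11/15-18:40:12.500000  [**] [1:1002:2] WEB-IIS cmd.exe access [**] "
    "{TCP} 192.0.2.10:1050 -> 192.0.2.1:80",
    "this line is not an alert and is ignored",
]


def bucket_alerts(lines, minutes=10, year=2001):
    """Count alert lines per window of `minutes` minutes.

    The year must be supplied by the caller because the timestamp in
    the assumed format does not include one.
    """
    if minutes <= 0 or 60 % minutes != 0:
        raise ValueError("minutes must be a positive divisor of 60")
    counts = Counter()
    for line in lines:
        m = TS_RE.match(line)
        if not m:
            continue  # skip anything that does not start with a timestamp
        month, day, hour, minute, _second = map(int, m.groups())
        window_start = minute - (minute % minutes)
        counts[datetime(year, month, day, hour, window_start)] += 1
    return counts


if __name__ == "__main__":
    for start, n in sorted(bucket_alerts(SAMPLE).items()):
        print(start.strftime("%Y-%m-%d %H:%M"), n)
```

Comparing the windows around each reset against a quiet baseline would
show whether the crashes line up with any change in logged traffic.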

This server has been online for probably about a year.  It has very
occasionally crashed with the mbox errors, but recently it has been
getting much worse; it crashed twice yesterday.  Yet it may go a
week or more without problems.

I checked it two days ago with a fresh 'chkrootkit', and it passed
fine.  I've always been on top of the security patches for the box,
and to the best of my knowledge it is unbroken.

In summary:

* Could this be caused by the Reiser filesystem on the 2.2 kernel?
I've researched the linux-kernel archives, and it seems that there
may yet be some very latent bugs in the 2.2 series.

* Could it be related in any way to tagged queueing?

* Could this be network related in any way you can think of?

Again, if you have any insight on this, or where to ask further, I
would very much appreciate it.

   -- Michael Brennen

