
List:       netbsd-port-amd64
Subject:    major ATI IXP ahcisata lossage and unexpected raidframe save
From:       David Brownlee <abs () absd ! org>
Date:       2009-07-31 20:00:50
Message-ID: alpine.NEB.2.00.0907312100280.21994 () localhost

   	I have an ABIT A-S78H 780G Socket AM2+ running NetBSD-5/amd64
   	and suffered major lossage on the SATA controller.

   	Under disk load the system would suddenly start churning out
   	messages like:

   	    wd0a: error writing fsbn 891505504 of 891505504-891505535 (wd0 bn 891505567; cn 884430 tn 2 sn 1), retrying
   	    wd0: (interface CRC error)
   	    ahcisata0 port 0: device present, speed: 3.0Gb/s
   	    wd0: soft error (corrected)
   	    wd0a: error writing fsbn 1170739456 of 1170739456-1170739487 (wd0 bn 1170739519; cn 1161447 tn 14 sn 61), retrying
   	    wd0: (interface CRC error)
   	    ahcisata0 port 0: device present, speed: 3.0Gb/s
   	    wd0: soft error (corrected)


   	This had happened once before, but a reboot seemed to clear
   	it. This time it just kept coming back. A RAID rebuild would
   	trigger it, as would a cvs update or even an rsync. It managed
   	to toast my pkgsrc checkout and a locally hosted svn repo. At
   	one point I left it retrying for a couple of hours without any
   	benefit. Switching the SATA controller from AHCI to Native
   	or Compatible IDE in the BIOS didn't help.
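
   	The block numbers in the write errors above can be cross-checked
   	against each other. What follows is a minimal Python sketch, not
   	anything NetBSD ships: the regex is fitted only to the two messages
   	quoted here, and the CHS formula (with a zero-based sn) is an
   	assumption that happens to reproduce the reported bn for both
   	samples, using the 16 head / 63 sector geometry the dmesg reports
   	for wd0.

```python
import re

# Fitted to the "error writing fsbn ..." lines quoted in this message only;
# other kernel versions may format the line differently.
ERR_RE = re.compile(
    r"(?P<part>wd\d+[a-p]): error (?P<op>reading|writing) fsbn (?P<fsbn>\d+)"
    r" of (?P<first>\d+)-(?P<last>\d+) \((?P<disk>wd\d+) bn (?P<bn>\d+);"
    r" cn (?P<cn>\d+) tn (?P<tn>\d+) sn (?P<sn>\d+)\)"
)

# Geometry as reported by dmesg for wd0 (16 head, 63 sec).
HEADS, SECTORS = 16, 63


def decode(line):
    """Return the numeric fields of one error line, or None if it is not one."""
    m = ERR_RE.search(line)
    if m is None:
        return None
    f = {k: (int(v) if v.isdigit() else v) for k, v in m.groupdict().items()}
    # Difference between the whole-disk block and the partition-relative block.
    f["offset"] = f["bn"] - f["fsbn"]
    # Assumed mapping: bn = (cn * heads + tn) * sectors + sn, sn zero-based.
    f["chs_bn"] = (f["cn"] * HEADS + f["tn"]) * SECTORS + f["sn"]
    return f


if __name__ == "__main__":
    sample = ("wd0a: error writing fsbn 891505504 of 891505504-891505535 "
              "(wd0 bn 891505567; cn 884430 tn 2 sn 1), retrying")
    print(decode(sample))
```

   	For both quoted errors bn minus fsbn comes out as 63, which would
   	be consistent with wd0a starting at sector 63, though that is an
   	inference from the numbers rather than something the logs state.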

   	The system has two ~1TB RAID1 raidframe mirrors, one for
   	system and another for archive data. The second one hasn't
   	triggered any issues (though obviously doesn't have the
   	same usage pattern).

   	Pulling one or the other of the 'problem' disks in the RAID
   	mirror just left the errors on the other. Testing one
   	of them in another machine did not reproduce the
   	issue.

   	Potentially relevant dmesg lines:

   	    ahcisata0 at pci0 dev 17 function 0: vendor 0x1002 product 0x4391
   	    ahcisata0: interrupting at ioapic0 pin 22
   	    ahcisata0: AHCI revision 1.1, 6 ports, 32 command slots, features 0xf7228080
   	    atabus0 at ahcisata0 channel 0
   	    atabus1 at ahcisata0 channel 1
   	    atabus2 at ahcisata0 channel 2
   	    atabus3 at ahcisata0 channel 3
   	    atabus4 at ahcisata0 channel 4
   	    atabus5 at ahcisata0 channel 5
   	    ahcisata0 port 1: device present, speed: 3.0Gb/s
   	    ahcisata0 port 3: device present, speed: 3.0Gb/s
   	    ahcisata0 port 4: device present, speed: 3.0Gb/s
   	    ahcisata0 port 5: device present, speed: 3.0Gb/s
   	    wd0 at atabus1 drive 0: <SAMSUNG HD103UJ>
   	    wd0: drive supports 16-sector PIO transfers, LBA48 addressing
   	    wd0: 931 GB, 1938021 cyl, 16 head, 63 sec, 512 bytes/sect x 1953525168 sectors
   	    wd0: drive supports PIO mode 4, DMA mode 2, Ultra-DMA mode 7
   	    wd0(ahcisata0:1:0): using PIO mode 4, DMA mode 2, Ultra-DMA mode 6 (Ultra/133) (using DMA)
   	    wd1 at atabus3 drive 0: <SAMSUNG HD154UI>
   	    wd1: drive supports 16-sector PIO transfers, LBA48 addressing
   	    wd1: 1397 GB, 2907021 cyl, 16 head, 63 sec, 512 bytes/sect x 2930277168 sectors
   	    wd1: drive supports PIO mode 4, DMA mode 2, Ultra-DMA mode 7
   	    wd1(ahcisata0:3:0): using PIO mode 4, DMA mode 2, Ultra-DMA mode 6 (Ultra/133) (using DMA)
   	    wd2 at atabus4 drive 0: <SAMSUNG HD154UI>
   	    wd2: drive supports 16-sector PIO transfers, LBA48 addressing
   	    wd2: 1397 GB, 2907021 cyl, 16 head, 63 sec, 512 bytes/sect x 2930277168 sectors
   	    wd2: drive supports PIO mode 4, DMA mode 2, Ultra-DMA mode 7
   	    wd2(ahcisata0:4:0): using PIO mode 4, DMA mode 2, Ultra-DMA mode 6 (Ultra/133) (using DMA)
   	    wd3 at atabus5 drive 0: <SAMSUNG HD154UI>
   	    wd3: drive supports 16-sector PIO transfers, LBA48 addressing
   	    wd3: 1397 GB, 2907021 cyl, 16 head, 63 sec, 512 bytes/sect x 2930277168 sectors
   	    wd3: drive supports PIO mode 4, DMA mode 2, Ultra-DMA mode 7
   	    wd3(ahcisata0:5:0): using PIO mode 4, DMA mode 2, Ultra-DMA mode 6 (Ultra/133) (usin
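
   	The geometry figures in the dmesg above are internally consistent,
   	which a few lines of Python can confirm. This is only an
   	illustrative sketch: the function name is made up here, and
   	treating the reported "GB" as units of 2^30 bytes is an assumption
   	that matches the figures shown.

```python
def describe(cyl, heads, secs, bytes_per_sect, total_sectors):
    """Check CHS against the sector count and compute the size in 2**30-byte units."""
    chs_matches = cyl * heads * secs == total_sectors
    size_gib = total_sectors * bytes_per_sect // 2**30
    return chs_matches, size_gib


if __name__ == "__main__":
    # Values copied from the wd0 and wd1 lines in the dmesg.
    print(describe(1938021, 16, 63, 512, 1953525168))
    print(describe(2907021, 16, 63, 512, 2930277168))
```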

   	Anyone seen anything similar?

   	It seems to have gone away for now, but I'm specifically not
   	stressing those disks until I get some services migrated off
   	this box.

   	Oh, and the raidframe save: at one point raidframe hard failed one
   	of the disks, so the subsequent reboot death spiral left it alone.
   	So *that* became the 'good' disk (which didn't have the svn
   	repo hosed).
   	Of course I have regular dirvish backups of everything, but it's
   	nice to be able to use them as a check rather than a rebuild.

