'[smartmontools-support]Re: Finding an error reported by selftest'


List:       smartmontools-support
Subject:    [smartmontools-support]Re: Finding an error reported by selftest
From:       Douglas Gilbert <dougg () torque ! net>
Date:       2004-12-24 13:17:25
Message-ID: 41CC16E5.8060007 () torque ! net

mike@coruscant.demon.co.uk wrote:

> Hi,
> 
> I've been playing with smartmontools, and have had an error diagnosed by
> one of the self-tests, and I'm hoping someone can provide some pointers as
> to what to do next...
> 
> Quick summary - system is my main desktop, Debian Linux on x86, 4 SCSI
> disks set up using kernel software RAID-5. All the md devices are
> formatted with XFS. After running "smartctl -t long /dev/sdb", I get the
> following output from "smartctl -a /dev/sdb":-
> 
> =======================================================
> smartctl version 5.32 Copyright (C) 2002-4 Bruce Allen
> Home page is http://smartmontools.sourceforge.net/
> 
> Device: IBM      DDYS-T18350N     Version: S96H
> Serial number:         5EG73154
> Device type: disk
> Transport protocol: Fibre channel (FCP-2)
> Local Time is: Thu Dec 23 12:16:29 2004 GMT
> Device supports SMART and is Enabled
> Temperature Warning Disabled or Not Supported
> SMART Health Status: OK
> 
> Current Drive Temperature:     49 C
> Drive Trip Temperature:        85 C
> Manufactured in week 07 of year 2001
> Current start stop count:      436 times
> Recommended maximum start stop count:  10000 times
> 
> Error counter log:
>           Errors Corrected    Total      Total   Correction     Gigabytes    Total
>               delay:       [rereads/    errors   algorithm      processed    uncorrected
>             minor | major  rewrites]  corrected  invocations   [10^9 bytes]  errors
> read:          0        0         4         8         48         24.088          40
> write:         0        0         0         0          0         83.515           0
> 
> Non-medium error count:        0
> 
> SMART Self-test log
> Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
>      Description                              number   (hours)
> # 1  Background long   Failed in segment -->       5  3010  0x         2168629 [0x4 0x3e 0x3]
> =======================================================
> 
> My questions are:-
> 
> 1/ How serious is this? Disk on way out now, or just a bad area?
> 2/ Can I map it out?
> 3/ How can I map the LBA address to a file?
> 
> Things I've read make it clear how to answer 3/ for ext2 filesystems,
> but not XFS on RAID, and 5 minutes with the XFS debugger has not
> provided enough enlightenment yet...
> 
> TIA,
> 
> Mike.
> 
> P.S. I'm not a list subscriber at the moment, so I'd really appreciate a
> CC on any replies. Thanks.

Mike,
Thanks for the report. There is (or was) something wrong with
sector 0x2168629 **. According to the manual of a recent Hitachi
disk, "segment number = 5" indicates an "ECC circuit test"
failure. The asc/ascq codes indicate "logical unit failed
self-test", which doesn't add any useful information.
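
For anyone who wants to pull these fields out of the log mechanically,
here is a minimal sketch in Python. It assumes the line layout of the one
failed entry quoted above, including the stray spaces after "0x", and it
treats the LBA digits as hex. The function name and the dictionary keys
are made up for illustration and are not part of smartmontools.

```python
import re

# Pattern based only on the single failed entry quoted in this thread;
# other smartctl versions may lay the line out differently.
LOG_LINE = re.compile(
    r"#\s*(?P<num>\d+)\s+(?P<desc>.+?)\s+Failed in segment -->\s+"
    r"(?P<segment>\d+)\s+(?P<hours>\d+)\s+"
    r"0x\s*(?P<lba>[0-9a-fA-F]+)\s+"
    r"\[(?P<sk>0x[0-9a-fA-F]+)\s+(?P<asc>0x[0-9a-fA-F]+)\s+"
    r"(?P<ascq>0x[0-9a-fA-F]+)\]"
)


def parse_failed_selftest(line):
    """Return the fields of a failed self-test log line, or None."""
    m = LOG_LINE.search(line)
    if m is None:
        return None
    return {
        "num": int(m.group("num")),
        "description": m.group("desc"),
        "segment": int(m.group("segment")),
        "lifetime_hours": int(m.group("hours")),
        # The digits follow a "0x" prefix, so read them as hex.
        "lba": int(m.group("lba"), 16),
        "sense_key": int(m.group("sk"), 16),
        "asc": int(m.group("asc"), 16),
        "ascq": int(m.group("ascq"), 16),
    }


if __name__ == "__main__":
    entry = ("# 1  Background long   Failed in segment -->       5  3010"
             "  0x         2168629 [0x4 0x3e 0x3]")
    print(parse_failed_selftest(entry))
```

Sense key 0x4 is HARDWARE ERROR, and asc/ascq 0x3e/0x3 is the "logical
unit failed self-test" code mentioned above.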

Bad sectors observed when the disk was produced are placed
in the primary defect list. Any sectors found thereafter
are placed in the "grown" defect list. If you get sg3_utils
(version 1.11) from http://www.torque.net/sg, then try
'sginfo -G /dev/sdb' with various "-F<arg>" options.
A good disk should have an empty grown defect list;
I suspect yours does not. Next, try to read that sector
with something like "sg_dd if=/dev/sg1 skip=0x2168629 of=t
bs=512 count=1". Even if it reports an ECC error you can still
read the sector with the sg_read_long utility, although the contents
may not make much sense given that the disk is part of RAID-5.
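
As a sanity check on the numbers in that command, here is a short sketch
in Python that converts the LBA to decimal and to a byte offset, and
rebuilds the sg_dd command line. It assumes 512-byte sectors (matching
bs=512 above) and that the LBA is hex. The helper names are hypothetical;
the script only prints the command and does not touch the disk.

```python
SECTOR_SIZE = 512  # assumption: matches bs=512 in the sg_dd example


def lba_to_byte_offset(lba, sector_size=SECTOR_SIZE):
    """Byte offset of the first byte of the given logical block."""
    return lba * sector_size


def sg_dd_command(device, lba, out_file="t", sector_size=SECTOR_SIZE):
    """Build (but do not run) an sg_dd command that reads one sector."""
    return (f"sg_dd if={device} skip={lba:#x} of={out_file} "
            f"bs={sector_size} count=1")


if __name__ == "__main__":
    lba = 0x2168629
    print(lba)                      # 35030569
    print(lba_to_byte_offset(lba))  # 17935651328
    print(sg_dd_command("/dev/sg1", lba))
```

An offset of roughly 17.9 * 10^9 bytes puts the sector near the end of
the disk, if its capacity is about 18 GB.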

The ARRE bit in the "read write error recovery" mode page controls
whether sectors with recoverable errors are remapped. This
can be viewed with "sginfo -e /dev/sdb". When the ARRE bit is set,
only recoverable sectors (e.g. those the ECC can correct) are remapped.
The deceased sector should be added to the grown defect list.
Unrecoverable sectors need to be manually mapped out with the
REASSIGN BLOCKS SCSI command, and there is no utility in sg3_utils
to do that (yet). I'm sure scu can do that.

There could be further problems with that disk (e.g. at higher LBAs),
so you should keep running 'smartctl -t long /dev/sdb' until it gives
a clean result. That disk should also be monitored closely. RAID-5
should protect your data.

As for mapping that sector to a file in ext2, I'm not familiar with
that level of trickery.


** Your self-test output showed a formatting problem: the "0x" should
immediately precede the LBA. I have fixed that in CVS.

Please send me the results of the above tests.

Doug Gilbert

