List: linux-poweredge
Subject: Re: [Linux-PowerEdge] 2 predicted failure disks and RAID5
From: Stephen Dowdy <sdowdy () ucar ! edu>
Date: 2017-11-14 19:17:09
Message-ID: 10d8a69e-1879-3886-4e3b-88a16a1a0419 () ucar ! edu
On 11/14/2017 11:52 AM, Grzegorz Bakalarski wrote:
> Thanks for valuable input.
> Regarding punctured block: from fwtermlog I got several (not much) lines of type:
>
> 11/13/17 3:24:45: EVT#08603-11/13/17 3:24:45: 97=Puncturing bad block on PD 02(e0x20/s2) at 9ecd
That's bad. You have a punctured stripe.
> T35: maintainPdFailHistory=0 disablePuncturing=0 zeroBasedEnclEnumeration=1 disableBootCLI=1
This is an informational line indicating that the controller doesn't have the disablePuncturing config option set.
> All the same PD, the same bad block (different time)
>
> Is my raid useless?
No, it's good enough to recover what data you can before you rebuild it. However, you can't trust the data that uses the bad block. You'll get a read error from any object that maps to it.
Here's a good doc Dell put out:
https://www.dell.com/support/article/us/en/4/438291#2
"...If the data within a punctured stripe is accessed, errors will continue to be reported against the affected bad LBAs with no possible correction available. Eventually (this could be minutes, days, weeks, months, etc.), the Bad Block Management (BBM) Table will fill up, causing one or more drives to become flagged as predictive failure..."
> BTW: why do you think RAID-level migration to RAID-6 with 2 additional disks would be better than with one disk? I would keep the VD size the same.
I'm not talking about a migration; I'm talking about a complete WIPE of what you have, and a recreation from scratch. At this point, you can recover what you can to a staging location, rebuild, then restore. Keep track of any data that produces I/O errors, because it's going to have a corrupted block at the punctured block address. This could (if you're lucky) be in unallocated space. It could also be in filesystem structures and lead to widescale corruption of the filesystem.
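One way to build that list of damaged data is to walk the (read-only-mounted) filesystem, read every file end to end, and record the ones that throw I/O errors. A minimal Python sketch, not from the original message; the /mnt/recover mount point is a placeholder:

```python
import os

def find_unreadable(root):
    """Walk 'root', read every regular file end to end, and return the
    paths that raise I/O errors -- candidates for data that maps onto
    the punctured LBA."""
    bad = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            try:
                with open(path, "rb") as fh:
                    while fh.read(1 << 20):  # read in 1 MiB chunks
                        pass
            except OSError as exc:
                bad.append((path, exc.strerror))
    return bad

if __name__ == "__main__":
    # Placeholder mount point for the read-only-mounted damaged volume.
    for path, err in find_unreadable("/mnt/recover"):
        print(f"{path}: {err}")
```

Each file is read exactly once (no retries), so this won't hammer a dying drive the way a retry-happy backup tool would.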
I would mount it all READ-ONLY and do a file-level dump (not a 'dd' or anything like that, which would migrate corrupted filesystem structures). I typically 'rsync' data to another machine. You don't want any backup tool that does infinite retries, as the repeated reads of the bad block will likely result in another disk failure.
> Anyway, will migration to RAID-6 fail with this "awful puncturing"???
RAID-6, with its 2 parity drives, is going to lessen the likelihood of a puncture. While you're rebuilding a RAID-5, any unrecoverable bad-block event on any of the "good" drives during the rebuild will result in a puncture; with RAID-6, you still have a second parity to cope with an uncorrectable error.
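To make the parity argument concrete, here is a toy illustration (my own sketch, not the controller's actual code) of RAID-5's single XOR parity: per stripe it can reconstruct exactly one missing chunk, so a second unreadable chunk during a rebuild leaves no equation to solve -- that's the puncture. RAID-6 adds a second, independent Reed-Solomon syndrome precisely so one more failure per stripe is survivable.

```python
from functools import reduce

def xor_parity(chunks):
    """RAID-5 stores a single XOR parity chunk per stripe."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*chunks))

def rebuild_one(lost_index, chunks, parity):
    """Recover one lost chunk by XOR-ing the parity with the survivors.
    If a *second* chunk in the same stripe is also unreadable (a UCE
    during rebuild), there is nothing left to XOR against -- punctured."""
    survivors = [c for i, c in enumerate(chunks) if i != lost_index]
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(parity, *survivors))

stripe = [b"AAAA", b"BBBB", b"CCCC"]  # three data chunks in one stripe
parity = xor_parity(stripe)

# One failed drive: chunk 1 is rebuilt perfectly from parity + survivors.
assert rebuild_one(1, stripe, parity) == b"BBBB"
```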
The above is especially true of some of the less reliable Seagate drives from past years. You can't count on them not throwing UCEs during a rebuild (or before you get the replacement drive installed), thereby puncturing the RAID. :-(
--stephen
--
Stephen Dowdy - Systems Administrator - NCAR/RAL
303.497.2869 - sdowdy@ucar.edu - http://www.ral.ucar.edu/~sdowdy/
_______________________________________________
Linux-PowerEdge mailing list
Linux-PowerEdge@dell.com
https://lists.us.dell.com/mailman/listinfo/linux-poweredge