List:       zfs-discuss
Subject:    Re: [zfs-discuss] Two disks giving errors in a raidz pool, advice needed
From:       Manuel Ryan <ryan () shamu ! ch>
Date:       2012-04-25 17:59:16
Message-ID: CAHZtd=WsMkjteykpqyprgdqroZNJjBVvKX08B_aKMcDkMhWPvQ () mail ! gmail ! com

Hey again, I'm back with an update on my situation.

I tried taking out the faulty disk 5 and replacing it with a new disk, but
the pool then showed up as FAULTED. So I plugged the failing disk back in,
kept the new disk in the machine as well, and ran a zpool replace.
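For reference, the replace was of the usual form; "tank" and the disk names
below are just placeholders, not my actual pool or device names:

    zpool replace tank <old-disk> <new-disk>   # resilver onto the new disk starts automatically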

After the new disk resilvered completely (it took around 9 hours), zpool
status still shows the disk as "replacing", but nothing is actually
happening (iostat shows no disk activity). If I try to remove the faulty
drive, the pool shows up as DEGRADED and is still "replacing" the old
broken disk.
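If it helps to picture it, my understanding is that once the resilver really
has completed, the old half of the "replacing" vdev can be detached by hand,
roughly like this ("tank" and the device name again being placeholders, and
I haven't actually tried this yet):

    zpool status -v tank           # confirm the resilver has completed
    zpool detach tank <old-disk>   # drop the old half of the replacing vdev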

The overall state of the pool seems to be getting worse: the other failing
disk is giving write errors again, and the pool had 28k corrupted files
(60k checksum errors on the raidz1 vdev and 28k checksum errors on the pool
itself).
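Those counts come from zpool status; the list of affected files is just the
verbose form, assuming "tank" as the pool name:

    zpool status -v tank    # lists permanent errors and the files they affect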

After seeing that, I ran a zpool clear to try to help the replace process
finish. After this, disk 1 was UNAVAIL due to too many IO errors and the
pool was DEGRADED.
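The clear was the plain pool-wide form ("tank" once more standing in for
the real pool name):

    zpool clear tank          # reset error counters for the whole pool
    zpool clear tank <disk>   # or only for one device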

I rebooted the machine, and the pool is now back ONLINE, with disk 5 still
saying "replacing" and 0 errors except the permanent ones.

I don't really know what to try next :-/ Any ideas?
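For what it's worth, if the replace ever completes I still plan to scrub and
re-check as Dan suggested last time, roughly:

    zpool scrub tank
    zpool status -v tank    # watch progress and any remaining checksum errors

("tank" here again being a placeholder, not the actual pool name.)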



On Mon, Apr 23, 2012 at 7:35 AM, Daniel Carosone <dan@geek.com.au> wrote:

> On Mon, Apr 23, 2012 at 05:48:16AM +0200, Manuel Ryan wrote:
> > After a reboot of the machine, I have no more write errors on disk 2
> (only
> > 4 checksum, not growing), I was able to access data which I previously
> > couldn't and now only the checksum errors on disk 5 are growing.
>
> Well, that's good, but what changed?   If it was just a reboot and
> perhaps power-cycle of the disks, I don't think you've solved much in
> the long term..
>
> > Fortunately, I was able to recover all important data in those conditions
> > (yeah !),
>
> .. though that's clearly the most important thing!
>
> If you're down to just checksum errors now, then run a scrub and see
> if they can all be repaired, before replacing the disk.  If you
> haven't been able to get a scrub complete, then either:
>  * delete unimportant / rescued data, until none of the problem
>   sectors are referenced any longer, or
>  * "replace" the disk like I suggested last time, with a copy under
>   zfs' nose and switch
>
> > And since I can live with loosing the pool now, I'll gamble away and
> > replace drive 5 tomorrow and if that fails i'll just destroy the pool,
> > replace the 2 physical disks and build a new one (maybe raidz2 this time
> :))
>
> You know what?  If you're prepared to do that in the worst of
> circumstances, it would be a very good idea to do that under the best
> of circumstances.  If you can, just rebuild it raidz2 and be happier
> next time something flaky happens with this hardware.
>
> > I'll try to leave all 6 original disks in the machine while replacing,
> > maybe zfs will be smart enough to use the 6 drives to build the
> replacement
> > disk ?
>
> I don't think it will.. others who know the code, feel free to comment
> otherwise.
>
> If you've got the physical space for the extra disk, why not keep it
> there and build the pool raidz2 with the same capacity?
>
> > It's a miracle that zpool still shows disk 5 as "ONLINE", here's a SMART
> > dump of disk 5 (1265 Current_Pending_Sector, ouch)
>
> That's all indicative of read errors. Note that your reallocated
> sector count on that disk is still low, so most of those will probably
> clear when overwritten and given a chance to re-map.
>
> If these all appeared suddenly, clearly the disk has developed a
> problem. Normally, they appear gradually as head sensitivity
> diminishes.
>
> How often do you normally run a scrub, before this happened?  It's
> possible they were accumulating for a while but went undetected for
> lack of read attempts to the disk.  Scrub more often!
>
> --
> Dan.
>
>


_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

