
List:       evms-devel
Subject:    Re: [Evms-devel] Raid 1 recovery failure...
From:       Tony Smith <tony () perforce ! com>
Date:       2004-04-29 15:35:12
Message-ID: 200404291635.12006.tony () perforce ! com

Hi Kevin,

> On Wednesday 28 April 2004 12:18 pm, Tony Smith wrote:
> > Folks,
> >
> > I just dug myself out of a nasty hole with an EVMS recovery and I think
> > there may be some lessons to be learned from it so here is the sorry
> > tale.
> >
> > First some background: I'm running EVMS 2.3.0 on a stock 2.4.25 kernel.
> >
> > 4 x IDE drives: hda,hdb,hde,hdf
> >
> > hda2 and hde2 are in md/md0 - a RAID1 mirror
> > hdb2 and hdf2 are in md/md1 - a RAID1 mirror
> >
> > md2 is a RAID0 stripe of md0 and md1
> >
> > So far so good.
> >
> > So this system has been ticking along nicely for about a year. On Monday
> > it fell down, hard. I just about figured out why and the story goes like
> > this:
> >
> > hde starts to degrade, but the bad blocks are in areas of the disk
> > holding old, mouldy data that we haven't looked at for ages. Since the
> > data is never read, no errors are logged and EVMS thinks it's a good
> > disk.
> >
> > Months later (I guess) hda2 clocks an error when being updated. EVMS
> > rightly takes it out of the array and marks it as faulty.
> >
> > I then re-added it to the array to see if it would still fail (OK stupid,
> > but bear with me, adding a fresh disk would not have helped).
> >
> > EVMS starts to resync the mirrors, and during this process, the bad
> > blocks on hde get read. Whoops! So now, there are no perfect copies of
> > the data anywhere.
> >
> > hda was actually not terminally damaged and ideally I would have liked
> > EVMS to have let me force the dodgy disk back into the degraded array as
> > the master - having removed hde - so I could then insert one replacement
> > disk, sync up as best I could, and then insert another.
> >
> > Unfortunately, by this time evms_activate wasn't terminating (as far as I
> > could tell it never would), so I ended up doing a full rebuild and restored
> > my backup.
>
> Sorry it took so much trouble to get up and running again. :(  But thanks
> for the feedback.

Not your fault - stuff happens and I learned a lot in the process. Don't 
worry, I still love EVMS!

> > If you guys could add something to allow sysadmins in this situation to
> > escape with only damage to their pride it would be much appreciated!
>
> EVMS currently allows a "forced rebuild" on a RAID-5 that has more than one
> "stale" child. It might be possible to do something similar for RAID-1,
> where all child objects have gone bad or stale, but one or more are still
> physically available and operational. We'll discuss it and see if something
> can be added.

That sounds good - thanks!

> > Personally, I'm planning to run a "dd if=/dev/hd? of=/dev/null" on a
> > periodic basis as a result of this failure - to generate the errors
> > nearer to the time they occur. If there's a better way, I'd love to hear
> > it.
>
> Obviously these kinds of "hidden" errors are going to pop up at the wrong
> time. Doing a periodic "dd" from the various disks is a decent idea for
> detecting these errors earlier rather than later. I'm not sure if there's
> anything better than "dd" in this case.

OK, I'll proceed with that plan then. Thanks for confirming that I'm on the 
right track!
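
For anyone who would rather script the scan than run "dd" by hand, here is a 
minimal sketch of the same idea in Python. It is only an illustration and 
not anything EVMS provides: the scan() function, its chunk parameter and the 
1 MiB read size are my own assumptions. It reads a device from start to end, 
throws the data away, and records the offset of every chunk that fails to 
read. It also carries on after an error rather than stopping at the first 
bad chunk.

```python
import os

CHUNK = 1024 * 1024  # read size in bytes; an arbitrary choice for this sketch


def scan(path, chunk=CHUNK):
    """Read `path` from start to end, discarding the data.

    Returns (bytes_read, bad_offsets), where bad_offsets holds the start
    offset of every chunk whose read raised an OSError.
    """
    bad_offsets = []
    total = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        # Seeking to the end gives the size of a regular file or block device.
        size = os.lseek(fd, 0, os.SEEK_END)
        offset = 0
        while offset < size:
            want = min(chunk, size - offset)
            try:
                data = os.pread(fd, want, offset)
            except OSError:
                # Note the unreadable chunk and skip past it.
                bad_offsets.append(offset)
                offset += want
                continue
            if not data:
                break  # unexpected end of device
            total += len(data)
            offset += len(data)
    finally:
        os.close(fd)
    return total, bad_offsets
```

Calling scan("/dev/hda") from a periodic job (as root) and raising an alert 
whenever the returned list of bad offsets is non-empty would give roughly 
the early warning I am after.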

Tony

