List:       linux-raid
Subject:    Re: making raid5-root work
From:       David Mansfield <david () cobite ! com>
Date:       1998-08-31 22:12:21


Maybe we should outline some cases and see in what cases we CAN
automatically survive a reboot, and in which cases we cannot.  Maybe we
can move some of the latter to the former, although I think we all agree
we can't win in every case.

In all cases a disk fails:

1) the BIOS default boot disk fails and the system does not require
rebooting.  We can optionally let it run, and should be ok if we can take
the risk of operating in a degraded environment, or if we have a hot
spare.

When the disk is finally replaced:

1.1) it is replaced by an idiot who can't type anything (i.e. an IBM tech
service rep who knows how to work his screwdriver), and reboots with a
blank disk in the "pole position."  How can we most easily get the system
to boot and reconstruct?  If we have LILO installed on all disks, and a
copy of the kernel on all disks:

   A. A Novell expert I know swears up and down that if the first disk in
      the system is blank, and the second has a DOS partition, that DOS
      will boot.  If this is true, then the system should automatically
      boot with LILO as well... we need to confirm this.

   B. We reconfigure the boot device in either the motherboard BIOS or
      the SCSI adapter BIOS and the system boots.

Automatic reconstruction may or may not even be desirable, but that is an
administrator's choice.  In the current situation, we could have a fairly
functional rescue partition configured on each disk as well, which would
contain rebuild tools etc.

1.2) it is replaced by an expert.  ...

2) the BIOS default boot disk fails and the system reboots immediately,
without prompting (imaginable), or we decide in case 1) to reboot without
having replaced the drive.

2.1) the disk is still detected.  Still more cases:

2.1.1) the MBR is read, but is either corrupt, or the blocks that LILO
       wants are corrupt.  I think we are SCREWED!  Need to change the
       boot sequence.

2.1.2) nothing can be read.  Will the BIOS go to the next disk, having
       failed to read an MBR?  If yes, we are in 1.1 above, otherwise
       2.1.1.
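
To keep the 2.1 sub-cases straight, here is a rough sketch of them as a
decision function.  The function name and its three flags are made up for
illustration, and whether a BIOS really falls through to the next disk is
exactly the open question above, so it is just a parameter here.

```python
def classify_detected_disk(mbr_readable, boot_blocks_ok, bios_tries_next_disk):
    """Map the 2.1 sub-cases (failed disk still detected) to an outcome.

    All three flags are hypothetical inputs; nothing here probes hardware.
    """
    if mbr_readable:
        if boot_blocks_ok:
            return "boots normally"
        # 2.1.1: the MBR was read but it, or the blocks LILO wants, are corrupt.
        return "2.1.1: change the boot sequence by hand"
    # 2.1.2: nothing could be read from the disk at all.
    if bios_tries_next_disk:
        return "as 1.1: boots from LILO on the next disk"
    return "as 2.1.1: change the boot sequence by hand"


for flags in [(True, False, False), (False, False, True), (False, False, False)]:
    print(flags, "->", classify_detected_disk(*flags))
```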

2.2) the disk is not detected.  In this case, our LILO on the second disk
will definitely load, but wait! It will probably stop with 'LI' because it
will be looking for its boot.map file on BIOS disk 0x81, while the
surviving disk has now become 0x80!  Aside from this we would have been
fine... except that, as someone else pointed out, the superblocks will
also possibly be wrong, thinking /dev/sda1 (or whatever) is the first disk
in a sequence, when in fact /dev/sda1 is the second disk, and the first
disk is missing! This could be handled in the raid driver.  The kernel
says: 'whoa, this is the wrong disk, but wait, the right disk is missing!
Let's see if I can find the right RAID disks 2, 3, 4, 5.  Yes, they are
here, things are ok. Let's go'
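
What the kernel would be doing in that little monologue is matching
members by what their superblocks say rather than by device name.  A
minimal sketch of the idea, assuming a simplified superblock that records
only an array id and a 1-based slot number (this is not the real md
superblock layout, and the function name is made up):

```python
def assemble_by_superblock(superblocks, array_id, n_slots):
    """Place devices into array slots using superblock contents only.

    superblocks: dict of device name -> (array id, slot number), a
    hypothetical stand-in for what would be read from each disk.
    Returns (slot -> device, list of missing slots, can the array start).
    """
    slots = {}
    for device, (sb_array, sb_slot) in superblocks.items():
        if sb_array == array_id:
            slots[sb_slot] = device  # trust the superblock, not the name
    missing = [s for s in range(1, n_slots + 1) if s not in slots]
    can_start = len(missing) <= 1  # RAID5 can run with one member missing
    return slots, missing, can_start


# Disk 1 is gone, so the old second disk now shows up as /dev/sda1.
found = {
    "/dev/sda1": ("md0", 2),
    "/dev/sdb1": ("md0", 3),
    "/dev/sdc1": ("md0", 4),
    "/dev/sdd1": ("md0", 5),
}
slots, missing, can_start = assemble_by_superblock(found, "md0", 5)
print(slots, missing, can_start)
```

With the names shifted down by one, the array still comes up degraded
with slot 1 missing, which is the 'things are ok' outcome.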

It looks like 2.2 has two issues: getting LILO to boot, and getting the
RAID to start with the devices "wrong."
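
The LILO half of this comes down to BIOS drive numbers shifting when a
disk drops out.  A toy model, assuming the BIOS simply hands out numbers
from 0x80 upward in detection order (a simplification, and the disk names
are placeholders):

```python
def bios_drive_numbers(detected_disks):
    """Assign BIOS drive numbers 0x80, 0x81, ... in detection order."""
    return {disk: 0x80 + i for i, disk in enumerate(detected_disks)}


healthy = bios_drive_numbers(["disk1", "disk2", "disk3"])
degraded = bios_drive_numbers(["disk2", "disk3"])  # disk1 not detected

# LILO on disk2 was installed while disk2 was 0x81, so that is the number
# it asks the BIOS for when it goes looking for its map file.
recorded = healthy["disk2"]
actual = degraded["disk2"]
print(hex(recorded), hex(actual), "mismatch" if recorded != actual else "ok")
```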

David


-- 
/==============================\
| David Mansfield              |
| david@cobite.com             |
|                              |
| (212) 536-9115               |
\==============================/
