'Re: [Evms-devel] EVMS: RAID resizing; extending to RAID level'

List:       evms-devel
Subject:    Re: [Evms-devel] EVMS: RAID resizing; extending to RAID level
From:       Jer Jackson <jerj () coplanar ! net>
Date:       2003-05-31 16:10:32

I'm starting to feel like I'm ranting about this, but I think I have 
a very clean framework for this.  I have posted before about moving from
linux md RAID1 to dm (device-mapper) RAID1.  I will try to explain:

Think about a 2-disk md RAID1 running degraded... you can add a new
disk, run raidhotadd (or use EVMS), and after the resync is completely
finished, you can pull the first disk out.  This happens online, and is
fault tolerant (i.e. robust over power failure).
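
To make the sequence concrete, here is a toy Python model of it (hot-add,
resync while writes keep arriving, then remove the old member).  It is only
a sketch: the Mirror class and its method names are made up for
illustration and are not md, dm or EVMS interfaces, and power-failure
recovery is not modelled at all.

```python
class Mirror:
    """Toy RAID1: writes go to every member, reads come from an in-sync one."""

    def __init__(self, first_member):
        self.members = [first_member]
        self.in_sync = [True]

    def read(self, block):
        for member, ok in zip(self.members, self.in_sync):
            if ok:
                return member[block]
        raise IOError("no in-sync member")

    def write(self, block, value):
        # Writes are duplicated to all members, including one still syncing.
        for member in self.members:
            member[block] = value

    def hot_add(self, new_member):
        self.members.append(new_member)
        self.in_sync.append(False)

    def resync(self):
        """Generator: copy one block per step so writes can interleave."""
        source = next(m for m, ok in zip(self.members, self.in_sync) if ok)
        target = self.members[-1]
        for block in range(len(source)):
            target[block] = source[block]
            yield block
        self.in_sync[-1] = True

    def remove(self, index):
        remaining = [ok for i, ok in enumerate(self.in_sync) if i != index]
        if not any(remaining):
            raise RuntimeError("refusing to remove the last in-sync member")
        del self.members[index]
        del self.in_sync[index]


old_disk = list("abcd")
new_disk = [None] * 4
mirror = Mirror(old_disk)
mirror.hot_add(new_disk)

sync = mirror.resync()
next(sync)
next(sync)                 # blocks 0 and 1 have been copied
mirror.write(0, "Y")       # write to an already-copied block
mirror.write(3, "Z")       # write to a block not yet copied
for _ in sync:             # let the resync run to completion
    pass

mirror.remove(0)           # pull the first disk; new_disk carries on alone
```

Because every write is duplicated to both members, the new disk ends up
with the current data no matter where the resync pointer was when the
write arrived.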

The above could also be done with a dm RAID1 implementation, and perform
as the above example.  With dm however, arbitrary mappings below the
RAID1 can be achieved, since the metadata handling is not tied to the
RAID1.  This is very powerful, since the basic RAID1 dm primitive is
part of all other online, fault tolerant resizing schemes (I acknowledge
that someone may find an exception if they think hard enough... I don't
think it hurts the argument though... maybe hardware raid with battery
backed ram?)

Now consider how newer versions of badblocks can operate in read-write
mode on a block device nondestructively.  Badblocks copies the first N
blocks of the block device elsewhere (pretend it's somewhere on the disk
for sake of argument), does its thing, then copies them back.  That is
neither online nor fault tolerant.  But the BBR plugin takes care of
keeping track of displaced spots in an underlying storage object.  If
things are done in the proper order and journaled (i.e. metadata flushed
after each step and before the next starts), what badblocks does can be
done online and fault tolerant.  Perhaps the DriveLink feature would be
easier than BBR, however.
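
A rough Python sketch of that ordering, assuming a persistent remap table
in the spirit of BBR.  RemappedDevice and its methods are hypothetical
names, not the BBR plugin's interface, and writes that arrive between the
copy and the remap commit are not handled here (in the scheme described
below, the raid1 primitive covers that window).

```python
class RemappedDevice:
    """Toy block device with a spare area and a persistent remap table."""

    def __init__(self, blocks, spare_len):
        self.disk = list(blocks) + [None] * spare_len
        self.spare_start = len(blocks)
        self.remap = {}    # stands in for metadata flushed to disk

    def _physical(self, block):
        return self.remap.get(block, block)

    def read(self, block):
        return self.disk[self._physical(block)]

    def write(self, block, value):
        self.disk[self._physical(block)] = value

    def displace(self, first, count):
        # Step 1: copy the data to the spare area.
        for i in range(count):
            self.disk[self.spare_start + i] = self.disk[first + i]
        # Step 2: only after the copy is complete, commit the remapping.
        for i in range(count):
            self.remap[first + i] = self.spare_start + i

    def restore(self, first, count):
        # Step 3: copy the (possibly updated) data back to its home.
        for i in range(count):
            self.disk[first + i] = self.disk[self.spare_start + i]
        # Step 4: only then drop the remapping.
        for i in range(count):
            del self.remap[first + i]


dev = RemappedDevice(list("abcdef"), spare_len=2)
dev.displace(0, 2)
dev.disk[0] = dev.disk[1] = "X"    # destructive test pattern on the raw area
during = [dev.read(b) for b in range(6)]
dev.write(1, "B")                  # an online write lands in the spare area
dev.restore(0, 2)
after = [dev.read(b) for b in range(6)]
```

The point of the ordering is that at every step boundary either the
original location or the spare location holds valid data and the remap
table says which one to use.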

Using the dm raid1-like primitive combined with other techniques by
manipulating the dm mappings, many scenarios are easily supported:
(I'm making this up as I go, and I'm also making some assumptions about
the atomicity of changing dm mappings, which I am mostly ignorant about,
so let me know if you see a fatal flaw in my reasoning.)

Reconfiguring raid5 <--> raid1 or changing # disks in raid5:

Preamble: I will say things can be done in multiples of the stripe size,
called a chunk.
-find some temporary work areas on some disk(s) of size nr_raid_disks *
chunksize + parity overhead (if used).  Ideally this can be from unused
space at the end of the same disks, so no extra disks are required.
-create a small (1 or more chunks) raidX array out of the work area. 
This is so that temporarily moved data has the same redundancy as it had
originally.  It must have persistent metadata.
-insert dm raid1 at offset 0 of the target raidX array.  Raid1 "array"
is now degraded because it only has one member.  This is like inserting
the snapshot feature.
-Now add the work area raid as the second member.  raid1 syncs this
member, duplicating any new writes to both members, taking reads from
the first member.
-raid1 sync completes.
-create a BBR (persistent remapping) entry that maps this area of the
raid target's produced object to the raid1.  Now the remapping is
persistent.
-remove the target raidX chunk from the raid1 object.

Up to now, all that has happened is that there is an unused space at the
beginning of the raid object.  It has been created online and is fault
tolerant.  Next, the free space is used to put the data back in the new
raid format.

-shrink the target raid from the beginning leaving a stripe at the
beginning of each member object.
-create a raid array of the new type using the empty space at the
beginning of the member objects.
-add this raid array to the raid1 array.
-the resync completes. one chunk of the target array has now been
converted to the new raid level.
-The BBR mapping must be moved from the raid1 to the new format raid
area.  I believe this will appear atomic.
-the raid1 is destroyed.

Now the state is that there is a small raid of the new format at the
beginning of each member object, and the old target's format uses the
remaining space, its contents beginning at an offset of the produced
object.

-repeating the process moves the split between new and old format until
the end of the produced object is reached.  Then the array conversion is
finished, the spare space and temporary raid region can be
destroyed, and the raid array metadata updated.
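
The whole walk can be sketched as a toy Python model.  It tracks only
where each logical block is read from at each step; the raid levels,
parity and concurrent writes are left out, and the assumption that the
mapping change is atomic is simply built in.  ConvertingVolume and all its
names are invented for illustration.

```python
class ConvertingVolume:
    """Toy model of moving the new/old format split one chunk at a time."""

    def __init__(self, blocks, chunk):
        self.old = list(blocks)   # old-format array contents
        self.new = []             # new-format array, grows from offset 0
        self.work = []            # temporary work-area array
        self.chunk = chunk
        self.split = 0            # blocks below the split live in self.new
        self.remapped = False     # persistent remap: current chunk -> work

    def read(self, block):
        if block < self.split:
            return self.new[block]
        if self.remapped and block < self.split + self.chunk:
            return self.work[block - self.split]
        return self.old[block]

    def contents(self):
        return [self.read(b) for b in range(len(self.old))]

    def convert_chunk(self):
        lo = self.split
        hi = min(lo + self.chunk, len(self.old))
        self.work = self.old[lo:hi]            # raid1 syncs the work area
        yield "synced to work area"
        self.remapped = True                   # remap committed
        yield "remapped to work area"
        self.old[lo:hi] = [None] * (hi - lo)   # old-format region freed
        yield "old region freed"
        self.new.extend(self.work)             # raid1 syncs new-format array
        yield "new format written"
        self.split, self.remapped = hi, False  # mapping moved in one step
        yield "split advanced"

    def convert_all(self):
        while self.split < len(self.old):
            yield from self.convert_chunk()


data = list(range(10))
vol = ConvertingVolume(data, chunk=4)
consistent = True
for step in vol.convert_all():
    # What a reader of the produced object sees must never change.
    consistent = consistent and vol.contents() == data
```

If the loop is interrupted at any of the yield points, the state that
would be on disk (split, remap flag, the three areas) still resolves every
logical block to valid data, which is the property the scheme relies on.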

I'm sorry I don't have time to really think it through.  It's almost the
opposite approach... use something that is robust and online, then add
the bits needed to reconfigure/resize the raid. 

Cheers,

Jeremy

On Thu, 2003-05-29 at 23:23, Scott Smyth wrote: 
> Hi Kevin;
> 
> Thanks.  As you know, we already touched base with Mike on this
> issue of adding in more features surrounding RAID reconfiguration.
> 
> Kevin Corry wrote:
> <SNIP>
> > We'd love to have your help on developing these items! Mike Tran can bring you 
> > up to speed on the RAID-resize work. Our current plans are for resizing the 
> > various RAID levels (linear, 0, 1, and 4/5), and for simple reconfiguration 
> > of RAID-1 (changing n-way mirrors to m-way mirrors). We haven't really given 
> > much thought to a RAID-1-to-RAID-5 conversion tool, so we could definitely 
> > use your help if you are interested in that feature.
> 
> For raid level reconfiguration, we believe there are three steps: 1)
> adopt an existing idea (see below); and 2) make it robust over power
> loss, etc; and 3) make it happen online.
> 
> Essentially, we believe there is one existing program to draw from with
> raidreconf (in 1.00 raidtools) and another in the plans with mdadm
> (Neil Brown's new "raidtools").  Everyone could use these separately.
> It makes sense to merge them with EVMS though given the roadmap
> to put RAID 1 resize in EVMS.  Thus, it makes sense to put raid level
> reconfiguration code in the same location physically where raid 1
> resize will exist (as the same raid level resize is a subset of the
> raid level reconfiguration we have in mind).
> 
> One of the things that is NOT in raidreconf (and we do not know about
> the plans for mdadm) is any way to deal with power loss or any other
> loss of contact with the devices one is going from or to during
> the reconfiguration.  One of the reasons we would like to put
> raid level reconfiguration in evms is to utilize existing functionality
> (ie., snapshots) to protect against potential data loss during
> reconfiguration errors (power or network loss -- if doing network
> block devices).
> 
> The other reason to do this is just to offer some more functionality
> back into evms since we draw on it as well.
> 
> > As for ENBD, there are a couple of choices. A new plugin could be written to 
> > exclusively discover and manage ENBD devices, or the current disk manager 
> > plugin could be enhanced to add the necessary support. You might be 
> > interested in looking through the code in evms-2.0.1/plugins/disk/ as a 
> > starting point. The latest version is available in our CVS tree 
> > (http://sourceforge.net/cvs/?group_id=25076) in the "evms2" module. Since 
> > none of us have much experience working with ENBD, perhaps you could describe 
> > some of the extra features/functionality you are interested in, and we can 
> > continue the discussion from there.
> 
> We have the evms2 CVS tree in hand (or several hands in this case).
> ENBD could be used separately again as we could alter /etc/evms.conf
> to look at /dev/nbd* devices and make them available.  One could
> then layer evms RAID over ENBD devices and have a synchronous mirror
> or RAID 4/5 across three to four systems using Gb/s depending on
> what ENBD was importing.  Essentially, you can do this now without
> evms knowing anything about ENBD.  What would be nice is to
> start integrating stability/recovery features into ENBD usage in
> such cases within evms (i.e., a network global "spare" system
> and fallback locations (reserve snapshot volumes, if you will)
> in case the network fails but you can accept writes for a while).
> Again, it could all be separate, but we would like to give some
> things back to evms that might make sense for others.
> 
> Undoubtedly, the stability of enbd inside evms starts to look at
> HA ideas already in evms.  enbd is network based (and not really
> clustering) but definitely network aggregation which requires HA
> functionality.
> 
> We are still debating how much enbd work makes sense inside evms.
> As we lay out the ideas, it would be nice to get feedback so others
> could use it as well.
> 
> thanks, Scott


