List:       linux-ha
Subject:    Re: online disk replicator (draft 1)
From:       "Stephen C. Tweedie" <sct () redhat ! com>
Date:       2000-03-06 22:22:07

Hi,

On Thu, 2 Mar 2000 08:03:38 -0500, jetienne@arobas.net said:

> ok this is the first draft of odr (aka drbdv2). this doc is about 
> the protocol part. a second version of this document is in progress
> and will follow shortly. draftv2 will try to solve known bugs and
> allow multiple writers at the same time so wait a bit if you want 
> to seriously work on it.

OK, I'll wait for that before commenting in detail. 

For now, however, I'll just make a few observations about the overall
organisation of this thing.  Basically, you probably want to assume that
it doesn't exist on its own.  drbd doesn't even _begin_ to address the
Quorum problem, and so it simply isn't useful without some other cluster
software present.  If you make that assumption, you can assume that some
other software is doing things like failing over applications.  And if
you assume _that_, then you had better assume that somebody else will be
telling you where to fail your drbd master over to, and when!

Secondly, there is a cluster recovery problem: synchronisation of the
various cooperating daemons on the separate nodes round the cluster.  If
there is a cluster infrastructure in place, then you can assume that you
will be given events of the form

   * Cluster transition has begun, stop serving new requests
   * Cluster transition complete, here's the new membership list, begin
     recovery now
   * Cluster recovery complete, resume serving new requests

It's a little more complex than that, but not much.
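
To make the three events above concrete, here is a minimal sketch of how a replication daemon might react to them. Everything in it is an assumption for illustration: the class name ReplicatorNode, the state names and the callback names are hypothetical and are not part of drbd or of any existing cluster manager.

```python
from enum import Enum, auto


class State(Enum):
    SERVING = auto()     # normal operation, new requests accepted
    QUIESCED = auto()    # transition in progress, new requests refused
    RECOVERING = auto()  # membership known, resynchronising


class ReplicatorNode:
    """Hypothetical daemon driven purely by cluster-manager events."""

    def __init__(self, name):
        self.name = name
        self.state = State.SERVING
        self.members = []

    def on_transition_begin(self):
        # "Cluster transition has begun, stop serving new requests"
        self.state = State.QUIESCED

    def on_transition_complete(self, members):
        # "Cluster transition complete, here's the new membership list"
        if self.state is not State.QUIESCED:
            raise RuntimeError("membership delivered before transition began")
        self.members = list(members)
        self.state = State.RECOVERING

    def on_recovery_complete(self):
        # "Cluster recovery complete, resume serving new requests"
        if self.state is not State.RECOVERING:
            raise RuntimeError("recovery finished without a recovery phase")
        self.state = State.SERVING

    def can_serve(self):
        return self.state is State.SERVING
```

The point of the sketch is that the daemon never decides for itself when a transition happens or who is in the cluster; it only reacts to what it is told.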

> - life monitor: the software/protocol which monitors the node's life in
>   the pool. if the master dies, a takeover is triggered.
> - takeover delay: the delay between the master's failure and the moment
>   the new master becomes active.
> - detection delay: the delay between the master's failure and the beginning
>   of the takeover.

This is all somebody else's problem.  Please don't try to solve it
inside drbd, because drbd isn't going to be particularly useful unless
other software can be involved in the failover too.  (E.g. you want to
fail over the drbd master, then remount the journaled filesystem on the
new node, then start up the database on the new node, then start your
dynamic CGI serving.)
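
As a rough sketch of why the ordering belongs to an external coordinator, the failover above can be modelled as an ordered list of steps that stops at the first failure. The names run_failover, make_step and FAILOVER_ORDER are hypothetical, and the step actions are stubs; a real cluster manager would invoke the actual services.

```python
# Order taken from the example above; drbd is only the first step.
FAILOVER_ORDER = [
    "promote drbd master",
    "mount journaled filesystem",
    "start database",
    "start CGI serving",
]


def run_failover(steps):
    """Run (name, action) pairs in order.

    Returns (completed_names, failed_name); failed_name is None when
    every step succeeded. Later steps are never attempted after a failure.
    """
    completed = []
    for name, action in steps:
        if not action():
            return completed, name
        completed.append(name)
    return completed, None


def make_step(log, name, ok=True):
    """Build a stub action that records its name and reports ok."""
    def action():
        log.append(name)
        return ok
    return action
```

If the database step fails, the CGI step is never attempted, which is a decision drbd alone is in no position to make.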

--Stephen

------------------------------------------------------------------------------
Linux HA Web Site:
  http://linux-ha.org/
Linux HA HOWTO:
  http://metalab.unc.edu/pub/Linux/ALPHA/linux-ha/High-Availability-HOWTO.html
------------------------------------------------------------------------------
