
List:       linux-ha-dev
Subject:    Re: [Linux-ha-dev] RFC,
From:       Lars Marowsky-Bree <lmb () suse ! de>
Date:       2006-05-23 11:29:39
Message-ID: 20060523112939.GD31039 () marowsky-bree ! de

On 2006-05-23T12:31:56, Lars Ellenberg <Lars.Ellenberg@linbit.com> wrote:

> LinBit hired another developer, Rasto, whose first task was to
> implement the heartbeat plugin for resource-level fencing of DRBD
> resources.

Cool!

> Our intention would be that this eventually becomes part of the
> heartbeat CVS / release as soon as deemed appropriate.  (Alan?)

Sure, send a patch for review.

> Please review and comment.

I'm not entirely sure I understand what you're doing, but I'll go ask
questions about that below ;-) First, the meta-question: Why did you not
add this to the master/slave-aware resource agent for drbd we have in
heartbeat now?

I _think_ that that RA has enough information available for this, as we
explicitly promote/demote it and provide it with notifications about
what we're doing to its peer.

(ie, the secondary should assume it is "outdated" and refuse to promote
if the primary hasn't been stopped yet and so on.)

Or is there any fundamental problem with this approach and our
master-slave model which makes more external help necessary? I'd welcome
a review of our current drbd agent wrt that.
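
To sketch what I mean (purely illustrative, not the RA's actual code: the
peer-state flag is made up, only the drbdadm call and the usual OCF
conventions are real), the promote path could be as simple as:

    drbd_promote() {
        # If we never got a notification telling us the old primary has
        # been stopped/demoted, assume our data may be outdated and
        # refuse to become primary.
        if [ "$peer_confirmed_down" != "yes" ]; then   # hypothetical flag
            ocf_log err "peer state unknown, refusing to promote"
            return $OCF_ERR_GENERIC
        fi
        drbdadm primary $OCF_RESKEY_drbd_resource || return $OCF_ERR_GENERIC
        return $OCF_SUCCESS
    }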

> =====
> 
> What does it do, and why does it do it this way?
> 
> How DRBD behaves when we think that "someone" should do fencing:
> 
> for example:
> 
>  we are Connected Primary/Secondary
>  we lose our replication channel
> 
>  "fencing = dont-care" --> we just keep going,
>         (we are the primary after all!)
>         This is basically how drbd 0.7 behaves.
>         This risks diverging data sets, e.g. in case of split brain.
>         (NOTE: since DRBD is a "shared-nothing shared disk", we do
>         _NOT_ risk file system corruption; whether diverging data
>         sets are better or worse is another story)

Ok.
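
(In drbd.conf terms I assume this mode is simply something like the
following; I'm taking the keyword from your description, so the exact
spelling and section placement may well differ in the final syntax:

    resource r0 {
        disk { fencing dont-care; }   # just keep going; 0.7-like behaviour
        # ... rest of the resource definition as usual
    }
)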

>  "fencing = resource" --> we invoke the "outdate-peer-handler",
>         Which up to now had been a hackish script using ssh,
>         but can now be configured to use the new drbd-outdate-peer
>         heartbeat plugin.
> 
>         heartbeat should not have stonith configured here, or we
>         risk that, in the event of total communication loss -->
>         stonith, the other node might win, and we might have
>         acknowledged transactions within the time period
>         "connection loss" to "being stonithed",
>         which then will be gone.
> 
>         This uses the heartbeat communication links, but
>         completely bypasses the crm or any heartbeat authority.
>         This is on purpose. It is only one of several possible
>         implementations of the concept.

I don't think you need a plugin. The fact that the link is broken but
that you're not getting notifications about the other side implies that
indeed, the link is broken (and not the other side gone).
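
(Just to make sure I read the quoted setup right, I assume that in
drbd.conf terms it boils down to roughly this; keywords taken from your
description, section placement guessed, and the ssh command is just the
hackish variant you mention, with the plugin-based helper dropped in
instead:

    resource r0 {
        disk     { fencing resource; }
        handlers {
            # mark the peer's data outdated via some out-of-band channel
            outdate-peer "ssh the-peer drbdadm outdate r0";
        }
        # ...
    }
)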

>  "fencing = resource-and-stonith"
>         We expect heartbeat to have stonith configured, so we will
>         freeze all io immediately, invoke the outdate-peer-handler,
>         and will only unfreeze io when this handler returns success
>         -- or some higher authority explicitly unfreezes us.
>         The handler should attempt (or trigger) resource level
>         fencing first (mark peer as "outdated"), and fall back to
>         stonith if resource level fencing did not work out
>         (peer unreachable).

Again, this should work with the current RA already. You can stay frozen
until we "disconnect" you from the peer, essentially. (The current drbd
RA does so when it receives a notification that the other side has gone
away/been stopped, and re-connects when it is started again.)
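
A rough sketch of that notification path (not the literal code of the
current RA, and assuming the notify environment variables and the drbd 8
resume-io command keep their present names):

    drbd_notify() {
        # Once we are told the peer's clone instance has been stopped
        # (or fenced), frozen io on the surviving primary can continue.
        if [ "$OCF_RESKEY_CRM_meta_notify_type" = "post" ] &&
           [ "$OCF_RESKEY_CRM_meta_notify_operation" = "stop" ]; then
            drbdadm resume-io $OCF_RESKEY_drbd_resource
        fi
        return $OCF_SUCCESS
    }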

>         This could also be called the "oracle mode",
>         though Oracle people probably want write-quorum >= 2,
>         (which will be implemented someday, too ....)
> 
>     !!we need some help here!!
>         In this case the drbd outdate-peer-handler would need to
>         communicate with the crm in the fallback case
>         (peer unreachable, resource-fencing not possible),
>         if only to ask whether the other node got stonithed,
>         or to wait for the stonith operation to take place and complete,
>         or even to trigger such a stonith operation, then wait for
>         it to complete.

Again, if we've fenced the other node as such, you'll get a notification
that the clone has been stopped.
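
So the fallback in the handler could stay fairly dumb; roughly like the
following (the helper name, the exit codes and the cl_status check are
only there to show the shape of it -- and, as you note, "peer looks dead"
is not the same as "peer has been fenced", which is exactly where CRM
support would be needed):

    if outdate_peer_via_cluster_comm; then          # hypothetical helper
        exit 0    # peer marked outdated, io may resume
    elif [ "$(cl_status nodestatus $peer)" = "dead" ]; then
        exit 0    # peer looks gone -- but "dead" is not "fenced"!
    else
        exit 1    # could not outdate the peer, io stays frozen
    fi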

>  Does this make sense so far?
> 
> Some more notes:
> 
>  STONITH should, if configured, always be implemented as "switch off",
>  not as "reset", to avoid them stonithing each other.
>  (assume a two-node cluster...
>   there is some problem with quorum-based decisions here)

Agreed. This can be configured already, I think.
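
(If memory serves it is something along these lines; the attribute name
may not be exactly right, so treat it as a pointer rather than a recipe:

    crm_attribute -t crm_config -n stonith_action -v poweroff
)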

>  If you configure drbd fencing=resource, but have stonith configured in
>  heartbeat, that is a configuration error.
> 
>  If you configure drbd fencing=resource-and-stonith, but have no
>  stonith configured in heartbeat, that will freeze io unnecessarily.
> 
>  If you have fencing = resource, and no stonith configured, we need not
>  freeze io and still avoid diverging data sets even during total
>  communication loss: a secondary that has any doubt about the peer's disk
>  state will refuse to become primary, whereas a primary that does not
>  know about its peer's disk state will continue to be primary.
> 
>  If after a cluster crash the cluster should come up without
>  communication, one cannot promote drbd to primary until communication
>  is restored or "some authority" explicitly assures one of the nodes
>  that the other node has been fenced.

Agreed.
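
(We could even have the RA warn about the first two mismatches at start
time; both names below are made up, this is just to show the idea:

    if [ "$fencing_policy" = "resource" ] && stonith_is_enabled; then
        # hypothetical check: drbd resource fencing plus stonith in
        # heartbeat is the configuration error described above
        ocf_log warn "drbd fencing=resource with stonith enabled: pick one"
    fi
)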

>  If one node knows that the peer's disk is "bad" (has been marked "outdated",
>  is "inconsistent"), this is stored in meta data, so that a degraded
>  cluster may crash/reboot and become primary anyway.
> 
>  Obviously we do not store "peer's disk is good", that would be stupid.
> 
> Comments, Please...
> 
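
(The meta data part is nice in that it is easy to verify by hand; I'm
hedging on the exact state strings, but roughly:

    drbdadm outdate r0   # mark the local data outdated, persisted in meta data
    drbdadm dstate r0    # shows something like Outdated/DUnknown, also after a reboot
)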

Sincerely,
    Lars Marowsky-Brée

-- 
High Availability & Clustering
SUSE Labs, Research and Development
SUSE LINUX Products GmbH - A Novell Business
"Ignorance more frequently begets confidence than does knowledge"
	-- Charles Darwin

_______________________________________________________
Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/