
List:       linux-ha-dev
Subject:    [Linux-ha-dev] I/O Fencing Proposal, Draft 0.4 (Grits/Natalie)
From:       David Brower <dbrower@us.oracle.com>
Date:       2000-03-27 22:39:15

Here is a new draft of the proposal for providing a framework for
I/O fencing, and for implementation with NFS storage.

(Nobody saw 0.3x.)

I am particularly interested in feedback on the use of HTTP for NATALIE,
and the encoding suggested.


thanks!

-dB

============================================================================

		      I/O Fencing for Clusters

	  Generic Resource Intervention Tool Service (GRITS)
				 and
     NFS Admin Tool And Location Intervention Extension (NATALIE)

			   ----------------

	     David Brower (mailto:dbrower@us.oracle.com)
		John Leys (mailto:jleys@us.oracle.com)
	      Gary Young (mailto:gdyoung@us.oracle.com)

History

    public version 0.4 27-Mar-00

	- Verbs now deal with errors, quorum check, and generation wrap.
	- GRITS for quorum resolution discussed.
	- HTTP/S for NATALIE

Discussion Archives:

    http://lists.tummy.com/pipermail/linux-ha-dev/

Abstract

    Cluster systems with shared resources, such as disk, need
    "fencing" of those resources during and after membership
    reconfigurations.  There are no general solutions to providing a
    mechanism for fencing in the Open Standards world, with existing
    solutions tied tightly to particular membership services and i/o
    systems.  This note outlines the architecture of a generic service
    (GRITS) for organizing fencing interactions between membership
    services and resources.  It also describes a mechanism (NATALIE)
    by which NFS services may be extended to become a safely fenceable
    resource under GRITS.  Other resources, such as shared-scsi disk
    drivers, SAN switches and the like, should be equally capable of
    becoming GRITS-able partners.  Because the solution is openly
    released, it is hoped that system providers, SAN vendors, and the
    purveyors of storage systems will incorporate appropriate agents,
    allowing for reliable clusters with shared, fence-able resources.

GRITS Architecture

    A GRITS cluster consists of:

	Some number of nodes;

	Some number of membership quorum group services.
	    Groups have quorum generation numbers of
	    at least 16 bits; wraps are allowed, and handled.

	Some number of resources used by quorum groups.

    Nodes are identified by either IP address or resolvable name.
    Resources are identified the same way, and are assumed to have at
    least an IP capable proxy -- something that will respond to IP,
    even if it needs to take some other path to an actual resource.

    Each GRITS membership quorum group has a configuration identifying
    the resources that may be used by the group, including the
    destination for GRITS control messages to the resource.  The
    quorum group service also provides multiple access points for
    resources to query the group when that is necessary.  Each GRITS
    group issuing commands and responding to queries is required to
    have established quorum.  Each has a generation number, which is
    only seen outside the membership service once quorum has been
    established.

    Each GRITS resource has a configured list of quorum groups and
    hosts that may possibly access it.  The configuration identifies
    the destinations for queries by the resource to the group.  The
    resource itself has at least one identified access point for
    control messages sent to it by the group services.

    (The configurations of groups and resources are expected to be
    slowly changing, and their control is not defined by GRITS.)

    GRITS controls access to resources by nodes depending on the
    quorum group membership.  Access is either permitted or denied to
    whole nodes, with no finer granularity.

    [Finer granularity is desirable, but hard to achieve.  It would
    seem to be necessary to associate groups with processes, and make
    the groups, or a group cookie or key get carried along with
    requests on behalf of processes.  For instance, the key associated
    with a fibre-channel persistent reservation might be an excellent
    way to allow/disallow members.  It may be very difficult to
    arrange for the key sent by the driver for an i/o on behalf of one
    process to be different than the key used for i/o by another
    process.]

    Resources that must remain writable to all during cluster
    transition, perhaps because they are used as part of the
    membership quorum resolution, should not be under GRITS control.

    --> Fencing can be used as part of quorum resolution.  This can be
    done with a separate "quorum" group, arbitrating access to the
    quorum resource.  This is discussed in the Appendix.

    At resource boot time, the resource examines the configuration,
    and adopts an access posture towards the potential members of all
    the groups.  First, it applies the configured boot policy associated
    with each group member.  Then it may also use GRITS-defined
    messaging to communicate with the configured membership groups to
    set the correct current access rights.  At a true cold boot, there
    may be no groups to respond, so the configured boot posture
    remains in effect until a quorum group is formed and issues
    commands to the resources.  The plausible initial policies are
    "read only" and "no access"; some resources may only be able to
    enforce "no access".  A "writable" boot policy would be defeat
    the purpose of the fence.
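
    As an illustration only, the boot-time logic might look roughly like
    the following Python sketch.  The boot policy table, the endpoint
    list, and the query_group() helper are invented for the example and
    are not defined by this proposal.

        # Illustrative sketch of resource-side boot posture (not normative).
        # In a real agent the boot policy and the group access points would
        # come from something like /etc/grits.resources.

        BOOT_POLICY = {"A": "ro", "B": "ro", "C": "ro", "V": "none"}

        def query_group(endpoint):
            """Ask one group access point for the current settings (a dict
            of node -> access), or return None if it is unreachable.
            Stubbed here; a real agent would speak ONC RPC or HTTP/S."""
            return None   # simulate a true cold boot: nobody answers

        def establish_boot_posture(endpoints):
            # Start from the configured, restrictive boot policy ...
            posture = dict(BOOT_POLICY)
            # ... then try to learn the real current state from any group.
            for endpoint in endpoints:
                settings = query_group(endpoint)
                if settings is not None:
                    posture.update(settings)
                    break
            return posture

        print(establish_boot_posture(["onc://V/grits/group/V",
                                      "onc://A/grits/group/A"]))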

    Once an initial posture is established by a resource, membership
    change events in the quorum group drive GRITS control messages to
    all the resources configured for the group.  These will deny
    access to departing members and allow access to continuing or
    joining members.  The quorum group cannot proceed out of its
    reconfiguration stage until the correct fencing of all resources
    has been accomplished.

    It is intended that "gritty" agents can be written and put in
    place for:

    - directly attached disks on shared SCSI.  The agent would
      communicate with some kernel-level code to manipulate the
      SCSI reset and SCSI reserve to arbitrate access to the resource;
      GRITS would talk to both sides to force things to a known state.

    - SAN attached storage, where the agent could program tokens or
      domains of the fabric to control access;

    - NFS attached storage, where an agent could use NATALIE
      capabilities to narrow access below that of the basic exports;

    - SMB attached storage, where an agent could communicate to the
      software doing the "sharing" to control access.

    - General network attached storage, where control may be achieved
      by filtering in a router or proxy between the nodes and the
      resources.

    - Worst Case group members, who will have a third party wired
      to their reset buttons to force them to be "fenced."  This
      is an always correct final solution.  The "X10" system can
      be used to turn off the power to particularly non-cooperative
      entities.

    Mixtures of these agencies may be needed, depending on the needs
    and topology of the cluster in question.  The resource providers
    may or may not be on hosts that are part of any group in question.

    OPEN: The current proposal does not address layers of fencing or
    escalation and isolation.  It might be useful to identify levels
    at which fencing may be stopped without doing higher levels.  For
    instance, if all disk i/o may be stopped by frobbing the
    fibrechannel switch, then turning off the power may not be
    necessary.

Protocols

    At the architectural level, GRITS is agnostic about the control
    protocols.  The service could be provided using a variety of
    communication mechanisms.  The messages are defined in terms of
    verbs that may be bound to different techniques.  In practice,
    there will need to be some commonality of protocol.  It will not
    do to have a resource attempt to query a group using a protocol
    the group does not support, nor can a group meaningfully send
    a membership change to a resource without common ground.

    The following protocols are likely candidates:

	ONC (Sun) RPC
	HTTP
	HTTPS

    Exact bindings and support are an area for discussion, decision,
    and documentation for interoperability.

Security

    Only "authorized" parties may be allowed to invoke the verbs.
    This is handled, pitifully, by a "cookie", a shared secret between
    resources and group services.  A secure protocol would protect the
    contents of the cookie, but is not an essential part of the
    architecture.  As is traditional in cluster discussions, we may
    presume for the moment that traffic between nodes and resources is
    on a secure network.

    Only current quorum holding membership services should be invoking
    commands, except that a member may always fence itself from
    resources.  (It may not unfence without obtaining quorum.)

    To enforce this, GRITS has some knowledge about quorum
    generations.  A quorum generation is an ever increasing number
    bumped at the time the membership is changed and confirmed.  This
    is distinct from a raw cluster, which may exist without the
    presence of quorum.  For purposes of GRITS, only quorum
    generations exist, and cluster generations are never seen.  For
    example, a cluster with a quorum generation of 10 experiences a
    partition, which drives reconfiguration.  Several partitions may
    each decide to have generation 11 as they seek quorum.  All but
    one of these will lose the quorum determination, and their
    existence at generation 11 will never be seen by GRITS.  Only the
    surviving quorum holder at generation 11 may issue GRITS commands.
    Therefore, a GRITS resource need only obey commands from the
    latest quorum generation.  It may challenge, discard or return
    error to late-arriving commands from earlier generations.  The
    challenge protocol nests a "SetMe" call back from a resource
    to a group that is issuing a "Set" command.

Resource Settings

    A resourceSetting is a combination of

	{ resource, node, allow|deny }

    ResourceSettings are a list or array of resourceSettings to
    cover a set of resource/node bindings.
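
    For concreteness, one possible in-memory representation of these is
    sketched below in Python; the proposal does not fix any particular
    binding, so the names and types are illustrative only.

        # A resourceSetting triple and a resourceSettings list, as one
        # possible (hypothetical) representation.
        from dataclasses import dataclass
        from typing import List

        ALLOW, DENY = "allow", "deny"

        @dataclass
        class ResourceSetting:
            resource: str   # resource name, e.g. an export path
            node: str       # node IP address or resolvable name
            access: str     # ALLOW or DENY

        # resourceSettings covering a set of resource/node bindings:
        settings: List[ResourceSetting] = [
            ResourceSetting("/path", "nodeA", ALLOW),
            ResourceSetting("/path", "nodeB", DENY),
        ]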

Verbs

    Resource to GroupService

	errstat SetMe( in resourceName, out resourceSettings,
	               out groupGeneration );

	    The resource may optionally invoke a SetMe operation
	    against the group after it establishes boot posture.
	    If it succeeds, it may prevent surviving members from
	    seeing i/o errors that would result in more recovery
	    and cluster transition activities.

	    When a group member receives this call, it -MUST-
	    revalidate that it does currently hold the quorum.  If it
	    no longer has quorum, it returns that as an error status.

	    The resource may issue SetMe in response to a Set
	    operation if it suspects the invoker of the Set may not be
	    the quorum holder.  To prevent races, the SetMe is issued
	    -before- the return from the Set is sent.


    GroupService to Resource

	errstat Set( in cookie, in groupGeneration, in resourceSettings );

	    The set operation is used by a quorum holding group to
	    fence out hosts to be excluded, and to allow operations by
	    hosts that are members of the group.  It will only be
	    obeyed if issued by the group holding quorum.  The set of
	    members in the resourceSettings list should be complete,
	    covering all hosts to be excluded, and all those to be
	    included.

	    If the cookie provided does not match the configured
	    cookie, the Set request will be rejected and return error
	    status.

	    If the resource suspects the caller does not have quorum,
	    the resource may call the invoker back with a SetMe
	    operation.  The caller of Set must be able to respond to
	    the SetMe while the Set call is in progress.  Depending on
	    the response to the SetMe, the Set call will return with
	    different status.  This nested invocation prevents error
	    handling races in the group side, as the group may only
	    progress once the Set is completed successfully.

	    There are three situations where the resource may suspect
	    the quorum state of the caller:

		(1) The resource has booted, and has no previous
		    cluster generation to check against.
		(2) The group generation supplied is less than
		    the one remembered, signalling either wrap or
		    complete cluster reset.
		(3) A Set for a generation that has already been done,
		    which may be a duplicate of a split-brained
		    reconfiguration.

	    When a Set operation completes, denying i/o from some
	    hosts, it must be guaranteed that all i/o from those hosts
	    is over, and absolutely no more will be done.  This may
	    mean waiting for the completion of all outstanding i/o
	    requests if true cancellation is not possible.

	errstat Get( in resourceName, out resourceSettings );

	    The get operation is a management convenience, to query
	    the state of a resource.  It may be issued at any time
	    by anyone.

	    FIXME -- should Get be restricted by cookie?

    OPEN: the current proposal does not address hung Set operations.
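
    To make the Set/SetMe interplay above concrete, here is a rough
    Python sketch of how a resource might process a Set.  The cookie
    value, the challenge_group() and apply_settings() helpers, and the
    error strings are invented for the example; they are not part of
    the proposal.

        # Sketch of resource-side Set handling with the SetMe challenge.

        CONFIGURED_COOKIE = "shared-secret"   # hypothetical configured cookie
        last_generation = None                # no persistence assumed (item 2)

        def challenge_group(invoker):
            """Issue a nested SetMe back to the invoker.  Return (settings,
            generation) if it still holds quorum, or None if it does not.
            Stubbed here; a real agent would make the actual call."""
            return None

        def apply_settings(settings):
            """Enforce allow/deny per node.  Resource specific; a Set must
            not complete until all i/o from denied nodes has drained."""
            pass

        def handle_set(cookie, generation, settings, invoker):
            global last_generation
            if cookie != CONFIGURED_COOKIE:
                return "ERR_BAD_COOKIE"
            suspicious = (last_generation is None            # (1) fresh boot
                          or generation < last_generation    # (2) wrap or reset
                          or generation == last_generation)  # (3) possible duplicate
            if suspicious:
                answer = challenge_group(invoker)
                if answer is None:
                    return "ERR_NOT_QUORUM_HOLDER"
                settings, generation = answer
            apply_settings(settings)
            last_generation = generation
            return "OK"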

Expected use

    When the quorum group detects the death of a member, it uses GRITS
    to fence it off, by calling Set for all resources in use by the
    group, denying access by the deceased member.

    When the member comes back to life, and is granted access back
    into the group, the group uses another GRITS Set to re-enable
    access to the resources.

    It is up to the agent associated with the group to determine the
    exact access needed to particular resources.  It may be necessary
    to leave write access available to some resource that is used as
    part of group membership establishment, and/or quorum
    determination.
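
    A rough sketch of this group-side reaction follows, again purely
    illustrative: the resource list would come from the group's
    configuration, and send_set() is a stand-in for the actual ONC or
    HTTP/S call.

        # Sketch of the group-side reaction to a membership change.
        from concurrent.futures import ThreadPoolExecutor

        def send_set(resource, cookie, generation, settings):
            """Issue one GRITS Set to a resource; stubbed here."""
            return "OK"

        def fence_on_membership_change(resources, cookie, generation,
                                       members, departed):
            # Build a complete settings list: deny departed nodes, allow members.
            settings = ([(r, n, "deny")  for r in resources for n in departed] +
                        [(r, n, "allow") for r in resources for n in members])
            # Issue the Sets in parallel, so one slow resource does not
            # serialize the whole reconfiguration (see item 6 in the summary).
            with ThreadPoolExecutor() as pool:
                results = list(pool.map(
                    lambda r: send_set(r, cookie, generation, settings),
                    resources))
            # Reconfiguration may not proceed until every Set has succeeded.
            return all(result == "OK" for result in results)

        print(fence_on_membership_change(["N"], "shared-secret", 12,
                                         members=["A", "B"], departed=["C"]))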

NATALIE - NFS Administration Extensions

    The NATALIE extensions to NFS provide additional network
    interfaces to perform manipulation of the live state of the NFS
    servers.  In particular, NATALIE supports forced eviction of
    clients that have directories mounted, and the cancellation of
    operations they may have in progress.  This is principally useful
    for cluster "fence off", but is administratively useful on its own
    merits in non-cluster environments.

    The main verbs are nearly the same as the GRITS GroupService to
    Resource verbs, with the following exceptions: (1) The generation is
    not needed, as NATALIE is not specific to group membership, and
    (2) the mode is not allow or deny, but an NFS export access right,
    such as "rw".  The GRITS agent translating to NATALIE must do the
    appropriate mapping.

    We propose to use HTTP/S for NATALIE communication rather than
    ONC/RPC.  The two operations of interest would be encoded as a
    GET or a POST operation.  The following variables may be sent
    with the query:

	secret		# the NFS control cookie
	sa		# sent action, "Allow Changes" "Get Current" "Change"
    
    If the returned page contains "<H2>ERROR", then an error was
    encountered during processing of the request.  If the page
    contains "<H2>Success", then the operation succeeded.

    If "Change" is the sent action, the following variables describe
    the directories and access rights to be set.

	dir1		# directory name 1
	acc1		# access rights for dir1
	dir2		# directory name 2
	acc2		# access rights for dir2
	...
	dirN
	accN

	Where the accN variable is of the form:

	    accspec     ::=     nodespec | nodespec : accspec
	    nodespec    ::=     node = rights
	    rights      ::=     rw | ro

	Example:

	    node1=rw:node2=ro:node3=rw
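
    To illustrate the mapping a GRITS agent might perform (translating
    allow/deny into export access rights, as noted earlier), here is a
    small Python sketch.  How a denied node should be expressed is left
    to the NATALIE implementation; omitting it from the accspec, as
    done here, is only one plausible choice.

        # Sketch: turn GRITS (directory, node, allow|deny) triples into
        # per-directory accspec strings of the form node=rights:node=rights.

        def grits_to_natalie(settings):
            by_dir = {}
            for directory, node, mode in settings:
                specs = by_dir.setdefault(directory, [])
                if mode == "allow":
                    specs.append(node + "=rw")
                # denied nodes are simply left out of the accspec here
            return {d: ":".join(specs) for d, specs in by_dir.items()}

        print(grits_to_natalie([("/path", "node1", "allow"),
                                ("/path", "node2", "deny"),
                                ("/path", "node3", "allow")]))
        # -> {'/path': 'node1=rw:node3=rw'}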

    If "Get Current" is the sent action, then the result page will
    contain a table having two columns, whose first column is a
    directory, and whose second is an access spec as shown above.

    Only these two sent actions, "Change" and "Get Current", need be
    recognized by a Natalie implementation.  The actual pages returned
    by a Natalie may have other actions at the discretion of the
    implementor.
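
    As a sketch of the wire encoding, a GRITS agent might issue a
    "Change" roughly as in the Python fragment below.  The URL, the
    helper name, and the way success is detected are assumptions made
    for illustration and are not fixed by this proposal.

        # Sketch of a NATALIE "Change" request over HTTP (illustrative).
        import urllib.parse
        import urllib.request

        def natalie_change(url, secret, dir_accs):
            """POST a Change.  dir_accs is a list of (directory, accspec)
            pairs, sent as dir1/acc1, dir2/acc2, ..."""
            fields = {"secret": secret, "sa": "Change"}
            for i, (directory, accspec) in enumerate(dir_accs, start=1):
                fields["dir%d" % i] = directory
                fields["acc%d" % i] = accspec
            data = urllib.parse.urlencode(fields).encode()
            with urllib.request.urlopen(url, data=data) as resp:
                page = resp.read().decode()
            if "<H2>ERROR" in page:
                return "ERROR"
            if "<H2>Success" in page:
                return "OK"
            return "UNKNOWN"

        # Hypothetical usage against a NATALIE service on host N:
        # natalie_change("http://N/natalie", "cookie",
        #                [("/path", "node1=rw:node3=rw")])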

    It is not specified whether NATALIE can grant more permissions than
    are present in some static exports configuration, as specifying this
    would constrain the NATALIE implementation.  Enforcing such a limit
    is nonetheless desirable from a security perspective.

    When there are multiple and dynamic NFS servers on a machine, the
    NATALIE settings need to be coherently replicated across them all.
    That is, when a Set operation completes disallowing i/o from a
    host, it must be guaranteed that all i/o from that host is over,
    and no more will be done.

    Four plausible implementations of NATALIE follow:

    1.  At the time of a Set that changes current values, all nfsds
        and mountds are killed.  The exports file (or equivalent) is
        updated to reflect the permissions being set, and the daemons
        are restarted.

	The error propagation of this is questionable; survivors
	may receive errors that they ought not see.

        It is important that Natalie not be able to allow permissions
	that were not tolerated by the NFS configuration.  It is also
	important that the proper boot posture is established.  For
        use with GRITS, there would need to be some intertwining of
	the NFS shutdown with the SetMe query.

    2.  Add control hooks into the nfsds to reject operations
	from fenced hosts.  This requires memory of the nodes
        being fenced.  You may also need hooks into mountd.

    3.  Create an NFS proxy, which implements the NATALIE filtering
        but otherwise forwards the request on to an unmodified
	NFS server.  This is inefficient, but totally generic.
	The NATALIE service need not even reside on the same
	node as the NFS service.  It should not, however,
	reside on any of the nodes being fenced!

    4.	Change the exports, then reboot the node.

    If a NATALIE service of type 2 or 3 comes up as configured by
    exports, and does not make the GRITS query, then a retried write
    from an evicted/frozen node might be allowed.  This would be bad.  One
    solution is to put the NATALIE state in persistent storage on the
    server.  Another is to have the NATALIE be started by the GRITS
    agent after it queries the group about the boot posture.

Examples

    These examples pretend to use ONC as the communication
    mechanism.  Others may do as well.

    NFS/NATALIE Cluster
    -------------------

    A three node cluster A, B, C uses shared storage provided by
    service N, mounted from export point /path.  Together, the quorum
    of A, B, C form a virtual IP host V.

    Whenever there is a membership change in V, an agent receives
    the change and issues GRITS Set commands to all the resources.


    On each of the nodes A, B or C, there is a configuration of GRITS
    resources used, perhaps in the following form:

	    % cat /etc/grits.groups

	    # GRITS resources used by group V
	    V	N   onc://N/grits/resource/path

    This tells the group-side code that to issue commands to the
    resource for N, one uses onc to that location.  It also
    says that at boot, only read access is required.

    On the resource providing N, the GRITS agent is configured:

	    % cat /etc/grits.resources

	    # GRITS groups using resource N

	    N	/path   onc://V/grits/group/V(none) \
			onc://V/grits/group/A(r) \
			onc://V/grits/group/B(r) \
			onc://V/grits/group/C(r)

    This tells the resource side of GRITS that on booting, it may
    contact any of V, A, B, or C for current state, and that until it
    makes successful contact, it should allow only r access to the
    actual nodes.  There should never be requests from the virtual
    node V, so it is given no access.

    Shared SCSI Cluster
    -------------------

    Two nodes A and B have a shared SCSI bus arbitrating disk D,
    forming group G.  Each node runs both the membership and the scsi
    GRITS agent; there is no shared IP.

	    % cat /etc/grits.groups

	    # GRITS resources used by group G
	    G	D   onc://A/grits/resource/path
	    G	D   onc://B/grits/resource/path

	    % cat /etc/grits.resources

	    # GRITS groups using resources

	    D	/path	onc://A/grits/group/A(r) \
			onc://B/grits/group/B(r)


Summary of open areas and problems
----------------------------------

OPEN    1.  Can we use fencing to resolve quorum?  How would that
	    actually work?  This is supported by having resources
	    challenge quorum state when they think a Set is
	    "suspicious", and discussed in the Appendix below.

CLOSED  2.  Do we need persistence in the resource agent to determine
	    the correct cluster generation to listen to?  This gets
	    particularly complicated during nested reconfigs, with
	    some delayed member believing it has quorum, when the
	    actual cluster has moved on beyond.

	    We think persistence is not needed, given the SetMe challenge
	    protocol.  Persistence will enhance performance by reducing
	    the number of unnecessary challenges.

OPEN    3.  It may be desirable to support configuration of explicit
	    hierarchies of fencing points, stopping at the lowest one that
	    will work rather than going all the way up to "shoot the node".

OPEN    4.  Error reporting and propagation have not been addressed.  If
	    an attempt to fence fails, what do we do?  This leads to
	    hierarchies above.

OPEN    5.  Finer granularity than node may be extremely desirable.
	    Doing so is difficult, seemingly requiring kernel level
	    hooks to attach "group" attributes to processes to be
	    attached to requests for resources; this involves getting
	    into device drivers, and gets very messy.

CLOSED  6.  Performance reasons forced a change to resourceSettings
	    in the Set command, so we can batch a bunch of requests
	    in one shot.  But we may still need to do Sets in parallel
	    if there are a lot of them to talk to, and we don't want
	    to serialize on their potentially lengthy responses.

	    Set now takes the full set of hosts in question.

CLOSED	7.  To resolve wrap, we were tempted to use some arithmetic
	    subtleties.  This isn't necessary with the challenge
	    protocol, and lets us be agnostic about the number of
	    bits in the generation number.

OPEN	8.  What ought to happen, globally, if some resource or
	    group hangs or doesn't respond quickly to a command
	    or challenge?

Appendix - Use of GRITS to resolve quorum
-----------------------------------------

    In some cases, the group membership service may have a hard
    time determining quorum.  One possible resolution is to use
    access to a quorum resource to resolve the dispute.  Access
    can be controlled with fencing, but this presents recursion
    problems, as GRITS insists the group have quorum before doing
    any set operations.

    It should be clear that the quorum device is very special --
    it is used to resolve quorum only, and that this is not the
    same thing as fencing off a large number of resources that
    need protection to prevent corruption.

    To model this situation in GRITS, we will say that there
    are two groups, "group", accessing the actual protected
    resources, and "group-manager", which decides meta-access
    to the quorum device.

	# GRITS groups using data1, data2, data3, ...

	group /data1   onc://V/grits/group/V(none) \
		       onc://V/grits/group/A(r) \
		       onc://V/grits/group/B(r)

	group /data2   onc://V/grits/group/V(none) \
		       onc://V/grits/group/A(r) \
		       onc://V/grits/group/B(r)

	group /data3   onc://V/grits/group/V(none) \
		       onc://V/grits/group/A(r) \
		       onc://V/grits/group/B(r)

	group-manager /quorum   onc://V/grits/group-manager/V(none) \
				onc://V/grits/group-manager/A(r) \
				onc://V/grits/group-manager/B(r)

    When there is a quorum dispute, the group service will attempt to
    write the /quorum device.  If it has access, it is in the quorum
    group.

    During a split-brain reconfiguration, nodes A and B want to claim
    quorum.  Not knowing of the other's existence, each attempts to
    fence out the other.  The first one in, say node A, will succeed
    in fencing B out.  Node B may think it has quorum and issue
    its own Set, but this will receive a SetMe challenge in
    response.  On receiving the challenge, B will try to access the quorum
    device, and fail because of the fencing put in place by node A.

    Now say the fence of B is in place, and A dies.  If the partition
    still exists, B won't know, and the system is dead.  If the
    partition is resolved, and B is notified, it must grab the quorum
    device in some way.  It will need to establish on its own that A
    is dead, issue a Set, and when challenged by SetMe, it will claim
    the device even though it is currently fenced out, and fence A out
    in the process.

    It is very difficult for B to distinguish between the cases where
    A is truly dead (and that B needs to step in), and those where the
    network partition remains in place, A is working, and B needs to
    stay dead.  These problems are left for group membership service
    writers to resolve; GRITS does not provide an easy solution.

Acknowledgements
----------------

    We'd like to thank the readers of the gfs-devel@borg.umn.edu and
    linux-ha-dev@lists.tummy.com lists for their indulgence
    in hosting discussions of this proposal.  We'd also like to thank
    the following individuals who have provided significant feedback:

        Stephen C. Tweedie (sct@redhat.com)
	Phil Larson <Phil.Larson@netapp.com>


-- 
Butterflies tell me to say:
"The statements and opinions expressed here are my own and do not necessarily 
  represent those of Oracle Corporation."

_______________________________________________________
Linux-HA-Dev: Linux-HA-Dev@lists.tummy.com
http://lists.tummy.com/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/
