List:       lustre-devel
Subject:    [Lustre-devel] recovering from gateway failures
From:       Eric Barton <eric () bartonsoftware ! com>
Date:       2003-08-21 0:47:51

When a single lustre filesystem runs on more than one cluster, the lustre
RPC messages must be forwarded between the clusters.  This forwarding takes
place on so-called gateway nodes.  For example, two elan clusters might be
connected to each other by gigE.  The nodes in both clusters that connect
to the gigE network(s) have to "shovel" lustre RPC messages between their
elan and gigE network connections (i.e. between the qswnal and socknal).

This mail is about how to exploit redundant gateways so that a lustre
filesystem spread on multiple clusters remains functional in the presence
of individual gateway failures.

At this time, I have implemented redundant gateways in the portals router.
This provides (a) load balancing over equivalent routes and (b) a means
to enable and disable particular gateways.  The idea is that when gateway
failure is detected, all relevant nodes disable routes that use that
gateway until the gateway is rebooted.
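
For concreteness, here is a rough userspace sketch of those two mechanisms;
the names (struct kpr_route, kpr_select_route, kpr_set_gateway_state) are
made up for illustration and are not the actual router code:

    /* Sketch only: each route records its gateway and an 'enabled' flag;
     * selection round-robins over the enabled routes, and failure/reboot
     * notification flips the flag for every route through a gateway. */
    #include <stdint.h>
    #include <stddef.h>

    typedef uint64_t nid_t;             /* network id of a node (assumed) */

    struct kpr_route {
        nid_t gateway_nid;              /* gateway that forwards for us */
        int   enabled;                  /* cleared when failure detected */
    };

    /* (a) load balancing: pick the next enabled route in round-robin
     * order, or NULL if every gateway on the list has been disabled */
    static struct kpr_route *
    kpr_select_route(struct kpr_route *routes, int nroutes, unsigned *rr)
    {
        for (int i = 0; i < nroutes; i++) {
            struct kpr_route *r = &routes[(*rr + i) % nroutes];

            if (r->enabled) {
                *rr = (*rr + i + 1) % nroutes;   /* advance cursor */
                return r;
            }
        }
        return NULL;
    }

    /* (b) enable/disable: mark every route using 'gateway_nid' up or
     * down; failure and reboot notification would both end up here */
    static void
    kpr_set_gateway_state(struct kpr_route *routes, int nroutes,
                          nid_t gateway_nid, int alive)
    {
        for (int i = 0; i < nroutes; i++)
            if (routes[i].gateway_nid == gateway_nid)
                routes[i].enabled = alive;
    }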

The remaining issues are (a) how we detect the failure of a gateway and
trigger the disabling of routes that use it, (b) how we detect a gateway's
return to service and trigger re-enabling of those routes, and (c) how this
interacts with lustre.

Note that this discussion restricts itself to multi-cluster configurations
in which only immediate network neighbours of gateways need to be notified
of failure; other configurations probably aren't relevant since portals
message forwarding is a big latency hit and such networks should be avoided
anyway.


A. Gateway Failure.

I think the best place to detect gateway failure is in the relevant NAL on
the gateway's immediate neighbours.

Currently the elan and socket NALs can detect peer failure reliably and
could (a) notify the local router that this gateway should be disabled and
(b) notify the world via an upcall.

Note that detecting peer failure necessarily involves one or more messages
(incoming or outgoing) getting dropped, so local router notification
ensures that all further messages can avoid the failed gateway and the
upcall can be used to pro-actively inform other nodes of the problem,
rather than letting them find out for themselves.  This should also avoid
any possibility of a "chained timeout" problem.
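
A sketch of the NAL-side hook this implies; kpr_notify_gateway_down() and
the /usr/lib/lustre/gateway_down upcall path are illustrative names only,
not the real interfaces:

    /* Sketch: on detecting peer failure the NAL (a) tells the local
     * router to stop using that gateway and (b) runs an upcall so other
     * nodes can be told proactively. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <stdint.h>
    #include <inttypes.h>

    typedef uint64_t nid_t;

    static void
    kpr_notify_gateway_down(nid_t gateway_nid)
    {
        /* the real router would clear the 'enabled' flag on every route
         * through gateway_nid; here it just logs the event */
        printf("disabling routes via gateway %" PRIu64 "\n", gateway_nid);
    }

    static void
    nal_peer_failed(nid_t gateway_nid)
    {
        char cmd[128];

        /* (a) local notification: no further messages use this gateway */
        kpr_notify_gateway_down(gateway_nid);

        /* (b) upcall: let a script inform the other network neighbours */
        snprintf(cmd, sizeof(cmd),
                 "/usr/lib/lustre/gateway_down %" PRIu64, gateway_nid);
        if (system(cmd) != 0)
            fprintf(stderr, "gateway_down upcall failed for %" PRIu64 "\n",
                    gateway_nid);
    }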


B. Gateway Reboot.

Spookily, the gateway itself knows when it has rebooted, so it seems
sensible that the script that configures lustre on the gateway should also
trigger notification on the network neighbours (of all the gateway's NALs),
rather than coming up with some clever scheme for detecting when the
gateway has returned to service on these neighbours.
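
The neighbour-side handler that such a notification ends up calling could be
as simple as the following sketch (again, hypothetical names rather than the
actual portals control interface):

    /* Sketch: the gateway's configuration script pokes each neighbour,
     * which re-enables its routes through the rebooted gateway. */
    #include <stdio.h>
    #include <stdint.h>
    #include <inttypes.h>

    struct route {
        uint64_t gateway_nid;
        int      enabled;
    };

    static void
    router_gateway_up(struct route *routes, int nroutes, uint64_t gw_nid)
    {
        /* re-enable every route through the rebooted gateway */
        for (int i = 0; i < nroutes; i++)
            if (routes[i].gateway_nid == gw_nid)
                routes[i].enabled = 1;

        printf("routes via gateway %" PRIu64 " re-enabled\n", gw_nid);
    }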


C. Impact on Lustre.

When a gateway fails, at least 1 message gets dropped; think of it as
portals networking providing a very good (but obviously not perfect) "best
efforts" message delivery mechanism.

Whenever a message gets dropped a lustre client or server will time out.
This triggers lustre's "normal" recovery actions which will be able to
proceed immediately if the failed gateway has been disabled (and an
alternative route exists).

The key issue here is "Can we ensure that the failed gateway has been
disabled when lustre notices something amiss?"  

Unfortunately, the answer has to be "no", since (a) lustre timeouts can be
set arbitrarily with no reference to the NALs and (b) NALs can only
guarantee that messages complete (possibly with failure) in finite, but not
bounded time.

However this is of no consequence provided (a) lustre can insulate itself
from anything the network throws at it after it has abandoned a particular
RPC attempt and (b) lustre can recover from failed recovery attempts.

I think (a) is almost implemented already, since we unlink all ME/MDs
associated with an RPC after a timeout occurs.  However, we should ensure
we never re-use the same matchbits for the RPC reply, even on a retry of
the same transaction.  Personally I'd also like to _not_ reuse the same
matchbits for bulk, even if bulk is idempotent; IMHO it's just good
practice to use unique matchbits for each RPC attempt.  (I think this might
mean we have to change the protocol to stop overloading transaction number
and reply matchbits).
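
A sketch of the "unique matchbits per attempt" idea; alloc_matchbits() and
struct rpc_attempt are illustrative, not the ptlrpc interface:

    /* Sketch: reply (and bulk) matchbits come from a monotonically
     * increasing counter, so a retry of the same transaction never
     * reuses an earlier attempt's matchbits. */
    #include <stdint.h>

    static uint64_t next_matchbits = 1;  /* per-connection + locked in
                                          * real code */

    static uint64_t
    alloc_matchbits(void)
    {
        return next_matchbits++;
    }

    struct rpc_attempt {
        uint64_t transno;                /* stable across retries */
        uint64_t reply_matchbits;        /* unique to this attempt */
        uint64_t bulk_matchbits;         /* unique to this attempt */
    };

    /* a retried RPC keeps its transaction number but gets new matchbits */
    static void
    start_attempt(struct rpc_attempt *a, uint64_t transno)
    {
        a->transno         = transno;
        a->reply_matchbits = alloc_matchbits();
        a->bulk_matchbits  = alloc_matchbits();
    }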

I'd also have thought that (b) should work right now since we can never
guarantee we don't experience failures during recovery in any case.

-- 

                Cheers,
                        Eric

----------------------------------------------------
|Eric Barton        Barton Software                |
|9 York Gardens     Tel:    +44 (117) 330 1575     |
|Clifton            Mobile: +44 (7909) 680 356     |
|Bristol BS8 4LL    Fax:    call first             |
|United Kingdom     E-Mail: eric@bartonsoftware.com|
----------------------------------------------------

