List: openais
Subject: Re: [Openais] Failure during recovery
From: Ruppert Koch <ruppert () rcsc ! de>
Date: 2005-05-03 16:25:29
Message-ID: 4277A5F9.4040000 () rcsc ! de
Hi Steve,
thanks for the welcome. The membership algorithm is a bit tricky. If at
least one node has entered operational mode, all others have to follow.
If all nodes are still in recovery mode, all nodes must roll back. For
the correctness of the algorithm, it must be ensured that a node enters
operational mode only when all nodes of the new ring have obtained all
recovered messages and are therefore able to enter operational mode.
This state is reached if my_retrans_flg_count >= 3 and
my_rotation_counter = 2.
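A minimal sketch of that check, written in Python for brevity. Only the
two counter names come from the text above; the RecoveryState class and
the may_enter_operational function are hypothetical names, not taken
from the openais sources.

```python
from dataclasses import dataclass


@dataclass
class RecoveryState:
    # Counter names as used in the message; the class is illustrative only.
    my_retrans_flg_count: int = 0
    my_rotation_counter: int = 0


def may_enter_operational(state: RecoveryState) -> bool:
    """Return True once the condition stated above holds."""
    return state.my_retrans_flg_count >= 3 and state.my_rotation_counter == 2
```

For example, may_enter_operational(RecoveryState(3, 2)) holds, while a
node with my_rotation_counter still at 1 has to stay in recovery.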
Unfortunately, it is impossible to decide whether a node P_i should
enter operational state before entering gather mode again or not. If
communication were reliable, a join message sent by a node P_j that
holds a later position in the new ring (j>i) would mean that P_j and all
following nodes have not reached that state yet and, therefore, all
nodes must roll back. It is possible, however, that the join message
sent by P_j is a response to another join message sent by a node P_k
that precedes P_i (k<i). In that case the join message says nothing
about whether P_j has reached that state.
The best way of solving this problem is to let the nodes decide whether
to roll back or not based on information in the join message. If the
ring sequence number were incremented whenever a node enters operational
state, an increased ring seq in a join message would indicate that the
sender has delivered the new membership and, therefore, the receiver
must follow suit.
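The decision could look roughly like the sketch below. It rests on
assumptions: the join message carries the sender's ring sequence number
in a field called ring_seq, and a node adds to that number when it
enters operational state. JoinMessage, Action and react_to_join are
hypothetical names. Treating every other join message as a reason to
roll back is a simplification made for this sketch.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    FOLLOW_TO_OPERATIONAL = "follow"
    ROLL_BACK = "roll back"


@dataclass
class JoinMessage:
    sender: str
    ring_seq: int  # assumed field: sender's current ring sequence number


def react_to_join(new_ring_seq: int, join: JoinMessage) -> Action:
    """Decide how a node still in recovery reacts to a join message.

    new_ring_seq is the sequence number of the ring being recovered.
    A larger value in the join message is taken to mean that the sender
    already entered operational state on that ring.
    """
    if join.ring_seq > new_ring_seq:
        return Action.FOLLOW_TO_OPERATIONAL
    return Action.ROLL_BACK
```

With new_ring_seq = 8, a join message carrying ring_seq = 9 makes the
receiver follow into operational state; one carrying ring_seq = 8 lets
it roll back.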
Join messages from foreign groups must be ignored during that phase.
Nodes are allowed to react to foreign join messages after the last node
installed the new membership. This knowledge is present after the first
full token round has been completed after the last node installed the
membership.
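As a rough sketch of that filter: the function name and both flags are
hypothetical, and how a node learns that a full token round has passed
since the last installation is left open here.

```python
def accept_join(sender_in_new_ring: bool,
                full_round_since_last_install: bool) -> bool:
    """Return True if a join message may be acted upon.

    Join messages from members of the new ring are always considered.
    Foreign join messages are ignored until one full token round has
    completed after the last node installed the new membership.
    """
    if sender_in_new_ring:
        return True
    return full_round_since_last_install
```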
Cheers,
Ruppert
Steven Dake wrote:
> Ruppert,
>
>welcome to the list. After reviewing your email and then the totem
>specification, it has become clear to me that the my_recieved_flg and
>received_flg in the commit token are there for the purpose of
>determining when recovery of a ring has completed, but the token may
>have been lost in the last rotation of the token before entering the
>operational state.
>
>Unfortunately when this code was originally implemented, I didn't
>understand the purpose of that flag. We can conditionally restore the
>old ring only if my_recieved_flg = false. If it is true, then recovery
>has completed, and there is no need to restore the ring since no
>messages will be ordered on it.
>
>Thank you for the bug report and jogging my brain. A little peer review
>is really a wonderful thing.
>
>If you have any other suggestions for the totem srp code we welcome
>them.
>
>regards
>-steve
>
>
--
Dr./USA Ruppert Koch
Reliable Computer Systems Consulting
Phone: +49 89 74326886, Mobile: +49 163 2862354
http://www.rcsc.de