List:       openais
Subject:    Re: [Openais] Failure during recovery
From:       Ruppert Koch <ruppert () rcsc ! de>
Date:       2005-05-03 16:25:29
Message-ID: 4277A5F9.4040000 () rcsc ! de

Hi Steve,

thanks for the welcome. The membership algorithm is a bit tricky. If at
least one node has entered operational mode, all others have to follow.
If all nodes are still in recovery mode, all nodes must roll back. For
the correctness of the algorithm, it must be ensured that a node enters
operational mode only when all nodes of the new ring have obtained all
recovered messages and are therefore able to enter operational mode.
This state is reached if my_retrans_flg_count >= 3 and
my_rotation_counter = 2.
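
As a rough sketch, the condition above could be written as a small C
predicate. The struct and function names are made up for illustration
and are not taken from the openais sources; only the two counter names
and their thresholds come from the text above.

```c
#include <stdbool.h>

/* Hypothetical per-node recovery state. Only the two counter names are
 * taken from the message; everything else is an assumption. */
struct recovery_state {
    unsigned int my_retrans_flg_count;
    unsigned int my_rotation_counter;
};

/* Returns true when the node may leave recovery and enter operational
 * mode, i.e. my_retrans_flg_count >= 3 and my_rotation_counter == 2. */
static bool may_enter_operational(const struct recovery_state *s)
{
    return s->my_retrans_flg_count >= 3 && s->my_rotation_counter == 2;
}
```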

Unfortunately, it is impossible to decide whether or not a node P_i
should enter operational state before entering gather mode again. If
communication were reliable, a join message sent by a node P_j that
holds a later position in the new ring (j>i) would mean that P_j and all
following nodes have not reached that state yet and, therefore, all
nodes must roll back. It is possible, however, that the join message
sent by P_j is a response to another join message sent by a node P_k
that precedes P_i (k<i).

The best way of solving this problem is to let the nodes decide whether 
to roll back or not based on information in the join message. If the 
ring sequence number was incremented when a node enters operational 
state, the increased ring seq would indicate that the sender has 
delivered the new membership and, therefore, the receiver must follow suit.
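
A minimal sketch of that decision in C, assuming the receiver is still
in recovery and compares the ring sequence number carried in the join
message against the one it holds for the new ring. The enum, the
function name and the exact comparison are assumptions made for
illustration, not existing openais code.

```c
#include <stdint.h>

/* Hypothetical outcomes for a node that receives a join message while
 * it is still in recovery. */
enum join_action {
    JOIN_ROLL_BACK,             /* sender has not installed the membership */
    JOIN_FOLLOW_TO_OPERATIONAL  /* sender already went operational */
};

/* my_ring_seq:   ring seq the receiver holds for the new ring
 * join_ring_seq: ring seq carried in the received join message
 *
 * Assumption: a node increments its ring seq when it enters operational
 * state, so a larger value in the join message means the sender has
 * delivered the new membership. */
static enum join_action on_join_during_recovery(uint64_t my_ring_seq,
                                                uint64_t join_ring_seq)
{
    if (join_ring_seq > my_ring_seq) {
        return JOIN_FOLLOW_TO_OPERATIONAL;
    }
    return JOIN_ROLL_BACK;
}
```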

Join messages from foreign groups must be ignored during that phase.
Nodes are allowed to react to foreign join messages only after the last
node has installed the new membership. A node knows this once the first
full token round following the last installation has been completed.
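
The filtering rule could look roughly like this in C. It assumes that
"foreign" means the sender is not on the new ring's member list, and
that the node keeps a flag which becomes true once the first full token
round after the last installation has completed. Both the flag and the
function names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* Linear search of the new ring's member list for the given node id. */
static bool is_ring_member(const unsigned int *ring, size_t n,
                           unsigned int sender)
{
    for (size_t i = 0; i < n; i++) {
        if (ring[i] == sender) {
            return true;
        }
    }
    return false;
}

/* Decides whether a join message should be processed.
 * all_installed_known is assumed to be set once the first full token
 * round after the last node installed the membership has completed. */
static bool should_process_join(const unsigned int *ring, size_t n,
                                unsigned int sender,
                                bool all_installed_known)
{
    if (is_ring_member(ring, n, sender)) {
        return true;             /* joins from ring members are handled */
    }
    return all_installed_known;  /* foreign joins wait until it is safe */
}
```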

Cheers,
Ruppert


Steven Dake wrote:

> Ruppert,
>
>welcome to the list.  After reviewing your email and then the totem
>specification, it has become clear to me that the my_recieved_flg and
>received_flg in the commit token are there for the purpose of
>determining when recovery of a ring has completed, but the token may
>have been lost in the last rotation of the token before entering the
>operational state.
>
>Unfortunately when this code was originally implemented, I didn't
>understand the purpose of that flag.  We can conditionally restore the
>old ring only if my_recieved_flg = false.  If it is true, then recovery
>has completed, and there is no need to restore the ring since no
>messages will be ordered on it.
>
>Thank you for the bug report and jogging my brain.  A little peer review
>is really a wonderful thing.
>
>If you have any other suggestions for the totem srp code we welcome
>them.
>
>regards
>-steve
>  
>


-- 
Dr./USA Ruppert Koch
Reliable Computer Systems Consulting
Phone: +49 89 74326886, Mobile: +49 163 2862354
http://www.rcsc.de
