'[Openais] Checkpoint Recovery Synchronization'


List:       openais
Subject:    [Openais] Checkpoint Recovery Synchronization
From:       "Muni Bajpai" <muniba () nortel ! com>
Date:       2005-02-17 18:35:44
Message-ID: CFCE7C3BDB79204092974B5B50AD7194100290 () zrc2hxm0 ! corp ! nortel ! com

Hey Steven,

So onto phase II. 

Do you have any preferences for the new message (struct
req_exec_ckpt_checkpointsynchronize)? I know you mentioned including the
previous regular configuration's ring_id in that message, but what else
should it carry? We either have to send every saCkptCheckpoint stored in the
list that checkpointListHead points to in a single message, or send a
separate sync message for each checkpoint. I prefer sending one message, but
then we have to decide on the type of the aggregated data.

Also the standard
struct req_header header;
struct message_source source;

should be a part of the new struct too.
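
To make the question concrete, here is a rough sketch of how the aggregated message could be laid out. It is only a starting point for discussion: the limits, the simplified stand-in types, the per-checkpoint entry struct and the two helper functions are all assumptions made for illustration, not existing openais code.

```c
#include <stdint.h>
#include <string.h>

/* Illustrative limits, not values taken from openais. */
#define PROCESSOR_COUNT_MAX 16
#define CKPT_NAME_MAX       32
#define CKPT_SYNC_ENTRY_MAX 8

/* Simplified stand-ins for the real openais types (layouts assumed). */
struct req_header     { int size; int id; };
struct message_source { uint32_t addr; };
struct memb_ring_id   { uint32_t rep; uint64_t seq; };
struct ckpt_refcount  { int count; uint32_t addr; };

/* One checkpoint's worth of reference counts (hypothetical layout). */
struct ckpt_sync_entry {
	char name[CKPT_NAME_MAX];
	struct ckpt_refcount refcount[PROCESSOR_COUNT_MAX];
};

/* Hypothetical layout for the aggregated sync message. */
struct req_exec_ckpt_checkpointsynchronize {
	struct req_header header;
	struct message_source source;
	struct memb_ring_id previous_ring_id; /* last regular configuration */
	uint32_t entry_count;
	struct ckpt_sync_entry entries[CKPT_SYNC_ENTRY_MAX];
};

/* Start an empty message stamped with the previous regular ring id. */
static void ckpt_sync_init(struct req_exec_ckpt_checkpointsynchronize *msg,
			   struct memb_ring_id previous_ring_id)
{
	memset(msg, 0, sizeof(*msg));
	msg->header.size = (int)sizeof(*msg);
	msg->previous_ring_id = previous_ring_id;
}

/*
 * Append one checkpoint. refcount must point at PROCESSOR_COUNT_MAX entries.
 * Returns 0 on success, -1 if the message is already full.
 */
static int ckpt_sync_add(struct req_exec_ckpt_checkpointsynchronize *msg,
			 const char *name,
			 const struct ckpt_refcount *refcount)
{
	struct ckpt_sync_entry *entry;

	if (msg->entry_count >= CKPT_SYNC_ENTRY_MAX)
		return -1;
	entry = &msg->entries[msg->entry_count++];
	/* The entry was zeroed by ckpt_sync_init, so the name stays terminated. */
	strncpy(entry->name, name, CKPT_NAME_MAX - 1);
	memcpy(entry->refcount, refcount, sizeof(entry->refcount));
	return 0;
}
```

A fixed-size array keeps the message free of memory allocation, at the cost of a hard cap on how many checkpoints fit in one message; past that cap the sender would have to fall back to several messages.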

I can't think of anything else.

Please let me know,

Thanks

Muni

-----Original Message-----
From: openais-bounces@lists.osdl.org [mailto:openais-bounces@lists.osdl.org]
On Behalf Of Bajpai, Muni [NGC:B670:EXCH]
Sent: Wednesday, February 16, 2005 1:31 PM
To: 'sdake@mvista.com'
Cc: openais@lists.osdl.org; markh@osdl.org; Smith, Kristen [NGC:B675:EXCH]
Subject: RE: [Openais] Checkpoint crash in aisexec

Ok steve, 

Thanks for the feedback. This is my take on the steps. 

I.) First Patch 
        1.) Move struct memb_ring_id from totemsrp.c to totemsrp.h 
        2.) Move #define MAX_MEMBERS from totemsrp.c to totemsrp.h and
            rename it to PROCESSOR_COUNT_MAX.
        3.) Make the corresponding changes to handlers.h, amf.c, ckpt.c,
            clm.c, evs.c, totemsrp.c, totempg.c

II.) Second Patch 
        Add support for sync on the ckpt service. 

Thanks 

Muni 
-----Original Message----- 
From: Steven Dake [mailto:sdake@mvista.com <mailto:sdake@mvista.com> ] 
Sent: Wednesday, February 16, 2005 1:02 PM 
To: Bajpai, Muni [NGC:B670:EXCH] 
Cc: openais@lists.osdl.org; Smith, Kristen [NGC:B675:EXCH]; markh@osdl.org 
Subject: RE: [Openais] Checkpoint crash in aisexec 

Muni 

I responded inline.  If you tackle this problem, I'd suggest breaking it up
into a few patches to work on separately, i.e. first the configuration change
work required to get the ring id through the config change system, and then,
as a separate patch, the synchronization code.

Thanks 
-steve 

On Wed, 2005-02-16 at 09:38, Muni Bajpai wrote: 
> Thanks for the quick responses last evening. My Response/Queries are 
> inline prepended by a ------------------- 
> 
> Muni 
> 
> -----Original Message----- 
> From: Steven Dake [mailto:sdake@mvista.com <mailto:sdake@mvista.com> ] 
> Sent: Tuesday, February 15, 2005 6:20 PM 
> To: Bajpai, Muni [NGC:B670:EXCH]; openais@lists.osdl.org 
> Cc: Smith, Kristen [NGC:B675:EXCH]; markh@osdl.org 
> Subject: RE: [Openais] Checkpoint crash in aisexec 
> 
> 
> Muni 
> I hope you don't mind me copying the openais mailing list so others can
> share in our exchanges.
> 
> Thanks for taking a look at this 
> 
> Responses inline 
> 
> On Tue, 2005-02-15 at 14:54, Muni Bajpai wrote: 
> > Hey Steve, 
> > 
> > I work with Kristen and need some more info on the checkpoint
> > recovery ...
> >
> > 1.) So the logic for accepting a configuration change from a
> > processor is:
> >         if ((incoming_ring_id == last_known_ring_id)
> >                 && (source_processor != delivering_processor)) {
> >
> >                 //IGNORE Change.
> >         }
> >
> >         So as per my understanding:
> >         1.) (Ckpt Executive Perspective) If the change is from ME
> >         then always change
> 
> maybe I was wrong with what I said before.  Try this logic out: 
> 
> If the sync message is from your previous configuration, then the 
> reference counts should not be updated because they would double the 
> reference counts incorrectly. 
> 
> ------------- So you mean don't care about the source/dest of the sync 
> message for decision making of accepting/ignoring config_chg, just use 
> the ring_id ? 
> 

It's not the decision to accept the config change callback; it's the decision
to accept the synchronization message.  You should always accept the
configuration change callback.  But in some cases, the sync message should
be ignored.

A member of the synchronization message should be "previous_ring_id", which
is the ring identifier of the ring previous to the one that is currently
undergoing recovery.  Keep in mind that it should be the last regular
configuration, not the transitional configuration.

The previous ring id is sufficient to determine if the refcount increase
request would result in an invalid increase.

If the previous_ring_id in the message matches the receiving processor's own
previous ring id, then the processor is already aware of the synchronization
contents and should ignore the request.  If they don't match, then the
processor is unaware of the synchronization contents and should accept the
request.
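
The accept/ignore rule above could be sketched roughly like this. The struct is a simplified stand-in for the real ring id (its field layout is an assumption), and the function names are made up for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-in for struct memb_ring_id; the field layout is assumed. */
struct ring_id {
	uint32_t rep;   /* address of the ring representative */
	uint64_t seq;   /* ring sequence number */
};

static bool ring_id_equal(const struct ring_id *a, const struct ring_id *b)
{
	return a->rep == b->rep && a->seq == b->seq;
}

/*
 * Decide whether a delivered sync message should be applied.
 *   msg_previous_ring_id: previous_ring_id carried in the message
 *   my_previous_ring_id:  last regular ring id saved by this processor
 *                         (all zeroes if it has never been in a ring)
 */
static bool ckpt_sync_should_accept(const struct ring_id *msg_previous_ring_id,
				    const struct ring_id *my_previous_ring_id)
{
	/* Same previous ring: we already hold these counts, so skip them. */
	return !ring_id_equal(msg_previous_ring_id, my_previous_ring_id);
}
```

A processor that has just started keeps its saved ring id at zero, so any sync message from an established ring compares unequal and is accepted.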

> ? 
> 
> ------------- Is it possible to get sync's from 2 different processors 
> with the same ring_id ?? 
> 

No, this is not possible.

The reason is that when deciding whether to send the sync message, the old
ring id's representative is checked against the local ip.  If they match, then
the sync message is sent (because this processor is the representative).  If
they don't match, no sync message is sent (because the representative will
take care of requesting the synchronization message).
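
As a small sketch of that check, assuming the ring id carries the representative's address (the struct layout and both function names are hypothetical):

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-in for the ring id; the field layout is assumed. */
struct ring_id {
	uint32_t rep;   /* address of the ring representative */
	uint64_t seq;   /* ring sequence number */
};

/* Only the representative of the old ring originates the sync message. */
static bool ckpt_should_send_sync(const struct ring_id *old_ring_id,
				  uint32_t local_addr)
{
	return old_ring_id->rep == local_addr;
}

/* Count how many members of the old ring would originate a sync message. */
static int ckpt_count_sync_senders(const struct ring_id *old_ring_id,
				   const uint32_t *member_addrs,
				   int member_count)
{
	int senders = 0;
	int i;

	for (i = 0; i < member_count; i++) {
		if (ckpt_should_send_sync(old_ring_id, member_addrs[i]))
			senders++;
	}
	return senders;
}
```

Because every member of the old ring evaluates the same comparison against the same ring id, exactly one of them sends, provided member addresses are unique.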

> The sync message is originated from the representative processor 
> containing the ring id prior to the transitional configuration change. 
> 
> When the message is delivered, it is compared to the ring id prior to 
> the transitional configuration.  If these two match, then the message 
> should be ignored because it's a sync message from a processor within
> the prior configuration. 
> 
> >         2.) if the ring_id's don't match then always change. 
> > 
> 
> Yes if the ring id in the delivered sync message doesn't match the 
> previous ring id, then add the reference count information for that 
> processor to the checkpoint. 
> 
> >         Please confirm. 
> > 
> > 2.) We must add support for the new data structure additions in the 
> > Ckpt Executive Opens and Close handlers also. 
> > 
> 
> no data structures are required in the handler prototypes.  I think we 
> need a new message vs open and close.  The message should be something 
> like "synchronizecounts".  I don't want to overload open and close too
> much with extra meaning.  We could use this synchronizecounts for some 
> other purpose later, like exchanging metadata too. 
> 
> ------------ So the ckpt_refcount[MAX_MEMBERS] array is modified on
> the receipt of sync messages, open and close??
> 

Yes, ckpt_refcount is modified on open, close, and in some cases on sync
given the logic above.
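
A minimal sketch of how such a fixed-size table could be maintained without any memory allocation. The slot convention (count == 0 means free), the helper names, the use of a plain uint32_t instead of struct in_addr, and the value of PROCESSOR_COUNT_MAX are all assumptions for illustration.

```c
#include <stddef.h>
#include <stdint.h>

#define PROCESSOR_COUNT_MAX 16 /* illustrative value */

struct ckpt_refcount {
	int count;
	uint32_t addr; /* stand-in for struct in_addr */
};

/* Per-checkpoint table; a slot with count == 0 is treated as free. */
struct ckpt_refs {
	struct ckpt_refcount slot[PROCESSOR_COUNT_MAX];
};

/* Find the slot for addr; optionally claim a free slot if none exists. */
static struct ckpt_refcount *refs_find(struct ckpt_refs *refs,
				       uint32_t addr, int create)
{
	struct ckpt_refcount *free_slot = NULL;
	int i;

	for (i = 0; i < PROCESSOR_COUNT_MAX; i++) {
		if (refs->slot[i].count > 0 && refs->slot[i].addr == addr)
			return &refs->slot[i];
		if (refs->slot[i].count == 0 && free_slot == NULL)
			free_slot = &refs->slot[i];
	}
	if (create && free_slot != NULL) {
		free_slot->addr = addr;
		return free_slot;
	}
	return NULL;
}

/* Open: one more reference held by processor addr. */
static int refs_open(struct ckpt_refs *refs, uint32_t addr)
{
	struct ckpt_refcount *slot = refs_find(refs, addr, 1);

	if (slot == NULL)
		return -1; /* table full */
	slot->count++;
	return 0;
}

/* Close: one fewer reference held by processor addr. */
static int refs_close(struct ckpt_refs *refs, uint32_t addr)
{
	struct ckpt_refcount *slot = refs_find(refs, addr, 0);

	if (slot == NULL)
		return -1; /* nothing open for this processor */
	slot->count--;
	return 0;
}

/* Accepted sync message: record the count reported for a processor. */
static int refs_sync(struct ckpt_refs *refs, uint32_t addr, int count)
{
	struct ckpt_refcount *slot;

	if (count <= 0)
		return -1;
	slot = refs_find(refs, addr, 1);
	if (slot == NULL)
		return -1;
	slot->count = count;
	return 0;
}

/* Processor left the configuration: drop all of its references. */
static void refs_leave(struct ckpt_refs *refs, uint32_t addr)
{
	struct ckpt_refcount *slot = refs_find(refs, addr, 0);

	if (slot != NULL)
		slot->count = 0;
}

/* Checkpoint-wide reference count, the value used for retention. */
static int refs_total(const struct ckpt_refs *refs)
{
	int total = 0;
	int i;

	for (i = 0; i < PROCESSOR_COUNT_MAX; i++)
		total += refs->slot[i].count;
	return total;
}
```

With the example from earlier in the thread (A open twice, B three times, C four times), the total is 9; after B leaves it is 6, and after D's sync message with one open it is 7.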

> > 3.) For the addition to the checkpoint data structure that you
> > enumerated, did you have any implementation preferences, or did you
> > want us to use anything appropriate? (Cursorily, I was thinking of a
> > list of struct refs.)
>
> Hmm, I have an affinity towards avoiding any sort of memory allocation
> if at all possible (because allocations can fail, and this can cause us
> major troubles).  Maybe something like
>
> struct ckpt_refcount {
>         int count;
>         struct in_addr addr;
> };
>
> Then something like adding to saCkptCheckpoint
>
> struct ckpt_refcount ckpt_refcount[MAX_MEMBERS];
>
> MAX_MEMBERS should probably be brought out from totemsrp.c into
> totemsrp.h and changed from MAX_MEMBERS to PROCESSOR_COUNT_MAX.
> 
> > 
> > 4.) The last_known_ring_id. What does that mean to a newly added
> > processor? Explicitly, ( incoming_ring_id == last_known_ring_id ) will
> > always fail on a newly commissioned processor. Am I understanding that
> > correctly ?
> > 
> 
> No, not the incoming ring id.  Instead it is the processor's last ring id
> in the originated synchronization message.
>
> last_known_ring_id should be initialized to zero.  On a newly added
> processor, the sync message will have some value and last_known_ring_id
> will be zero.
> 
> This will force the synchronization message to be accepted which is 
> desired behavior. 
> 
> > Where is the last_known_ring_id stored ? 
> > 
> 
> it must be stored when a configuration change is delivered to the 
> ckpt_confchg_fn. 
> 
> > 5.) Is exec/evt.c the best example for any ideas on implementation ??
> > 
> 
> I don't think evt uses reference counting to track channels, but it is 
> necessary for checkpoints because of checkpoint retention.  I'd rather 
> try to invent a few different approaches here so we can unify them 
> later once we have discovered the best design. 
> 
> Synchronization after a merge or partition is the hardest part of a 
> distributed system and I hope we can find a few approaches to test 
> out. 
> 
> > 
> > Thanks 
> > 
> > Muni 
> > 
> > -----Original Message----- 
> > From: Steven Dake [mailto:sdake@mvista.com <mailto:sdake@mvista.com> ] 
> > Sent: Tuesday, February 15, 2005 1:51 PM 
> > To: Smith, Kristen [NGC:B675:EXCH] 
> > Cc: markh@osdl.org; openais@lists.osdl.org; Bajpai, Muni 
> > [NGC:B670:EXCH] 
> > Subject: RE: [Openais] Checkpoint crash in aisexec 
> > 
> > 
> > On Tue, 2005-02-15 at 09:47, Kristen Smith wrote: 
> > > Steve, 
> > > 
> > > Thanks for the response - I hear ya loud and clear - not good
> > > without recovery. So, is there something that we could do to help
> > > you with this recovery coding? If you had some type of design
> > > thoughts on how you wanted checkpoint recovery to occur, maybe that
> > > is something we could help out with. Just throwing this out there to
> > > see what you think.
> > > 
> > 
> > Kristen 
> > You have done a lot to help us so far but more help is always
> > appreciated :)
> > 
> > If someone from your org wanted to get started writing code for 
> > checkpoint recovery that would be great!  I spent some time in the 
> > drive to work this morning thinking about how checkpoint recovery 
> > should work: 
> > 
> > There are 3 main steps that should be done in order:
> > 1. synchronize checkpoint reference counts (so retention timers work
> > properly)
> > 2. synchronize checkpoint metadata contents (sizes, sections, etc)
> > 3. synchronize checkpoint section data contents
> > 
> > The place to get started is on the reference count synchronization. 
> > 
> > The checkpoint must contain a list of active users' processor ids
> > along with their reference count.  So if processor A has checkpoint 1
> > open twice, processor B has checkpoint 1 open three times, and
> > processor C has checkpoint 1 open four times, each processor would
> > maintain a list for the checkpoint (in the checkpoint data structure):
> > 
> > p_A:r_2 
> > p_B:r_3 
> > p_C:r_4 
> > 
> > Then on a configuration change, the leaving processors would close
> > their reference counts.  So in this example, if p_B leaves, then the
> > processor ref count looks like: p_A:r_2 p_C:r_4
> >
> > During this configuration change, a processor p_D joins.  It has
> > checkpoint 1 open 1 time.  p_D gets a configuration change {add p_A,
> > p_C} and then sends a synchronization message with its previous ring
> > identifier and current list of checkpoint reference counts (after the
> > above leave in the configuration change was processed).  The
> > representative of {p_A, p_C} also sends a synchronization message with
> > the previous ring identifier and a current list of checkpoint
> > reference counts.  If the previous ring identifiers match and the
> > sending processor is not the delivering processor then p_C should
> > ignore p_A's message (ie: p_C receives p_A's message, but it already
> > knows about p_A's references).
> > 
> > This requires us to add the ring identifier to the configuration 
> > change. 
> > 
> > So now each previous configuration is aware of the new configuration.
> > The reference counts look like:
> > p_A:r_2 
> > p_C:r_4 
> > p_D:r_1 
> > 
> > The above maintenance of the reference counts, or open checkpoints,
> > must maintain a per-checkpoint variable which is the "reference count
> > for this checkpoint".  In the last case, that reference count would be
> > 7.
> >
> > Each time a processor leaves, its reference counts are subtracted from
> > this "global ref count".  Each time a processor is added, its
> > reference counts are added.  This reference count is then what is used
> > for retention duration.
> > 
> > Any thoughts on the above approach welcome. 
> > 
> > Thanks! 
> > -steve 
> > 
> > > Thanks, 
> > > Kristen 
> > > 
> > > -----Original Message----- 
> > > From: Steven Dake [mailto:sdake@mvista.com <mailto:sdake@mvista.com> ]
> > > Sent: Monday, February 14, 2005 2:17 PM 
> > > To: Smith, Kristen [NGC:B675:EXCH]; markh@osdl.org; 
> > > openais@lists.osdl.org 
> > > Cc: Bajpai, Muni [NGC:B670:EXCH] 
> > > Subject: RE: [Openais] Checkpoint crash in aisexec 
> > > 
> > > 
> > > On Sat, 2005-02-12 at 08:08, Kristen Smith wrote: 
> > > > Steve, 
> > > > 
> > > > Thanks for the response. 
> > > > 
> > > > For recovery - what are the ramifications if we don't have
> > > > recovery working 100%? What I see now is that when a node leaves
> > > > the cluster and then rejoins, it receives evt messages, but it can
> > > > take anywhere from 15 seconds to minutes for evt messages sent from
> > > > that node to reach the other applications. I handle this with some
> > > 
> > > Mark have you seen this issue? 
> > > 
> > > > message retries which is ok in this startup case. However, are we
> > > > in jeopardy in other cases that I am not considering? When running
> > > > traffic the past few days and seeing periodic reconfigs, I don't
> > > > seem to be losing messages when that occurs - I only see the lost
> > > > messages when I actually kill a node and start it back up to rejoin
> > > > the cluster.
> > > > 
> > > 
> > > What we have today is totally unacceptable because at least for
> > > checkpointing, there is no recovery.  And Mark is waiting on my base
> > > code for event recovery.
> > >
> > > Definition of 100% working means if there is a failure during
> > > recovery, we are guaranteed a consistent state.  I think evt is
> > > pretty close to this goal, although the checkpoint replication after
> > > merge has not been developed yet.  I can think of a lot of easy ways
> > > to do this, but handling a failure during the recovery phase makes it
> > > more difficult.
> > >
> > > Definition of almost 100% is that recovery works properly if there
> > > are no faults during recovery (ie: the merge process), but if there
> > > is a fault during recovery (ie: reconfig) something could go awry.
> > >
> > > We want consistently replicated data (the 100% case).  100% is
> > > probably past your development window; the other case is within
> > > reach.
> > > 
> > > Regards 
> > > -steve 
> > > 
> > > > Thanks 
> > > > Kristen 
> > > > 
> > > > -----Original Message----- 
> > > > From: Steven Dake [mailto:sdake@mvista.com <mailto:sdake@mvista.com> ]
> > > > Sent: Friday, February 11, 2005 5:30 PM 
> > > > To: Smith, Kristen [NGC:B675:EXCH] 
> > > > Subject: RE: [Openais] Checkpoint crash in aisexec 
> > > > 
> > > > 
> > > > Ok well I doubt with 200 byte checkpoints there is a buffer
> > > > overflow. :)
> > > >
> > > > Recovery will come after 188 is wrapped up.  I think your two
> > > > weeks window looks good for alpha-level recovery (ie: works most
> > > > of the time).  High quality production recovery will not hit your
> > > > window for development (ie: works 100% of the time no matter what
> > > > happens).
> > > > 
> > > > Thanks 
> > > > -steve 
> > > > 
> > > > On Fri, 2005-02-11 at 15:56, Kristen Smith wrote: 
> > > > > Steve, 
> > > > > 
> > > > > The size of the checkpoints is ~200 bytes.
> > > > >
> > > > > I agree, valgrind is an excellent tool. We will run it through
> > > > > and see if that shows anything.
> > > > >
> > > > > I have tried this scenario maybe 30 times today (for various
> > > > > other testing) and it happened maybe 10 times. For a while I
> > > > > could reproduce with a given test about 5 times and then it
> > > > > hasn't happened again.
> > > > >
> > > > > Sounds like defect-188 fixing is going well. May I ask how the
> > > > > recovery work is going as well? (Don't mean to be pushy on that
> > > > > front - we have 2 more weeks of coding for our application left
> > > > > and I am really hoping that we are able to put the new recovery
> > > > > code in during that time).
> > > > > 
> > > > > Thanks a bunch, 
> > > > > Kristen 
> > > > > 
> > > > > -----Original Message----- 
> > > > > From: Steven Dake [mailto:sdake@mvista.com <mailto:sdake@mvista.com> ]
> > > > > Sent: Friday, February 11, 2005 4:37 PM 
> > > > > To: Smith, Kristen [NGC:B675:EXCH] 
> > > > > Subject: Re: [Openais] Checkpoint crash in aisexec 
> > > > > 
> > > > > 
> > > > > How large are the read or write requests?
> > > > > Just a thought: there could be some buffer overrun with larger
> > > > > requests.
> > > > > 
> > > > > On Fri, 2005-02-11 at 14:55, Kristen Smith wrote: 
> > > > > > Steve, 
> > > > > > 
> > > > > > We are periodically seeing aisexec crash with the following
> > > > > > trace:
> > > > > >
> > > > > >         (gdb) bt
> > > > > >         #0  message_handler_req_lib_ckpt_checkpointclose
> > > > > >             (conn_info=0x0, message=0xb73fc008) at ckpt.c:1552
> > > > > >         #1  0x080494c2 in poll_handler_libais_deliver (handle=0,
> > > > > >             fd=3, revent=134633824, data=0x89c2ad8,
> > > > > >             prio=0x89b2784) at main.c:578
> > > > > >         #2  0x08056e62 in poll_run (handle=0) at aispoll.c:386
> > > > > >         #3  0x080499ac in main (argc=1, argv=0xbfffcb64) at
> > > > > >             main.c:1003
> > > > > >
> > > > > > We have looked through the code but can't seem to figure out
> > > > > > how conn_info is getting set to 0. Do you have any idea under
> > > > > > what circumstances conn_info could be null when this function
> > > > > > is called?
> > > > > >
> > > > > > This is happening when we have multiple nodes up and we kill
> > > > > > one of the active nodes. The standby node (which was reading
> > > > > > checkpoints) must now become a writer, so it closes the
> > > > > > checkpoint and this happens. Unfortunately, I can't reproduce
> > > > > > this consistently - I finally got a core dump today. I don't
> > > > > > recall ever seeing this with the old code.
> > > > > > 
> > > > > > Thanks, 
> > > > > > Kristen 
> > > > > > 
> > > > > >
> > > > > > _______________________________________________
> > > > > > Openais mailing list
> > > > > > Openais@lists.osdl.org
> > > > > > http://lists.osdl.org/mailman/listinfo/openais

[Attachment #5 (text/html)]

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML><HEAD>
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=us-ascii">
<TITLE>Message</TITLE>

<META content="MSHTML 6.00.2800.1491" name=GENERATOR></HEAD>
<BODY>
<DIV><SPAN class=086332418-17022005><FONT face=Arial size=2>Hey 
Steven,</FONT></SPAN></DIV>
<DIV><SPAN class=086332418-17022005><FONT face=Arial 
size=2></FONT></SPAN>&nbsp;</DIV>
<DIV><SPAN class=086332418-17022005><FONT face=Arial size=2>So onto phase II. 
</FONT></SPAN></DIV>
<DIV><SPAN class=086332418-17022005><FONT face=Arial 
size=2></FONT></SPAN>&nbsp;</DIV>
<DIV><SPAN class=086332418-17022005><FONT color=#0000ff><FONT face=Arial 
size=2><FONT color=#000000>Do you have any preferences to the new</FONT> <FONT 
color=#000000>(</FONT></FONT><B><FONT color=#7f0055><FONT face=Arial 
size=2>struct</FONT></B></FONT><FONT face=Arial color=#000000 size=2> 
req_exec_ckpt_checkpoint<SPAN class=086332418-17022005>synchronize). I know you 
did mention having the previous regular configuration ring_id in that message 
but what else ??</SPAN></FONT></FONT></SPAN></DIV>
<DIV><SPAN class=086332418-17022005><FONT color=#0000ff><FONT><SPAN 
class=086332418-17022005><FONT face=Arial color=#000000 size=2>I know we have to 
send all the </FONT><FONT face=Arial><FONT color=#000000><FONT 
size=2>saCkptCheckpoin<SPAN class=086332418-17022005>t stored in the list that 
<FONT size=2>checkpointListHead points to, or we could send out multiple synch 
for each checkpoint. I prefer sending one message. But we 
</FONT></SPAN></FONT></FONT></FONT></SPAN></FONT></FONT></SPAN></DIV>
<DIV><SPAN class=086332418-17022005><FONT color=#0000ff><FONT><SPAN 
class=086332418-17022005><FONT face=Arial><FONT color=#000000><FONT size=2><SPAN 
class=086332418-17022005>have to decide on the type of the aggregated 
data.</SPAN></FONT></FONT></FONT></SPAN></FONT></FONT></SPAN></DIV>
<DIV><SPAN class=086332418-17022005><FONT color=#0000ff><FONT><SPAN 
class=086332418-17022005><FONT face=Arial><FONT color=#000000><FONT size=2><SPAN 
class=086332418-17022005></SPAN></FONT></FONT></FONT></SPAN></FONT></FONT></SPAN>&nbsp;</DIV>
<DIV><SPAN class=086332418-17022005><FONT color=#0000ff><FONT><SPAN 
class=086332418-17022005><FONT face=Arial><FONT color=#000000><FONT size=2><SPAN 
class=086332418-17022005>Also the 
standard</SPAN></FONT></FONT></FONT></SPAN></FONT></FONT></SPAN></DIV>
<DIV><SPAN class=086332418-17022005><FONT color=#0000ff><FONT><SPAN 
class=086332418-17022005><FONT face=Arial><FONT color=#000000><FONT size=2><SPAN 
class=086332418-17022005></SPAN></FONT></FONT></FONT></SPAN></FONT></FONT></SPAN><SPAN 
class=086332418-17022005><FONT color=#0000ff><FONT><SPAN 
class=086332418-17022005><FONT face=Arial><FONT color=#000000><FONT size=2><SPAN 
class=086332418-17022005><B><FONT color=#7f0055 size=2>struct</B></FONT><FONT 
size=2> req_header 
header;</FONT></SPAN></FONT></FONT></FONT></SPAN></FONT></FONT></SPAN></DIV>
<DIV><SPAN class=086332418-17022005><FONT color=#0000ff><FONT><SPAN 
class=086332418-17022005><FONT face=Arial><FONT color=#000000><FONT size=2><SPAN 
class=086332418-17022005><B><FONT color=#7f0055 size=2>struct</B></FONT><FONT 
size=2> message_source 
source;</FONT></SPAN></FONT></FONT></FONT></SPAN></FONT></FONT></SPAN></DIV>
<DIV><SPAN class=086332418-17022005><FONT color=#0000ff><FONT><SPAN 
class=086332418-17022005><FONT face=Arial><FONT color=#000000><FONT size=2><SPAN 
class=086332418-17022005><FONT 
size=2></FONT></SPAN></FONT></FONT></FONT></SPAN></FONT></FONT></SPAN>&nbsp;</DIV>
<DIV><SPAN class=086332418-17022005><FONT color=#0000ff><FONT><SPAN 
class=086332418-17022005><FONT face=Arial><FONT color=#000000><FONT size=2><SPAN 
class=086332418-17022005><FONT size=2><FONT color=#0000ff><FONT 
color=#000000>should be a part of the new struct 
too</FONT>.</FONT></DIV></FONT></SPAN></FONT></FONT></FONT></SPAN></FONT></FONT></SPAN>
<DIV><SPAN class=086332418-17022005><FONT color=#0000ff><FONT><SPAN 
class=086332418-17022005><FONT face=Arial><FONT color=#000000><FONT size=2><SPAN 
class=086332418-17022005></SPAN></FONT></FONT></FONT></SPAN></FONT></FONT></SPAN>&nbsp;</DIV>
<DIV><SPAN class=086332418-17022005><FONT color=#0000ff><FONT><SPAN 
class=086332418-17022005><FONT face=Arial><FONT color=#000000><FONT size=2><SPAN 
class=086332418-17022005>I cant think of anything 
else.</SPAN></FONT></FONT></FONT></SPAN></FONT></FONT></SPAN></DIV>
<DIV><SPAN class=086332418-17022005><FONT color=#0000ff><FONT><SPAN 
class=086332418-17022005><FONT face=Arial><FONT color=#000000><FONT size=2><SPAN 
class=086332418-17022005></SPAN></FONT></FONT></FONT></SPAN></FONT></FONT></SPAN>&nbsp;</DIV>
<DIV><SPAN class=086332418-17022005><FONT color=#0000ff><FONT><SPAN 
class=086332418-17022005><FONT face=Arial><FONT color=#000000><FONT size=2><SPAN 
class=086332418-17022005>Please let me 
know,</SPAN></FONT></FONT></FONT></SPAN></FONT></FONT></SPAN></DIV>
<DIV><SPAN class=086332418-17022005><FONT color=#0000ff><FONT><SPAN 
class=086332418-17022005><FONT face=Arial><FONT color=#000000><FONT size=2><SPAN 
class=086332418-17022005></SPAN></FONT></FONT></FONT></SPAN></FONT></FONT></SPAN>&nbsp;</DIV>
<DIV><SPAN class=086332418-17022005><FONT color=#0000ff><FONT><SPAN 
class=086332418-17022005><FONT face=Arial><FONT color=#000000><FONT size=2><SPAN 
class=086332418-17022005>Thanks</SPAN></FONT></FONT></FONT></SPAN></FONT></FONT></SPAN></DIV>
<DIV><SPAN class=086332418-17022005><FONT color=#0000ff><FONT><SPAN 
class=086332418-17022005><FONT face=Arial><FONT color=#000000><FONT size=2><SPAN 
class=086332418-17022005></SPAN></FONT></FONT></FONT></SPAN></FONT></FONT></SPAN>&nbsp;</DIV>
<DIV><SPAN class=086332418-17022005><FONT color=#0000ff><FONT><SPAN 
class=086332418-17022005><FONT face=Arial><FONT color=#000000><FONT size=2><SPAN 
class=086332418-17022005>Muni</SPAN></FONT></FONT></FONT></SPAN></FONT></FONT></SPAN></DIV>
<DIV><SPAN class=086332418-17022005><FONT color=#0000ff><FONT face=Arial 
color=#000000 size=2><SPAN 
class=086332418-17022005></SPAN></FONT></FONT></SPAN>&nbsp;</DIV>
<DIV><SPAN class=086332418-17022005><FONT color=#0000ff><FONT face=Arial 
color=#000000 size=2><SPAN 
class=086332418-17022005></SPAN></FONT></FONT></SPAN>&nbsp;</DIV>
<BLOCKQUOTE DEFANGED_style="MARGIN-RIGHT: 0px">
  <DIV></DIV>
  <DIV class=OutlookMessageHeader lang=en-us dir=ltr align=left><FONT 
  face=Tahoma size=2>-----Original Message-----<BR><B>From:</B> 
  openais-bounces@lists.osdl.org [mailto:openais-bounces@lists.osdl.org] <B>On 
  Behalf Of </B>Bajpai, Muni [NGC:B670:EXCH]<BR><B>Sent:</B> Wednesday, February 
  16, 2005 1:31 PM<BR><B>To:</B> 'sdake@mvista.com'<BR><B>Cc:</B> 
  openais@lists.osdl.org; markh@osdl.org; Smith, Kristen 
  [NGC:B675:EXCH]<BR><B>Subject:</B> RE: [Openais] Checkpoint crash in 
  aisexec<BR><BR></FONT></DIV>
  <P><FONT size=2>Ok steve,</FONT> </P>
  <P><FONT size=2>Thanks for the feedback. This is my take on the steps.</FONT> 
  </P>
  <P><FONT size=2>I.) First Patch</FONT> 
  <BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <FONT size=2>1.) Move struct 
  memb_ring_id from totemsrp.c to totemsrp.h</FONT> 
  <BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <FONT size=2>2.) Move #define 
  MAX_MEMBERS from totemsrp.c to totemsrp.h, change the name of the definition 
  to PROCESSOR_COUNT_MAX.</FONT></P>
  <P>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <FONT size=2>3.) Make changes to 
  handlers.h, amf.c, ckpt.c, clm.c, evs.c, totemsrp.c, totempg.c</FONT> </P>
  <P><FONT size=2>II.) Second Patch</FONT> 
  <BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <FONT size=2>Add support for 
  sync on the ckpt service.</FONT> </P>
  <P><FONT size=2>Thanks</FONT> </P>
  <P><FONT size=2>Muni</FONT> <BR><FONT size=2>-----Original Message-----</FONT> 
  <BR><FONT size=2>From: Steven Dake [<A 
  href="mailto:sdake@mvista.com">mailto:sdake@mvista.com</A>] </FONT><BR><FONT 
  size=2>Sent: Wednesday, February 16, 2005 1:02 PM</FONT> <BR><FONT size=2>To: 
  Bajpai, Muni [NGC:B670:EXCH]</FONT> <BR><FONT size=2>Cc: 
  openais@lists.osdl.org; Smith, Kristen [NGC:B675:EXCH]; markh@osdl.org</FONT> 
  <BR><FONT size=2>Subject: RE: [Openais] Checkpoint crash in aisexec</FONT> 
  </P><BR>
  <P><FONT size=2>Muni</FONT> </P>
  <P><FONT size=2>I responded inline.&nbsp; I'd suggest if you tackle this 
  problem to try to break it up into a few patches to work on seperately.&nbsp; 
  Ie: the configuration change changes required to get the ring id through the 
  config change system, and then as a seperate patch the syncronization 
  code.</FONT></P>
  <P><FONT size=2>Thanks</FONT> <BR><FONT size=2>-steve</FONT> </P>
  <P><FONT size=2>On Wed, 2005-02-16 at 09:38, Muni Bajpai wrote:</FONT> 
  <BR><FONT size=2>&gt; Thanks for the quick responses last evening. My 
  Response/Queries are </FONT><BR><FONT size=2>&gt; inline prepended by a 
  -------------------</FONT> <BR><FONT size=2>&gt; </FONT><BR><FONT size=2>&gt; 
  Muni</FONT> <BR><FONT size=2>&gt; </FONT><BR><FONT size=2>&gt; -----Original 
  Message-----</FONT> <BR><FONT size=2>&gt; From: Steven Dake [<A 
  href="mailto:sdake@mvista.com">mailto:sdake@mvista.com</A>]</FONT> <BR><FONT 
  size=2>&gt; Sent: Tuesday, February 15, 2005 6:20 PM</FONT> <BR><FONT 
  size=2>&gt; To: Bajpai, Muni [NGC:B670:EXCH]; openais@lists.osdl.org</FONT> 
  <BR><FONT size=2>&gt; Cc: Smith, Kristen [NGC:B675:EXCH]; 
  markh@osdl.org</FONT> <BR><FONT size=2>&gt; Subject: RE: [Openais] Checkpoint 
  crash in aisexec</FONT> <BR><FONT size=2>&gt; </FONT><BR><FONT size=2>&gt; 
  </FONT><BR><FONT size=2>&gt; Muni</FONT> <BR><FONT size=2>&gt; I hope you dont 
  mind me copying the openais mailing list so others can </FONT><BR><FONT 
  size=2>&gt; share in our exchanges.</FONT> <BR><FONT size=2>&gt; 
  </FONT><BR><FONT size=2>&gt; Thanks for taking a look at this</FONT> <BR><FONT 
  size=2>&gt; </FONT><BR><FONT size=2>&gt; Responses inline</FONT> <BR><FONT 
  size=2>&gt; </FONT><BR><FONT size=2>&gt; On Tue, 2005-02-15 at 14:54, Muni 
  Bajpai wrote:</FONT> <BR><FONT size=2>&gt; &gt; Hey Steve,</FONT> <BR><FONT 
  size=2>&gt; &gt; </FONT><BR><FONT size=2>&gt; &gt; I work with kristen and 
  need some more info on the checkpoint</FONT> <BR><FONT size=2>&gt; 
  recovery</FONT> <BR><FONT size=2>&gt; &gt; ...</FONT> <BR><FONT size=2>&gt; 
  &gt; </FONT><BR><FONT size=2>&gt; &gt; 1.) So the logic for accepting a 
  configuration change from a</FONT> <BR><FONT size=2>&gt; processor</FONT> 
  <BR><FONT size=2>&gt; &gt; is :</FONT> <BR><FONT size=2>&gt; 
  &gt;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; if ((incoming_ring_id == 
  last_known_ring_id) </FONT><BR><FONT size=2>&gt; 
  &gt;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 
  &amp;&amp; (source_processor != delivering_processor) {</FONT> <BR><FONT 
  size=2>&gt; &gt; </FONT><BR><FONT size=2>&gt; 
  &gt;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 
  //IGNORE Change.</FONT> <BR><FONT size=2>&gt; 
  &gt;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; }</FONT> <BR><FONT 
  size=2>&gt; &gt; </FONT><BR><FONT size=2>&gt; 
  &gt;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; So as per my 
  understanding:</FONT> <BR><FONT size=2>&gt; 
  &gt;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1.) (Ckpt Executive 
  Perspective) If the change is from ME</FONT> <BR><FONT size=2>&gt; then</FONT> 
  <BR><FONT size=2>&gt; &gt; always change</FONT> <BR><FONT size=2>&gt; 
  </FONT><BR><FONT size=2>&gt; maybe I was wrong with what I said before.&nbsp; 
  Try this logic out:</FONT> <BR><FONT size=2>&gt; </FONT><BR><FONT size=2>&gt; 
  If the sync message is from your previous configuration, then the 
  </FONT><BR><FONT size=2>&gt; reference counts should not be updated because 
  they would double the </FONT><BR><FONT size=2>&gt; reference counts 
  incorrectly.</FONT> <BR><FONT size=2>&gt; </FONT><BR><FONT size=2>&gt; 
  ------------- So you mean don't care about the source/dest of the sync 
  </FONT><BR><FONT size=2>&gt; message for decision making of accepting/ignoring 
  config_chg, just use </FONT><BR><FONT size=2>&gt; the ring_id ?</FONT> 
  <BR><FONT size=2>&gt; </FONT></P>
  <P><FONT size=2>Its not the decision to accept the config change callback, its 
  the decision to accept the syncronization message.&nbsp; You should always 
  accept the configuration change callback.&nbsp; But in some cases, the sync 
  message should be ignored.</FONT></P>
  <P><FONT size=2>The synchronization message should include a member, 
  "previous_ring_id", which is the ring identifier of the ring previous to the 
  one that is currently undergoing recovery.&nbsp; Keep in mind that it should 
  identify the last regular configuration, not the transitional 
  configuration.</FONT></P>
  <P><FONT size=2>The previous ring id is sufficient to determine if the 
  refcount increase request would result in an invalid increase.&nbsp; 
  </FONT></P>
  <P><FONT size=2>If the previous ring id in the message matches the processor's 
  own previous ring id, then the processor is already aware of the 
  synchronization contents and should ignore the request.&nbsp; If they don't 
  match, then the processor is unaware of the synchronization contents and should 
  accept the request.</FONT></P>
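  <P><FONT size=2>A minimal C sketch of that acceptance rule, for discussion 
  only.&nbsp; The struct ring_id layout and the names ring_id_equal and 
  sync_should_accept are assumptions made for illustration; the real struct 
  memb_ring_id in totemsrp may look different.</FONT></P>

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the ring identifier kept by totemsrp. */
struct ring_id {
    unsigned int rep;        /* address of the ring representative (illustrative) */
    unsigned long long seq;  /* ring sequence number (illustrative) */
};

/* Two ring ids are the same ring only if every field matches. */
bool ring_id_equal(const struct ring_id *a, const struct ring_id *b)
{
    return a->rep == b->rep && a->seq == b->seq;
}

/*
 * Decide whether a delivered sync message should be applied.
 * msg_previous_ring_id:   the "previous_ring_id" carried in the message.
 * local_previous_ring_id: the last regular ring id stored by this processor.
 * Matching ids mean the sender was in our previous configuration, so its
 * reference counts are already known and applying them would double count.
 */
bool sync_should_accept(const struct ring_id *msg_previous_ring_id,
                        const struct ring_id *local_previous_ring_id)
{
    assert(msg_previous_ring_id != NULL);
    assert(local_previous_ring_id != NULL);
    return !ring_id_equal(msg_previous_ring_id, local_previous_ring_id);
}
```

  <P><FONT size=2>With this shape, a newly started processor whose stored ring 
  id is still all zeroes accepts any sync message that carries a non-zero 
  previous ring id.</FONT></P>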
  <P><FONT size=2>&gt; ?</FONT> <BR><FONT size=2>&gt; </FONT><BR><FONT 
  size=2>&gt; ------------- Is it possible to get sync's from 2 different 
  processors </FONT><BR><FONT size=2>&gt; with the same ring_id ??</FONT> 
  <BR><FONT size=2>&gt; </FONT></P>
  <P><FONT size=2>No, this is not possible.&nbsp; </FONT></P>
  <P><FONT size=2>The reason is that when deciding whether to send the sync 
  message, the representative of the old ring id is compared against the local 
  IP address.&nbsp; If they match, then the sync message is sent (because this 
  processor is the representative).&nbsp; If they don't match, no sync message 
  is sent (because the representative will take care of requesting the 
  synchronization message).</FONT></P>
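  <P><FONT size=2>The sender-side check could be sketched roughly as follows.&nbsp; 
  Addresses are modelled as plain uint32_t values and the function names are 
  hypothetical; the sketch also assumes that member addresses within one ring 
  are unique, which is what makes a second sender for the same ring id 
  impossible.</FONT></P>

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * A processor originates the sync message only when it is the
 * representative of its previous (old) ring.
 */
bool should_send_sync(uint32_t old_ring_rep_addr, uint32_t local_addr)
{
    return old_ring_rep_addr == local_addr;
}

/*
 * Count how many members of one old ring would originate a sync message.
 * With unique member addresses this is 1 when the representative is still
 * a member and 0 otherwise, never more.
 */
size_t count_sync_senders(uint32_t old_ring_rep_addr,
                          const uint32_t *member_addrs, size_t member_count)
{
    size_t senders = 0;

    assert(member_addrs != NULL || member_count == 0);
    for (size_t i = 0; i < member_count; i++) {
        if (should_send_sync(old_ring_rep_addr, member_addrs[i])) {
            senders++;
        }
    }
    return senders;
}
```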
  <P><FONT size=2>&gt; The sync message is originated from the representative 
  processor </FONT><BR><FONT size=2>&gt; containing the ring id prior to the 
  transitional configuration change.</FONT> <BR><FONT size=2>&gt; 
  </FONT><BR><FONT size=2>&gt; When the message is delivered, it is compared to 
  the ring id prior to </FONT><BR><FONT size=2>&gt; the transitional 
  configuration.&nbsp; If these two match, then the message </FONT><BR><FONT 
  size=2>&gt; should be ignored because it's a sync message from a processor 
  within </FONT><BR><FONT size=2>&gt; the prior configuration.</FONT> <BR><FONT 
  size=2>&gt; </FONT><BR><FONT size=2>&gt; 
  &gt;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 2.) if the ring_id's 
  don't match then always change.</FONT> <BR><FONT size=2>&gt; &gt; 
  </FONT><BR><FONT size=2>&gt; </FONT><BR><FONT size=2>&gt; Yes if the ring id 
  in the delivered sync message doesn't match the </FONT><BR><FONT size=2>&gt; 
  previous ring id, then add the reference count information for that 
  </FONT><BR><FONT size=2>&gt; processor to the checkpoint.</FONT> <BR><FONT 
  size=2>&gt; </FONT><BR><FONT size=2>&gt; 
  &gt;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Please confirm.</FONT> 
  <BR><FONT size=2>&gt; &gt; </FONT><BR><FONT size=2>&gt; &gt; 2.) We must add 
  support for the new data structure additions in the</FONT> <BR><FONT 
  size=2>&gt; &gt; Ckpt Executive Opens and Close handlers also.</FONT> 
  <BR><FONT size=2>&gt; &gt; </FONT><BR><FONT size=2>&gt; </FONT><BR><FONT 
  size=2>&gt; no data structures are required in the handler prototypes.&nbsp; I 
  think we </FONT><BR><FONT size=2>&gt; need a new message vs open and 
  close.&nbsp; The message should be something </FONT><BR><FONT size=2>&gt; like 
  "synchronizecounts".&nbsp; I don't want to overload open and close too 
  </FONT><BR><FONT size=2>&gt; much with extra meaning.&nbsp; We could use this 
  synchronizecounts for some </FONT><BR><FONT size=2>&gt; other purpose later, 
  like exchanging metadata too.</FONT> <BR><FONT size=2>&gt; </FONT><BR><FONT 
  size=2>&gt; ------------ So the ckpt_refcount[MAX_MEMBERS] array is modified 
  on </FONT><BR><FONT size=2>&gt; the receipt of sync messages,open and 
  close??</FONT> <BR><FONT size=2>&gt; </FONT></P>
  <P><FONT size=2>Yes, ckpt_refcount is modified on open, on close, and in some 
  cases on sync, given the logic above.</FONT> </P>
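  <P><FONT size=2>One possible shape for that bookkeeping, as a sketch only.&nbsp; 
  It keeps a fixed-size table so no memory allocation is needed.&nbsp; The 
  struct checkpoint wrapper, the helper names, the value of PROCESSOR_COUNT_MAX, 
  and the use of uint32_t in place of struct in_addr are all assumptions for 
  illustration, not the actual ckpt.c code.</FONT></P>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PROCESSOR_COUNT_MAX 16  /* illustrative value only */

struct ckpt_refcount {
    int count;      /* opens held by this processor; 0 marks a free slot */
    uint32_t addr;  /* stands in for struct in_addr */
};

/* Hypothetical stand-in for the relevant part of saCkptCheckpoint. */
struct checkpoint {
    struct ckpt_refcount ckpt_refcount[PROCESSOR_COUNT_MAX];
};

void checkpoint_init(struct checkpoint *ckpt)
{
    assert(ckpt != NULL);
    memset(ckpt, 0, sizeof(*ckpt));
}

/* Find the slot for addr, else claim the first free slot; NULL if full. */
struct ckpt_refcount *refcount_slot(struct checkpoint *ckpt, uint32_t addr)
{
    struct ckpt_refcount *free_slot = NULL;

    for (size_t i = 0; i < PROCESSOR_COUNT_MAX; i++) {
        struct ckpt_refcount *slot = &ckpt->ckpt_refcount[i];
        if (slot->count > 0 && slot->addr == addr) {
            return slot;
        }
        if (slot->count == 0 && free_slot == NULL) {
            free_slot = slot;
        }
    }
    if (free_slot != NULL) {
        free_slot->addr = addr;
    }
    return free_slot;
}

/* open: delta = +1, close: delta = -1, accepted sync: delta = sender's count.
 * Returns 0 on success, -1 if the table is full or the count would go negative. */
int refcount_add(struct checkpoint *ckpt, uint32_t addr, int delta)
{
    struct ckpt_refcount *slot = refcount_slot(ckpt, addr);

    if (slot == NULL || slot->count + delta < 0) {
        return -1;
    }
    slot->count += delta;
    return 0;
}

/* A processor left the configuration: drop all of its references. */
void refcount_remove_processor(struct checkpoint *ckpt, uint32_t addr)
{
    for (size_t i = 0; i < PROCESSOR_COUNT_MAX; i++) {
        struct ckpt_refcount *slot = &ckpt->ckpt_refcount[i];
        if (slot->count > 0 && slot->addr == addr) {
            slot->count = 0;
        }
    }
}

/* Per-checkpoint total, the value that would drive retention duration. */
int refcount_total(const struct checkpoint *ckpt)
{
    int total = 0;

    for (size_t i = 0; i < PROCESSOR_COUNT_MAX; i++) {
        total += ckpt->ckpt_refcount[i].count;
    }
    return total;
}
```

  <P><FONT size=2>Replaying the example from earlier in the thread (A opens 
  twice, B three times, C four times, B leaves, D's sync adds one) leaves a 
  total of 7 in this sketch.</FONT></P>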
  <P><FONT size=2>&gt; &gt; 3.) The addition as you enumerated to the checkpoint 
  data structure, </FONT><BR><FONT size=2>&gt; &gt; did you have any 
  implementation preferences or did you want us to</FONT> <BR><FONT size=2>&gt; 
  use</FONT> <BR><FONT size=2>&gt; &gt; anything appropriate (cursorily, I was 
  thinking of a list of struct</FONT> <BR><FONT size=2>&gt; &gt; refs)</FONT> 
  <BR><FONT size=2>&gt; </FONT><BR><FONT size=2>&gt; hmm I have an affinity 
  towards avoiding any sort of memory allocation </FONT><BR><FONT size=2>&gt; if 
  at all possible (because they can fail, and this can cause us major 
  </FONT><BR><FONT size=2>&gt; troubles).&nbsp; Maybe something like:</FONT> 
  <BR><FONT size=2>&gt; </FONT><BR><FONT size=2>&gt; struct ckpt_refcount {</FONT> 
  <BR><FONT size=2>&gt;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; int count;</FONT> 
  <BR><FONT size=2>&gt;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; struct 
  in_addr addr;</FONT> <BR><FONT size=2>&gt; };</FONT> <BR><FONT size=2>&gt; 
  </FONT><BR><FONT size=2>&gt; Then something like adding to 
  saCkptCheckpoint:</FONT> <BR><FONT size=2>&gt; </FONT><BR><FONT size=2>&gt; 
  struct ckpt_refcount ckpt_refcount[MAX_MEMBERS];</FONT> <BR><FONT size=2>&gt; 
  </FONT><BR><FONT size=2>&gt; MAX_MEMBERS should probably be brought out from 
  totemsrp.c into </FONT><BR><FONT size=2>&gt; totemsrp.h and renamed from 
  MAX_MEMBERS to PROCESSOR_COUNT_MAX.</FONT> <BR><FONT size=2>&gt; 
  </FONT><BR><FONT size=2>&gt; &gt; </FONT><BR><FONT size=2>&gt; &gt; 4.) The 
  last_known_ring_id. What does that mean to a newly added</FONT> <BR><FONT 
  size=2>&gt; &gt; processor. Explicitly ( incoming_ring_id == 
  last_known_ring_id )</FONT> <BR><FONT size=2>&gt; will</FONT> <BR><FONT 
  size=2>&gt; &gt; always fail on a newly commissioned processor. Am I 
  understanding</FONT> <BR><FONT size=2>&gt; that</FONT> <BR><FONT size=2>&gt; 
  &gt; correctly ?</FONT> <BR><FONT size=2>&gt; &gt; </FONT><BR><FONT 
  size=2>&gt; </FONT><BR><FONT size=2>&gt; no not incoming ring id.&nbsp; 
  Instead it is the processor's last ring id </FONT><BR><FONT size=2>&gt; in the 
  originated synchronization message.</FONT> <BR><FONT size=2>&gt; 
  </FONT><BR><FONT size=2>&gt; last known ring id should be inited to 
  zero.&nbsp; You understand that the </FONT><BR><FONT size=2>&gt; sync message 
  will have some value and last_known_ring_id will be zero.</FONT> <BR><FONT 
  size=2>&gt; </FONT><BR><FONT size=2>&gt; This will force the synchronization 
  message to be accepted which is </FONT><BR><FONT size=2>&gt; desired 
  behavior.</FONT> <BR><FONT size=2>&gt; </FONT><BR><FONT size=2>&gt; &gt; Where 
  is the last_known_ring_id stored ?</FONT> <BR><FONT size=2>&gt; &gt; 
  </FONT><BR><FONT size=2>&gt; </FONT><BR><FONT size=2>&gt; it must be stored 
  when a configuration change is delivered to the </FONT><BR><FONT size=2>&gt; 
  ckpt_confchg_fn.</FONT> <BR><FONT size=2>&gt; </FONT><BR><FONT size=2>&gt; 
  &gt; 5.) Is exec/evt.c the best example for any ideas on implementation</FONT> 
  <BR><FONT size=2>&gt; ??</FONT> <BR><FONT size=2>&gt; &gt; </FONT><BR><FONT 
  size=2>&gt; </FONT><BR><FONT size=2>&gt; I don't think evt uses reference 
  counting to track channels, but it is </FONT><BR><FONT size=2>&gt; necessary 
  for checkpoints because of checkpoint retention.&nbsp; I'd rather 
  </FONT><BR><FONT size=2>&gt; try to invent a few different approaches here so 
  we can unify them </FONT><BR><FONT size=2>&gt; later once we have discovered 
  the best design.</FONT> <BR><FONT size=2>&gt; </FONT><BR><FONT size=2>&gt; 
  Synchronization after a merge or partition is the hardest part of a 
  </FONT><BR><FONT size=2>&gt; distributed system and I hope we can find a few 
  approaches to test </FONT><BR><FONT size=2>&gt; out.</FONT> <BR><FONT 
  size=2>&gt; </FONT><BR><FONT size=2>&gt; &gt; </FONT><BR><FONT size=2>&gt; 
  &gt; Thanks</FONT> <BR><FONT size=2>&gt; &gt; </FONT><BR><FONT size=2>&gt; 
  &gt; Muni</FONT> <BR><FONT size=2>&gt; &gt; </FONT><BR><FONT size=2>&gt; &gt; 
  -----Original Message-----</FONT> <BR><FONT size=2>&gt; &gt; From: Steven Dake 
  [<A href="mailto:sdake@mvista.com">mailto:sdake@mvista.com</A>]</FONT> 
  <BR><FONT size=2>&gt; &gt; Sent: Tuesday, February 15, 2005 1:51 PM</FONT> 
  <BR><FONT size=2>&gt; &gt; To: Smith, Kristen [NGC:B675:EXCH]</FONT> <BR><FONT 
  size=2>&gt; &gt; Cc: markh@osdl.org; openais@lists.osdl.org; Bajpai, Muni 
  </FONT><BR><FONT size=2>&gt; &gt; [NGC:B670:EXCH]</FONT> <BR><FONT size=2>&gt; 
  &gt; Subject: RE: [Openais] Checkpoint crash in aisexec</FONT> <BR><FONT 
  size=2>&gt; &gt; </FONT><BR><FONT size=2>&gt; &gt; </FONT><BR><FONT 
  size=2>&gt; &gt; On Tue, 2005-02-15 at 09:47, Kristen Smith wrote:</FONT> 
  <BR><FONT size=2>&gt; &gt; &gt; Steve,</FONT> <BR><FONT size=2>&gt; &gt; &gt; 
  </FONT><BR><FONT size=2>&gt; &gt; &gt; Thanks for the response - I hear ya 
  loud and clear - not good</FONT> <BR><FONT size=2>&gt; &gt; without</FONT> 
  <BR><FONT size=2>&gt; &gt; &gt; recovery. So, is there something that we could 
  do to help you with </FONT><BR><FONT size=2>&gt; &gt; &gt; this recovery 
  coding? If you had some type of design thoughts on</FONT> <BR><FONT 
  size=2>&gt; how</FONT> <BR><FONT size=2>&gt; &gt; &gt; you wanted checkpoint 
  recovery to occur, maybe that is something</FONT> <BR><FONT size=2>&gt; 
  we</FONT> <BR><FONT size=2>&gt; &gt; &gt; could help out with. Just throwing 
  this out there to see what you</FONT> <BR><FONT size=2>&gt; &gt; &gt; 
  think.</FONT> <BR><FONT size=2>&gt; &gt; &gt; </FONT><BR><FONT size=2>&gt; 
  &gt; </FONT><BR><FONT size=2>&gt; &gt; Kristen</FONT> <BR><FONT size=2>&gt; 
  &gt; You have done a lot to help us so far but more help is always</FONT> 
  <BR><FONT size=2>&gt; &gt; appreciated</FONT> <BR><FONT size=2>&gt; &gt; 
  :)</FONT> <BR><FONT size=2>&gt; &gt; </FONT><BR><FONT size=2>&gt; &gt; If 
  someone from your org wanted to get started writing code for</FONT> <BR><FONT 
  size=2>&gt; &gt; checkpoint recovery that would be great!&nbsp; I spent some 
  time in the </FONT><BR><FONT size=2>&gt; &gt; drive to work this morning 
  thinking about how checkpoint recovery </FONT><BR><FONT size=2>&gt; &gt; 
  should work:</FONT> <BR><FONT size=2>&gt; &gt; </FONT><BR><FONT size=2>&gt; 
  &gt; There are 3 main steps that should be done in order:</FONT> <BR><FONT 
  size=2>&gt; &gt; 1. synchronize checkpoint reference counts (so retention 
  timers work</FONT> <BR><FONT size=2>&gt; &gt; properly)</FONT> <BR><FONT 
  size=2>&gt; &gt; 2. synchronize checkpoint metadata contents (sizes, sections, 
  etc)</FONT> <BR><FONT size=2>&gt; &gt; 3. 
  synchronize checkpoint section data contents</FONT> <BR><FONT size=2>&gt; &gt; 
  </FONT><BR><FONT size=2>&gt; &gt; The place to get started is on the reference 
  count synchronization.</FONT> <BR><FONT size=2>&gt; &gt; </FONT><BR><FONT 
  size=2>&gt; &gt; The checkpoint must contain a list of active user's processor 
  ids</FONT> <BR><FONT size=2>&gt; &gt; along with their reference count.&nbsp; 
  So if processor A has checkpoint</FONT> <BR><FONT size=2>&gt; 1</FONT> 
  <BR><FONT size=2>&gt; &gt; open twice, and processor B has checkpoint 1 open 
  three times, and</FONT> <BR><FONT size=2>&gt; &gt; processor C has checkpoint 
  1 open four times each processor would </FONT><BR><FONT size=2>&gt; &gt; 
  maintain a list for the checkpoint (in the checkpoint data</FONT> <BR><FONT 
  size=2>&gt; structure):</FONT> <BR><FONT size=2>&gt; &gt; </FONT><BR><FONT 
  size=2>&gt; &gt; p_A:r_2</FONT> <BR><FONT size=2>&gt; &gt; p_B:r_3</FONT> 
  <BR><FONT size=2>&gt; &gt; p_C:r_4</FONT> <BR><FONT size=2>&gt; &gt; 
  </FONT><BR><FONT size=2>&gt; &gt; Then on a configuration change, the leaving 
  processors would close</FONT> <BR><FONT size=2>&gt; &gt; their reference 
  counts.&nbsp; So in this example, p_B leaves then the </FONT><BR><FONT 
  size=2>&gt; &gt; processor ref count looks like: p_A:r_2 p_C:r_4</FONT> 
  <BR><FONT size=2>&gt; &gt; </FONT><BR><FONT size=2>&gt; &gt; During this 
  configuration change, a processor joins p_D.&nbsp; It has</FONT> <BR><FONT 
  size=2>&gt; &gt; checkpoint 1 open 1 time.&nbsp; p_D gets a configuration 
  change {add p_A,</FONT> <BR><FONT size=2>&gt; &gt; p_C} and then sends a 
  synchronization message with its previous ring</FONT> <BR><FONT size=2>&gt; 
  &gt; identifier and current list of checkpoint reference counts (after</FONT> 
  <BR><FONT size=2>&gt; the</FONT> <BR><FONT size=2>&gt; &gt; above leave in the 
  configuration change was processed).&nbsp; The</FONT> <BR><FONT size=2>&gt; 
  &gt; representative of {p_A, p_C} also sends a synchronization message</FONT> 
  <BR><FONT size=2>&gt; with</FONT> <BR><FONT size=2>&gt; &gt; the previous ring 
  identifier and a current list of checkpoint</FONT> <BR><FONT size=2>&gt; &gt; 
  reference counts.&nbsp; If the previous ring identifiers match and the 
  </FONT><BR><FONT size=2>&gt; &gt; sending processor is not the delivering 
  processor then p_C should </FONT><BR><FONT size=2>&gt; &gt; ignore p_A's 
  message (ie: p_C receives p_A message, but it already </FONT><BR><FONT 
  size=2>&gt; &gt; knows about p_A's references).</FONT> <BR><FONT size=2>&gt; 
  &gt; </FONT><BR><FONT size=2>&gt; &gt; This requires us to add the ring 
  identifier to the configuration</FONT> <BR><FONT size=2>&gt; &gt; 
  change.</FONT> <BR><FONT size=2>&gt; &gt; </FONT><BR><FONT size=2>&gt; &gt; So 
  now each previous configuration is aware of the new</FONT> <BR><FONT 
  size=2>&gt; configuration.</FONT> <BR><FONT size=2>&gt; &gt; The reference 
  counts look like:</FONT> <BR><FONT size=2>&gt; &gt; p_A:r_2</FONT> <BR><FONT 
  size=2>&gt; &gt; p_C:r_4</FONT> <BR><FONT size=2>&gt; &gt; p_D:r_1</FONT> 
  <BR><FONT size=2>&gt; &gt; </FONT><BR><FONT size=2>&gt; &gt; The above 
  maintenance of the reference counts, or open checkpoints,</FONT> <BR><FONT 
  size=2>&gt; &gt; must maintain a per-checkpoint variable which is the 
  "reference count</FONT> <BR><FONT size=2>&gt; 
  &gt; for this checkpoint".&nbsp; In the last case, that reference count 
  would be</FONT> <BR><FONT size=2>&gt; &gt; 
  7.</FONT> <BR><FONT size=2>&gt; &gt; </FONT><BR><FONT size=2>&gt; &gt; Each 
  time a processor leaves, its reference counts are subtracted</FONT> <BR><FONT 
  size=2>&gt; from</FONT> <BR><FONT size=2>&gt; &gt; this "global ref 
  count".&nbsp; Each time a processor is added, its</FONT> <BR><FONT size=2>&gt; 
  &gt; reference counts are added.&nbsp; This reference count is then what 
  is</FONT> <BR><FONT size=2>&gt; used</FONT> <BR><FONT size=2>&gt; &gt; for 
  retention duration.</FONT> <BR><FONT size=2>&gt; &gt; </FONT><BR><FONT 
  size=2>&gt; &gt; Any thoughts on the above approach welcome.</FONT> <BR><FONT 
  size=2>&gt; &gt; </FONT><BR><FONT size=2>&gt; &gt; Thanks!</FONT> <BR><FONT 
  size=2>&gt; &gt; -steve</FONT> <BR><FONT size=2>&gt; &gt; </FONT><BR><FONT 
  size=2>&gt; &gt; &gt; Thanks,</FONT> <BR><FONT size=2>&gt; &gt; &gt; 
  Kristen</FONT> <BR><FONT size=2>&gt; &gt; &gt; </FONT><BR><FONT size=2>&gt; 
  &gt; &gt; -----Original Message-----</FONT> <BR><FONT size=2>&gt; &gt; &gt; 
  From: Steven Dake [<A 
  href="mailto:sdake@mvista.com">mailto:sdake@mvista.com</A>]</FONT> <BR><FONT 
  size=2>&gt; &gt; &gt; Sent: Monday, February 14, 2005 2:17 PM</FONT> <BR><FONT 
  size=2>&gt; &gt; &gt; To: Smith, Kristen [NGC:B675:EXCH]; 
  markh@osdl.org;</FONT> <BR><FONT size=2>&gt; &gt; &gt; 
  openais@lists.osdl.org</FONT> <BR><FONT size=2>&gt; &gt; &gt; Cc: Bajpai, Muni 
  [NGC:B670:EXCH]</FONT> <BR><FONT size=2>&gt; &gt; &gt; Subject: RE: [Openais] 
  Checkpoint crash in aisexec</FONT> <BR><FONT size=2>&gt; &gt; &gt; 
  </FONT><BR><FONT size=2>&gt; &gt; &gt; </FONT><BR><FONT size=2>&gt; &gt; &gt; 
  On Sat, 2005-02-12 at 08:08, Kristen Smith wrote:</FONT> <BR><FONT size=2>&gt; 
  &gt; &gt; &gt; Steve,</FONT> <BR><FONT size=2>&gt; &gt; &gt; &gt; 
  </FONT><BR><FONT size=2>&gt; &gt; &gt; &gt; Thanks for the response.</FONT> 
  <BR><FONT size=2>&gt; &gt; &gt; &gt; </FONT><BR><FONT size=2>&gt; &gt; &gt; 
  &gt; For recovery - what are the ramifications if we don't have</FONT> 
  <BR><FONT size=2>&gt; &gt; recovery</FONT> <BR><FONT size=2>&gt; &gt; &gt; 
  &gt; working 100%? What I see now is that when a node leaves the</FONT> 
  <BR><FONT size=2>&gt; &gt; cluster</FONT> <BR><FONT size=2>&gt; &gt; &gt; &gt; 
  and then rejoins, it receives evt messages, but it can take</FONT> <BR><FONT 
  size=2>&gt; &gt; anywhere</FONT> <BR><FONT size=2>&gt; &gt; &gt; &gt; from 
  15seconds to minutes for evt messages sent from that node</FONT> <BR><FONT 
  size=2>&gt; to</FONT> <BR><FONT size=2>&gt; &gt; &gt; &gt; reach the other 
  applications. I handle this with some</FONT> <BR><FONT size=2>&gt; &gt; &gt; 
  </FONT><BR><FONT size=2>&gt; &gt; &gt; Mark have you seen this issue?</FONT> 
  <BR><FONT size=2>&gt; &gt; &gt; </FONT><BR><FONT size=2>&gt; &gt; &gt; &gt; 
  message retries which is ok in this startup case. However, are</FONT> 
  <BR><FONT size=2>&gt; we</FONT> <BR><FONT size=2>&gt; &gt; in</FONT> <BR><FONT 
  size=2>&gt; &gt; &gt; &gt; jeopardy in other cases that I am not considering? 
  When running </FONT><BR><FONT size=2>&gt; &gt; &gt; &gt; traffic the past few 
  days and seeing periodic reconfigs, I don't</FONT> <BR><FONT size=2>&gt; &gt; 
  &gt; seem</FONT> <BR><FONT size=2>&gt; &gt; &gt; &gt; to be losing messages 
  when that occurs - I only see the lost</FONT> <BR><FONT size=2>&gt; &gt; &gt; 
  messages</FONT> <BR><FONT size=2>&gt; &gt; &gt; &gt; when I actually kill a 
  node and start it back up to rejoin the</FONT> <BR><FONT size=2>&gt; &gt; &gt; 
  &gt; cluster.</FONT> <BR><FONT size=2>&gt; &gt; &gt; &gt; </FONT><BR><FONT 
  size=2>&gt; &gt; &gt; </FONT><BR><FONT size=2>&gt; &gt; &gt; What we have 
  today is totally unacceptable because, at least for </FONT><BR><FONT size=2>&gt; 
  &gt; &gt; checkpointing, there is no recovery.&nbsp; And Mark is waiting on 
  my</FONT> <BR><FONT size=2>&gt; base</FONT> <BR><FONT size=2>&gt; &gt; &gt; 
  code for event recovery.</FONT> <BR><FONT size=2>&gt; &gt; &gt; 
  </FONT><BR><FONT size=2>&gt; &gt; &gt; Definition of 100% working means if 
  there is a failure during </FONT><BR><FONT size=2>&gt; &gt; &gt; recovery, we 
  are guaranteed a consistent state.&nbsp; I think evt is</FONT> <BR><FONT 
  size=2>&gt; &gt; pretty</FONT> <BR><FONT size=2>&gt; &gt; &gt; close to this 
  goal, although the checkpoint replication after</FONT> <BR><FONT size=2>&gt; 
  merge</FONT> <BR><FONT size=2>&gt; &gt; &gt; has not been developed yet.&nbsp; 
  I can think of a lot of easy ways to</FONT> <BR><FONT size=2>&gt; do</FONT> 
  <BR><FONT size=2>&gt; &gt; &gt; this, but handling a failure during the 
  recovery phase makes it</FONT> <BR><FONT size=2>&gt; more</FONT> <BR><FONT 
  size=2>&gt; &gt; &gt; difficult.</FONT> <BR><FONT size=2>&gt; &gt; &gt; 
  </FONT><BR><FONT size=2>&gt; &gt; &gt; Definition of almost 100% is that 
  recovery works properly if there</FONT> <BR><FONT size=2>&gt; &gt; are</FONT> 
  <BR><FONT size=2>&gt; &gt; &gt; no faults during recovery (ie: the merge 
  process), but if there is</FONT> <BR><FONT size=2>&gt; a</FONT> <BR><FONT 
  size=2>&gt; &gt; &gt; fault during recovery (ie: reconfig) something could go 
  awry.</FONT> <BR><FONT size=2>&gt; &gt; &gt; </FONT><BR><FONT size=2>&gt; &gt; 
  &gt; We want consistently replicated data (the 100% case).&nbsp; 100% is 
  </FONT><BR><FONT size=2>&gt; &gt; &gt; probably past your development window; 
  the other case is within</FONT> <BR><FONT size=2>&gt; &gt; reach.</FONT> 
  <BR><FONT size=2>&gt; &gt; &gt; </FONT><BR><FONT size=2>&gt; &gt; &gt; 
  Regards</FONT> <BR><FONT size=2>&gt; &gt; &gt; -steve</FONT> <BR><FONT 
  size=2>&gt; &gt; &gt; </FONT><BR><FONT size=2>&gt; &gt; &gt; &gt; 
  Thanks</FONT> <BR><FONT size=2>&gt; &gt; &gt; &gt; Kristen</FONT> <BR><FONT 
  size=2>&gt; &gt; &gt; &gt; </FONT><BR><FONT size=2>&gt; &gt; &gt; &gt; 
  -----Original Message-----</FONT> <BR><FONT size=2>&gt; &gt; &gt; &gt; From: 
  Steven Dake [<A 
  href="mailto:sdake@mvista.com">mailto:sdake@mvista.com</A>]</FONT> <BR><FONT 
  size=2>&gt; &gt; &gt; &gt; Sent: Friday, February 11, 2005 5:30 PM</FONT> 
  <BR><FONT size=2>&gt; &gt; &gt; &gt; To: Smith, Kristen [NGC:B675:EXCH]</FONT> 
  <BR><FONT size=2>&gt; &gt; &gt; &gt; Subject: RE: [Openais] Checkpoint crash 
  in aisexec</FONT> <BR><FONT size=2>&gt; &gt; &gt; &gt; </FONT><BR><FONT 
  size=2>&gt; &gt; &gt; &gt; </FONT><BR><FONT size=2>&gt; &gt; &gt; &gt; Ok well 
  I doubt with 200 byte checkpoints there is a buffer</FONT> <BR><FONT 
  size=2>&gt; &gt; &gt; overflow.</FONT> <BR><FONT size=2>&gt; &gt; &gt; &gt; 
  :)</FONT> <BR><FONT size=2>&gt; &gt; &gt; &gt; </FONT><BR><FONT size=2>&gt; 
  &gt; &gt; &gt; Recovery will come after 188 is wrapped up.&nbsp; I think your 
  two</FONT> <BR><FONT size=2>&gt; &gt; weeks</FONT> <BR><FONT size=2>&gt; &gt; 
  &gt; &gt; window looks good for alpha-level recovery (ie: works most of</FONT> 
  <BR><FONT size=2>&gt; the</FONT> <BR><FONT size=2>&gt; &gt; &gt; &gt; 
  time).&nbsp; High quality production recovery will not hit your</FONT> 
  <BR><FONT size=2>&gt; window</FONT> <BR><FONT size=2>&gt; &gt; &gt; for</FONT> 
  <BR><FONT size=2>&gt; &gt; &gt; &gt; development (ie: works 100% of the time 
  no matter what happens).</FONT> <BR><FONT size=2>&gt; &gt; &gt; &gt; 
  </FONT><BR><FONT size=2>&gt; &gt; &gt; &gt; Thanks</FONT> <BR><FONT 
  size=2>&gt; &gt; &gt; &gt; -steve</FONT> <BR><FONT size=2>&gt; &gt; &gt; &gt; 
  </FONT><BR><FONT size=2>&gt; &gt; &gt; &gt; On Fri, 2005-02-11 at 15:56, 
  Kristen Smith wrote:</FONT> <BR><FONT size=2>&gt; &gt; &gt; &gt; &gt; 
  Steve,</FONT> <BR><FONT size=2>&gt; &gt; &gt; &gt; &gt; </FONT><BR><FONT 
  size=2>&gt; &gt; &gt; &gt; &gt; The size of the checkpoints are ~200 
  bytes.</FONT> <BR><FONT size=2>&gt; &gt; &gt; &gt; &gt; </FONT><BR><FONT 
  size=2>&gt; &gt; &gt; &gt; &gt; I agree, valgrind is an excellent tool. We 
  will run it through</FONT> <BR><FONT size=2>&gt; &gt; and</FONT> <BR><FONT 
  size=2>&gt; &gt; &gt; &gt; see</FONT> <BR><FONT size=2>&gt; &gt; &gt; &gt; 
  &gt; if that shows anything.</FONT> <BR><FONT size=2>&gt; &gt; &gt; &gt; &gt; 
  </FONT><BR><FONT size=2>&gt; &gt; &gt; &gt; &gt; I have tried this scenario 
  maybe 30 times today (for various</FONT> <BR><FONT size=2>&gt; &gt; 
  other</FONT> <BR><FONT size=2>&gt; &gt; &gt; &gt; &gt; testing) and it 
  happened maybe 10 times. For a while I could</FONT> <BR><FONT size=2>&gt; &gt; 
  &gt; &gt; reproduce</FONT> <BR><FONT size=2>&gt; &gt; &gt; &gt; &gt; with a 
  given test about 5 times and then it hasn't happened</FONT> <BR><FONT 
  size=2>&gt; &gt; again.</FONT> <BR><FONT size=2>&gt; &gt; &gt; &gt; &gt; 
  </FONT><BR><FONT size=2>&gt; &gt; &gt; &gt; &gt; Sounds like defect-188 fixing 
  is going well. May I ask how the </FONT><BR><FONT size=2>&gt; &gt; &gt; &gt; 
  &gt; recovery work is going as well? (Don't mean to be pushy on</FONT> 
  <BR><FONT size=2>&gt; that</FONT> <BR><FONT size=2>&gt; &gt; &gt; &gt; 
  front</FONT> <BR><FONT size=2>&gt; &gt; &gt; &gt; &gt; - we have 2 more weeks 
  of coding for our application left and</FONT> <BR><FONT size=2>&gt; I</FONT> 
  <BR><FONT size=2>&gt; &gt; am</FONT> <BR><FONT size=2>&gt; &gt; &gt; &gt; &gt; 
  really hoping that we are able to put the new recovery code in</FONT> 
  <BR><FONT size=2>&gt; &gt; &gt; &gt; during</FONT> <BR><FONT size=2>&gt; &gt; 
  &gt; &gt; &gt; that time).</FONT> <BR><FONT size=2>&gt; &gt; &gt; &gt; &gt; 
  </FONT><BR><FONT size=2>&gt; &gt; &gt; &gt; &gt; Thanks a bunch,</FONT> 
  <BR><FONT size=2>&gt; &gt; &gt; &gt; &gt; Kristen</FONT> <BR><FONT size=2>&gt; 
  &gt; &gt; &gt; &gt; </FONT><BR><FONT size=2>&gt; &gt; &gt; &gt; &gt; 
  -----Original Message-----</FONT> <BR><FONT size=2>&gt; &gt; &gt; &gt; &gt; 
  From: Steven Dake [<A 
  href="mailto:sdake@mvista.com">mailto:sdake@mvista.com</A>]</FONT> <BR><FONT 
  size=2>&gt; &gt; &gt; &gt; &gt; Sent: Friday, February 11, 2005 4:37 PM</FONT> 
  <BR><FONT size=2>&gt; &gt; &gt; &gt; &gt; To: Smith, Kristen 
  [NGC:B675:EXCH]</FONT> <BR><FONT size=2>&gt; &gt; &gt; &gt; &gt; Subject: Re: 
  [Openais] Checkpoint crash in aisexec</FONT> <BR><FONT size=2>&gt; &gt; &gt; 
  &gt; &gt; </FONT><BR><FONT size=2>&gt; &gt; &gt; &gt; &gt; </FONT><BR><FONT 
  size=2>&gt; &gt; &gt; &gt; &gt; how large are the read or write 
  requests?</FONT> <BR><FONT size=2>&gt; &gt; &gt; &gt; &gt; just a thought 
  there could be some buffer overrun with larger </FONT><BR><FONT size=2>&gt; 
  &gt; &gt; &gt; &gt; requests.</FONT> <BR><FONT size=2>&gt; &gt; &gt; &gt; &gt; 
  </FONT><BR><FONT size=2>&gt; &gt; &gt; &gt; &gt; On Fri, 2005-02-11 at 14:55, 
  Kristen Smith wrote:</FONT> <BR><FONT size=2>&gt; &gt; &gt; &gt; &gt; &gt; 
  Steve,</FONT> <BR><FONT size=2>&gt; &gt; &gt; &gt; &gt; &gt; </FONT><BR><FONT 
  size=2>&gt; &gt; &gt; &gt; &gt; &gt; We are periodically seeing aisexec crash 
  with the following</FONT> <BR><FONT size=2>&gt; &gt; &gt; trace:</FONT> 
  <BR><FONT size=2>&gt; &gt; &gt; &gt; &gt; &gt; </FONT><BR><FONT size=2>&gt; 
  &gt; &gt; &gt; &gt; &gt;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; (gdb) 
  bt</FONT> <BR><FONT size=2>&gt; &gt; &gt; &gt; &gt; 
  &gt;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; #0&nbsp; 
  message_handler_req_lib_ckpt_checkpointclose</FONT> <BR><FONT size=2>&gt; &gt; 
  &gt; &gt; &gt; &gt;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 
  (conn_info=0x0, message=0xb73fc008) at ckpt.c:1552</FONT> <BR><FONT 
  size=2>&gt; &gt; &gt; &gt; &gt; 
  &gt;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; #1&nbsp; 0x080494c2 in 
  poll_handler_libais_deliver</FONT> <BR><FONT size=2>&gt; &gt; 
  (handle=0,</FONT> <BR><FONT size=2>&gt; &gt; &gt; &gt; &gt; fd=3,</FONT> 
  <BR><FONT size=2>&gt; &gt; &gt; &gt; &gt; 
  &gt;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; revent=134633824, 
  data=0x89c2ad8,</FONT> <BR><FONT size=2>&gt; &gt; &gt; &gt; &gt; 
  &gt;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 
  prio=0x89b2784) at main.c:578</FONT> <BR><FONT size=2>&gt; &gt; &gt; &gt; &gt; 
  &gt;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; #2&nbsp; 0x08056e62 in 
  poll_run (handle=0) at</FONT> <BR><FONT size=2>&gt; aispoll.c:386</FONT> 
  <BR><FONT size=2>&gt; &gt; &gt; &gt; &gt; &gt; </FONT><BR><FONT size=2>&gt; 
  &gt; &gt; &gt; &gt; &gt; </FONT><BR><FONT size=2>&gt; &gt; &gt; &gt; &gt; &gt; 
  #3&nbsp; 0x080499ac in main (argc=1, argv=0xbfffcb64) at</FONT> <BR><FONT 
  size=2>&gt; &gt; main.c:1003</FONT> <BR><FONT size=2>&gt; &gt; &gt; &gt; &gt; 
  &gt; </FONT><BR><FONT size=2>&gt; &gt; &gt; &gt; &gt; &gt; We have looked 
  through the code but can't seem to figure out</FONT> <BR><FONT size=2>&gt; 
  &gt; how</FONT> <BR><FONT size=2>&gt; &gt; &gt; &gt; &gt; &gt; conn_info is 
  getting set to 0. Do you have any idea under</FONT> <BR><FONT size=2>&gt; 
  what</FONT> <BR><FONT size=2>&gt; &gt; &gt; &gt; &gt; &gt; circumstances 
  conn_info could be null when this function is</FONT> <BR><FONT size=2>&gt; 
  &gt; &gt; &gt; called?</FONT> <BR><FONT size=2>&gt; &gt; &gt; &gt; &gt; &gt; 
  </FONT><BR><FONT size=2>&gt; &gt; &gt; &gt; &gt; &gt; This is happening when 
  we have multiple nodes up and we kill</FONT> <BR><FONT size=2>&gt; &gt; 
  one</FONT> <BR><FONT size=2>&gt; &gt; &gt; &gt; of</FONT> <BR><FONT 
  size=2>&gt; &gt; &gt; &gt; &gt; &gt; the active nodes. The standby node (which 
  was reading</FONT> <BR><FONT size=2>&gt; &gt; &gt; checkpoints)</FONT> 
  <BR><FONT size=2>&gt; &gt; &gt; &gt; &gt; &gt; must now become a writer, so it 
  closes the checkpoint and</FONT> <BR><FONT size=2>&gt; this</FONT> <BR><FONT 
  size=2>&gt; &gt; &gt; &gt; &gt; &gt; happens. Unfortunately, I can't reproduce 
  this consistently</FONT> <BR><FONT size=2>&gt; -</FONT> <BR><FONT size=2>&gt; 
  &gt; I</FONT> <BR><FONT size=2>&gt; &gt; &gt; &gt; &gt; &gt; finally got a 
  core dump today. I don't recall ever seeing</FONT> <BR><FONT size=2>&gt; 
  this</FONT> <BR><FONT size=2>&gt; &gt; &gt; &gt; with</FONT> <BR><FONT 
  size=2>&gt; &gt; &gt; &gt; &gt; &gt; the old code.</FONT> <BR><FONT 
  size=2>&gt; &gt; &gt; &gt; &gt; &gt; </FONT><BR><FONT size=2>&gt; &gt; &gt; 
  &gt; &gt; &gt; Thanks,</FONT> <BR><FONT size=2>&gt; &gt; &gt; &gt; &gt; &gt; 
  Kristen</FONT> <BR><FONT size=2>&gt; &gt; &gt; &gt; &gt; &gt; </FONT><BR><FONT 
  size=2>&gt; &gt; &gt; &gt; &gt; &gt; </FONT><BR><FONT size=2>&gt; &gt; &gt; 
  &gt; &gt; &gt; </FONT><BR><FONT size=2>&gt; &gt; &gt; &gt; &gt; &gt;</FONT> 
  <BR><FONT size=2>&gt; &gt; &gt; &gt; &gt;</FONT> <BR><FONT size=2>&gt; &gt; 
  &gt; &gt;</FONT> <BR><FONT size=2>&gt; &gt; &gt;</FONT> <BR><FONT size=2>&gt; 
  &gt;</FONT> <BR><FONT size=2>&gt; 
  ______________________________________________________________________</FONT> 
  <BR><FONT size=2>&gt; &gt; &gt; &gt; &gt; &gt; 
  _______________________________________________</FONT> <BR><FONT size=2>&gt; 
  &gt; &gt; &gt; &gt; &gt; Openais mailing list</FONT> <BR><FONT size=2>&gt; 
  &gt; &gt; &gt; &gt; &gt; Openais@lists.osdl.org</FONT> <BR><FONT size=2>&gt; 
  &gt; &gt; &gt; &gt; <A href="http://lists.osdl.org/mailman/listinfo/openais" 
  target=_blank>http://lists.osdl.org/mailman/listinfo/openais</A></FONT> 
  <BR><FONT size=2>&gt; &gt; &gt; &gt; &gt; </FONT><BR><FONT size=2>&gt; &gt; 
  &gt; &gt; &gt; </FONT><BR><FONT size=2>&gt; &gt; &gt; &gt; </FONT><BR><FONT 
  size=2>&gt; &gt; &gt; &gt; </FONT><BR><FONT size=2>&gt; &gt; &gt; 
  </FONT><BR><FONT size=2>&gt; &gt; &gt; </FONT><BR><FONT size=2>&gt; &gt; 
  </FONT><BR><FONT size=2>&gt; &gt; </FONT><BR><FONT size=2>&gt; 
  </FONT><BR><FONT size=2>&gt; </FONT></P><BR></BLOCKQUOTE></BODY></HTML>

_______________________________________________
Openais mailing list
Openais@lists.osdl.org
http://lists.osdl.org/mailman/listinfo/openais
