
List:       openais
Subject:    [Openais] Bug 169. Assert on Ifdown
From:       "Muni Bajpai" <muniba () nortel ! com>
Date:       2005-03-29 5:45:29
Message-ID: CFCE7C3BDB79204092974B5B50AD71941002F0 () zrc2hxm0 ! corp ! nortel ! com

Hey Steve,

I've added the changes, but evt's call to clm_get_by_nodeid still fails:
even though I've changed this_ip to a pointer, clm_init does a memcpy to
store the node_id, so the copy it holds is never updated.

I could change that, but I didn't know which approach you want to take.
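
To make the memcpy issue concrete, here is a minimal sketch in C. All of
the names (iface_sketch, clm_node_sketch, this_ip_sketch and the helper
functions) are hypothetical stand-ins and not the real openais
declarations; the only point is that a copy taken at init time goes
stale, while a read through the pointer follows the rebind.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>

/* Hypothetical stand-ins; not the real openais declarations. */
struct iface_sketch {
    struct in_addr bound_addr;  /* address the transport is bound to now */
};

struct clm_node_sketch {
    struct in_addr node_id;     /* snapshot taken once, at init time */
};

static struct iface_sketch iface;

/* this_ip as a pointer into the interface data structure. */
static struct in_addr *this_ip_sketch = &iface.bound_addr;

/* Stand-in for the localhost bind code: updates the interface struct. */
static void iface_rebind_sketch(const char *dotted_quad)
{
    inet_pton(AF_INET, dotted_quad, &iface.bound_addr);
}

/* Mimics the problem: the address is copied once, so a later rebind
 * is not reflected in node->node_id. */
static void clm_init_sketch(struct clm_node_sketch *node)
{
    memcpy(&node->node_id, this_ip_sketch, sizeof(node->node_id));
}

/* Reads through the pointer at the time of use, so it follows rebinds. */
static struct in_addr clm_current_id_sketch(void)
{
    return *this_ip_sketch;
}
```

If clm_init_sketch() runs while the interface is bound to 10.1.1.1 and
the interface is then rebound to 127.0.0.1, node_id still holds 10.1.1.1
while clm_current_id_sketch() returns 127.0.0.1.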

Could you try out this patch and let me know how it goes?

Thanks again for the good feedback

Thanks

Muni 

-----Original Message-----
From: Steven Dake [mailto:sdake@mvista.com] 
Sent: Monday, March 28, 2005 1:27 PM
To: Bajpai, Muni [NGC:B670:EXCH]
Cc: Smith, Kristen [NGC:B670:EXCH]; 'Openais List'
Subject: RE: Bug 169. Assert on Ifdown

Muni
I had a look and understand the root of the problem.

The cluster membership service, along with nearly every other service,
uses the global "this_ip" to uniquely identify itself.  If the processor
ID changes (from 192.168.1.5 to 127.0.0.1), this_ip is not updated.  In
the evt case, cluster membership still has the old "this_ip" data and
sends it out.  The event service then can't find any reference to
127.0.0.1 in the cluster membership database, which results in the
assert.

This will affect every other service negatively as well.

I suggest changing this_ip to a pointer and ensuring that it points to
the interface data structure.  When that data structure is updated by
your localhost bind code, the rest of the services will then use the new
this_ip data, which is the IP address of the current processor.

Please note that we should not modify this_ip directly from totemsrp.
I want totemsrp (and libtotem) to stand alone and not have external
dependencies on the this_ip data structure.

The patch looks pretty good.  With the above change, you should be
close.

The localhost build code could be refined.  Don't we always know that
127.0.0.1 is the localhost IP address?  If so, we shouldn't have to
search for it.
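
As a rough illustration of skipping the search, the IPv4 loopback
address can be filled in directly from the INADDR_LOOPBACK constant.
This is only a sketch: the helper name and the port parameter are
assumptions, and it covers IPv4 only.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>

/* Fill in a sockaddr_in for the IPv4 loopback address (127.0.0.1)
 * without scanning the interface list.  The helper name and the port
 * parameter are illustrative assumptions. */
static void loopback_sockaddr_sketch(struct sockaddr_in *sin,
                                     unsigned short port)
{
    memset(sin, 0, sizeof(*sin));
    sin->sin_family = AF_INET;
    sin->sin_port = htons(port);
    sin->sin_addr.s_addr = htonl(INADDR_LOOPBACK);  /* 127.0.0.1 */
}
```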

Thanks
-steve


On Mon, 2005-03-28 at 10:37, Muni Bajpai wrote:
> Latest version
> 
> Thanks
> 
> Muni
> 
> -----Original Message-----
> From: Bajpai, Muni [NGC:B670:EXCH] 
> Sent: Sunday, March 27, 2005 5:54 PM
> To: 'sdake@mvista.com'
> Cc: Smith, Kristen [NGC:B670:EXCH]; Openais List
> Subject: RE: Bug 169. Assert on Ifdown
> 
> 
> Steve, please take a look at the attached patch.
> 
> Evt recovery asserts (I can't get past that, as I'm not sure what Mark's
> logic for that is) if we use the loopback address when the network goes
> down, or if we go back to the i/f addr when the interface comes back up.
> So that needs to be fixed. I've pasted the log.
> 
> What do you think?
> 
> Thanks
> 
> Muni
> 
> [root@linux104 openais]# exec/aisexec
> Mar 27 16:55:23 [NOTICE  ] [MAIN ] AIS Executive Service: Copyright (C) 2002-2004 MontaVista Software, Inc. and contributors.
> Mar 27 16:55:23 [NOTICE  ] [GMI  ] Created or loaded sequence id 36.10.1.1.1 for this ring.
> Mar 27 16:55:23 [NOTICE  ] [GMI  ]  The network interface is now up.
> Mar 27 16:55:23 [NOTICE  ] [GMI  ] entering GATHER state.
> Mar 27 16:55:23 [NOTICE  ] [GMI  ] entering GATHER state.
> Mar 27 16:55:23 [NOTICE  ] [MAIN ] AIS Executive Service: started and ready to receive connections.
> Mar 27 16:55:23 [NOTICE  ] [GMI  ] Creating commit token because I am the rep.
> Mar 27 16:55:23 [NOTICE  ] [GMI  ] Storing new sequence id for ring 40
> Mar 27 16:55:23 [NOTICE  ] [GMI  ] entering COMMIT state.
> position [0] member 10.1.1.1: previous ring seq 36 rep 10.1.1.1 aru 0 high delivered 0 received flag 1
> Mar 27 16:55:23 [NOTICE  ] [GMI  ] entering RECOVERY state.
> Mar 27 16:55:23 [NOTICE  ] [GMI  ] Sending initial ORF token
> token retrans flag is 0 my set retrans flag1 retrans queue empty 1 count 0, low_water 0 aru 0 install seq 0 aru 0 high seq received 0
> token retrans flag is 0 my set retrans flag1 retrans queue empty 1 count 1, low_water 0 aru 0 install seq 0 aru 0 high seq received 0
> token retrans flag is 0 my set retrans flag1 retrans queue empty 1 count 2, low_water 0 aru 0 install seq 0 aru 0 high seq received 0
> token retrans flag is 0 my set retrans flag1 retrans queue empty 1 count 3, low_water 0 aru 0 install seq 0 aru 0 high seq received 0
> retrans flag count 4 token aru 0 install seq 0 aru 0 0
> recovery to regular 1-0
> Delivering to app 0 to 0
> Mar 27 16:55:23 [NOTICE  ] [CLM  ] CLM CONFIGURATION CHANGE
> Mar 27 16:55:23 [NOTICE  ] [CLM  ] New Configuration:
> Mar 27 16:55:23 [NOTICE  ] [CLM  ] Members Left:
> Mar 27 16:55:23 [NOTICE  ] [CLM  ] Members Joined:
> Mar 27 16:55:23 [NOTICE  ] [CLM  ] CLM CONFIGURATION CHANGE
> Mar 27 16:55:23 [NOTICE  ] [CLM  ] New Configuration:
> Mar 27 16:55:23 [NOTICE  ] [CLM  ]      10.1.1.1
> Mar 27 16:55:23 [NOTICE  ] [CLM  ] Members Left:
> Mar 27 16:55:23 [NOTICE  ] [CLM  ] Members Joined:
> Mar 27 16:55:23 [NOTICE  ] [CLM  ]      10.1.1.1
> Mar 27 16:55:23 [NOTICE  ] [GMI  ] entering OPERATIONAL state.
> Mar 27 16:55:23 [NOTICE  ] [SYNC ] Synchronization barrier completed
> Mar 27 16:55:23 [NOTICE  ] [CLM  ] got nodejoin message 10.1.1.1
> Mar 27 16:55:23 [NOTICE  ] [SYNC ] Synchronization barrier completed
> Mar 27 16:55:23 [NOTICE  ] [SYNC ] Synchronization barrier completed
> Mar 27 16:55:23 [NOTICE  ] [SYNC ] Synchronization barrier completed
> Couldn't send token to addr 10.1.1.1 Invalid argument 4
> !!!!!! Interface is down binding to loopback addr. !!!!!
> Mar 27 16:55:33 [NOTICE  ] [GMI  ] The network interface is down.
> Mar 27 16:55:33 [NOTICE  ] [GMI  ] entering GATHER state.
> Mar 27 16:55:33 [NOTICE  ] [GMI  ] entering GATHER state.
> Mar 27 16:55:33 [NOTICE  ] [GMI  ] Creating commit token because I am the rep.
> Mar 27 16:55:33 [NOTICE  ] [GMI  ] Storing new sequence id for ring 44
> Mar 27 16:55:33 [NOTICE  ] [GMI  ] entering COMMIT state.
> position [0] member 127.0.0.1: previous ring seq 40 rep 10.1.1.1 aru 6 high delivered 6 received flag 1
> Mar 27 16:55:33 [NOTICE  ] [GMI  ] entering RECOVERY state.
> Mar 27 16:55:33 [NOTICE  ] [GMI  ] Sending initial ORF token
> token retrans flag is 0 my set retrans flag1 retrans queue empty 1 count 0, low_water 0 aru 0 install seq 0 aru 0 high seq received 0
> token retrans flag is 0 my set retrans flag1 retrans queue empty 1 count 1, low_water 0 aru 0 install seq 0 aru 0 high seq received 0
> token retrans flag is 0 my set retrans flag1 retrans queue empty 1 count 2, low_water 0 aru 0 install seq 0 aru 0 high seq received 0
> token retrans flag is 0 my set retrans flag1 retrans queue empty 1 count 3, low_water 0 aru 0 install seq 0 aru 0 high seq received 0
> retrans flag count 4 token aru 0 install seq 0 aru 0 0
> recovery to regular 1-0
> Delivering to app 6 to 0
> Mar 27 16:55:33 [NOTICE  ] [CLM  ] CLM CONFIGURATION CHANGE
> Mar 27 16:55:33 [NOTICE  ] [CLM  ] New Configuration:
> Mar 27 16:55:33 [NOTICE  ] [CLM  ] Members Left:
> Mar 27 16:55:33 [NOTICE  ] [CLM  ]      10.1.1.1
> Mar 27 16:55:33 [NOTICE  ] [CLM  ] Members Joined:
> Mar 27 16:55:33 [NOTICE  ] [CLM  ] CLM CONFIGURATION CHANGE
> Mar 27 16:55:33 [NOTICE  ] [CLM  ] New Configuration:
> Mar 27 16:55:33 [NOTICE  ] [CLM  ]      127.0.0.1
> Mar 27 16:55:33 [NOTICE  ] [CLM  ] Members Left:
> Mar 27 16:55:33 [NOTICE  ] [CLM  ] Members Joined:
> Mar 27 16:55:33 [NOTICE  ] [CLM  ]      127.0.0.1
> Mar 27 16:55:33 [NOTICE  ] [GMI  ] entering OPERATIONAL state.
> Mar 27 16:55:33 [NOTICE  ] [SYNC ] Synchronization barrier completed
> Mar 27 16:55:33 [NOTICE  ] [CLM  ] got nodejoin message 10.1.1.1
> Mar 27 16:55:33 [NOTICE  ] [SYNC ] Synchronization barrier completed
> Mar 27 16:55:33 [NOTICE  ] [SYNC ] Synchronization barrier completed
> 
> Mar 27 16:55:33 [ERROR   ] [EVT  ] recovery error node: 127.0.0.1 not found
> aisexec: evt.c:3812: evt_sync_process: Assertion `0' failed.
> Aborted
> 
> 
> -----Original Message-----
> From: Steven Dake [mailto:sdake@mvista.com] 
> Sent: Friday, March 25, 2005 4:49 PM
> To: Bajpai, Muni [NGC:B670:EXCH]
> Cc: Smith, Kristen [NGC:B670:EXCH]; Openais List
> Subject: RE: Bug 169. Assert on Ifdown
> 
> On Fri, 2005-03-25 at 15:28, Muni Bajpai wrote:
> > Hey Steve,
> > 
> > I tried out the updates to the patch and the reason why testevs is
> > not printing anything is because it is blocking on dispatch that
> > never arrives.
> > 
> 
> Yes, but I think this is not the root cause.  I believe that totem is
> not ordering messages at all because they are not being self-delivered
> to the message_handler_mcast code.
> 
> > Now the other issue is that on ifdowning an interface the mcasts and
> > unicasts to self are also not delivered. This is a problem because
> > most openais transactions are request/response pairs and the absence
> > of the response causes unpredictable results.
> > 
> 
> Muni
> 
> I think the unicasts to self are delivered, just not the multicasts.
> There is a flag which is specified in the bind command which tells the
> stack to self-deliver multicast messages.  It appears for some reason
> this is not occurring.  Maybe the socket couldn't be bound, or some
> other issue occurs.
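
The flag mentioned above is not named in the thread. One plausible
candidate is the IP_MULTICAST_LOOP socket option, which is set with
setsockopt() rather than in bind() itself. A minimal sketch, assuming an
IPv4 UDP socket (the helper name is made up for illustration):

```c
#include <netinet/in.h>
#include <sys/socket.h>

/* Ask the IP stack to loop outgoing multicast datagrams back to local
 * receivers on this host.  Returns 0 on success, -1 on error (errno is
 * set by setsockopt).  Assumption: IP_MULTICAST_LOOP is the flag the
 * thread refers to; the thread does not say so explicitly. */
static int enable_mcast_loop_sketch(int fd)
{
    unsigned char on = 1;

    return setsockopt(fd, IPPROTO_IP, IP_MULTICAST_LOOP, &on, sizeof(on));
}
```

Checking the return value here is one cheap way to notice the "socket
couldn't be bound" class of failure early instead of later in dispatch.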
> 
> We need to have a way to ensure self delivery even after the interface
> is downed.  I'm not sure if there is some form of network trick we
> could use.  Maybe open a localhost socket and use that for mcast.  Or
> maybe set the mcast transmit socket to the unicast socket and unicast
> transmit multicast and membership messages.  This last approach is
> probably the easiest, requiring some minor changes to:
>     set the multicast fd to the unicast fd and
>     set the multicast target address to the unicast address
> when:
>     the ring has 1 or fewer members and
>     the interface is detected as down
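
The fallback rule above can be sketched as a small pure function. This
is only an illustration: the struct layouts and every name in it
(totem_net_sketch, send_target_sketch, pick_mcast_target_sketch) are
assumptions and do not come from the totemsrp source.

```c
#include <netinet/in.h>

/* Hypothetical view of the transport's sockets and addresses. */
struct totem_net_sketch {
    int mcast_fd;                   /* socket normally used for multicast */
    int ucast_fd;                   /* socket used for unicast (token) */
    struct sockaddr_in mcast_addr;  /* multicast group address */
    struct sockaddr_in ucast_addr;  /* this processor's unicast address */
};

/* Where a multicast or membership message should actually be sent. */
struct send_target_sketch {
    int fd;
    struct sockaddr_in addr;
};

/* Use the unicast socket and address only when the ring has one or
 * fewer members AND the interface is down; otherwise multicast as usual. */
static struct send_target_sketch
pick_mcast_target_sketch(const struct totem_net_sketch *net,
                         int ring_members, int iface_is_down)
{
    struct send_target_sketch target;

    if (ring_members <= 1 && iface_is_down) {
        target.fd = net->ucast_fd;
        target.addr = net->ucast_addr;
    } else {
        target.fd = net->mcast_fd;
        target.addr = net->mcast_addr;
    }
    return target;
}
```

Keeping the decision in one place like this would leave the sendmsg call
sites unchanged; they would just send to whatever target is returned.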
> 
> > Also, while debugging I saw an anomaly where totemsrp was polling
> > the libais fd in addition to main.c. I don't know how that happened,
> > but that was an easy fix where the totemsrp dispatch is started only
> > when the network is up.
> 
> OK, I think this is the root of our problems!  This could indicate one
> of the socket/bind calls didn't complete successfully during the
> binding process.  If this is the case, then later that same fd is added
> by the libais dispatch.  We should track down the source of this
> failure and fix the error return.
> 
> We really want totemsrp to dispatch multicasts and unicasts even when
> the network interface is down.  We just have to figure out how this
> should be done.
> 
> Another possible option when the interface is down is to create a pipe
> using the pipe system call and use one end for the send, and one end
> for the receive/delivery.  This way we can avoid the bind process
> entirely.  I'd rather not make major changes to any of the sendmsg
> calls in totem, but rather, would have this functionality in the
> initialize phase or during network up/down detection.  I'd like to keep
> the rest of the code as minimally impacted as possible.
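
A minimal sketch of the pipe idea, using the POSIX pipe(), write() and
read() calls. The struct and function names are hypothetical, and the
sketch ignores partial writes and non-blocking I/O.

```c
#include <unistd.h>

/* Hypothetical pair of descriptors standing in for the network socket
 * while the interface is down. */
struct self_pipe_sketch {
    int rd;  /* read end: would be polled for delivery */
    int wr;  /* write end: would be used in place of the send socket */
};

/* Create the pipe.  Returns 0 on success, -1 on error. */
static int self_pipe_open_sketch(struct self_pipe_sketch *p)
{
    int fds[2];

    if (pipe(fds) == -1) {
        return -1;
    }
    p->rd = fds[0];
    p->wr = fds[1];
    return 0;
}

/* "Send" a message to ourselves. */
static ssize_t self_pipe_send_sketch(const struct self_pipe_sketch *p,
                                     const void *msg, size_t len)
{
    return write(p->wr, msg, len);
}

/* "Receive" whatever is waiting on the read end. */
static ssize_t self_pipe_recv_sketch(const struct self_pipe_sketch *p,
                                     void *buf, size_t len)
{
    return read(p->rd, buf, len);
}
```

One caveat with this approach: a pipe is a byte stream, so unlike UDP it
does not preserve message boundaries, and a real implementation would
need some framing (for example a length prefix) on top of it.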
> 
> 
> > 
> > In rethinking this issue, I feel that the best way to handle the
> > ifdown is a graceful shutdown (this being different from network
> > isolation, where a cluster of one is formed). Otherwise we have to
> > guarantee that all outstanding transaction requests are queued for
> > retransmits.
> > 
> 
> It is suboptimal to absolutely require a network interface for openais
> to operate.  I can think of situations where applications may not want
> the network interface up, but still want to provide ais style
> services.  The graceful shutdown approach would mean that ais would be
> unable to provide any services unless a network interface is
> available.  I am not fond of this approach.
> 
> > What do you think the expected behavior should be on a nic failure?
> > 
> 
> Nic failure (meaning that the nic is totally unable to transmit or
> receive any data) should cause two configurations: one with the
> processor containing the defective nic, which should still be able to
> provide services to applications, and a second containing the
> remaining processors.
> 
> What the applications do in this scenario should be application
> specific.  They could stonith the failed processor, or they could
> continue to allow it to operate as a separate partition.  At the
> moment it is not our concern to solve this particular application
> specific issue.
> 
> > Thanks
> > 
> > Muni
> > -----Original Message-----
> > From: Steven Dake [mailto:sdake@mvista.com] 
> > Sent: Thursday, March 24, 2005 5:58 PM
> > To: Bajpai, Muni [NGC:B670:EXCH]
> > Cc: Smith, Kristen [NGC:B670:EXCH]; Openais List
> > Subject: RE: Bug 169. Assert on Ifdown
> > 
> > Muni
> > 
> > the patch is a good start but has some problems
> > 
> > I spent 30 mins or so debugging and have attached a patch.
> > 
> > Find attached my patch (to your patch) which fixes most of the
> > problems.  Please note the ioctl to get the IFF flags is required.
> > The linux kernel doesn't copy them on the other ioctl operation
> > (even though I think it should, but oh well).
> > 
> > I also sanitized the up down state detection.
> > 
> > One problem still remains, however.  Perhaps you could look at this
> > test case and get it to work.
> > 
> > processor A, processor B within a configuration
> > ifdown the interface on processor A
> > * run testevs on processor A which generates no output (and doesn't
> >   order messages)
> > ifup the interface on processor A
> > run testevs on processor A which generates correct output (and
> >   orders messages).
> > 
> > fix the * 
> > 
> > It appears the processor doesn't order messages (maybe they are not
> > being self-delivered since the interface is now down).  We need to
> > get this to work for the case where the interface is downed, but we
> > still want operational applications on the processor.
> > 
> > Regards
> > -steve
> > 
> > On Thu, 2005-03-24 at 15:02, Muni Bajpai wrote:
> > > Hey Steve,
> > > 
> > > Please review the patch for defect 169. One issue is still
> > > outstanding on this patch, though.
> > > Here is the scenario:
> > > 
> > > 1.) Node 1 and Node 2 are up. Node 1 has the ckpt-wr invoked
> > > 2.) Node 2 ifdown
> > > 3.) Node 2 ifup and the cluster is reformed.
> > > 
> > > If the ckpt-wr is not running, step 3 above results in 2
> > > individual clusters, as the JOIN from Node 2 is never delivered to
> > > openais after it is taken off the wire by Node 1. I can't seem to
> > > figure out why that is happening.
> > > 
> > > Thanks
> > > 
> > > Muni
> > > 
> > > -----Original Message-----
> > > From: Steven Dake [mailto:sdake@mvista.com] 
> > > Sent: Wednesday, March 23, 2005 6:18 PM
> > > To: Bajpai, Muni [NGC:B670:EXCH]
> > > Cc: Smith, Kristen [NGC:B670:EXCH]; Openais List
> > > Subject: RE: Bug 169. Assert on Ifdown
> > > 
> > > 
> > > Muni
> > > Thanks for considering resurrecting the patch for this defect.
> > > All the work I did on defect 169 is in the attached tarball.
> > > 
> > > If you diff defect-169.orig and defect-169 you should be able to
> > > see the changes I made to support this.
> > > 
> > > This tarball also seems to include a bunch of other interim
> > > patches which you should not include.  I suggest making a patch
> > > against the new totemsrp code (keep in mind the defect was worked
> > > against gmi, the old ring protocol).  Most or all of the changes
> > > are in or related to the following symbols:
> > > 
> > > netif_down_report_down
> > > netif_down_check
> > > netif_determine
> > > build_sockets
> > > timer_function_netif_check_timeout
> > > 
> > > The asserts should be removed.
> > > 
> > > You will probably not be able to cleanly apply this patch but
> > > instead have to hand apply the changes and then do some debug work
> > > to clean it up.
> > > 
> > > The test case is to ifdown the interface in use while messages are
> > > being transmitted (one of the benchmark programs is a good message
> > > generator), then ifup the interface and make sure the ring reforms
> > > when the interface comes back up.
> > > 
> > > To keep the protocol running while the interface is down, you will
> > > need UDP messages to be self-delivered on the multicast port and
> > > unicast port.  This is required so that the membership protocol
> > > can form an OPERATIONAL configuration and then order messages
> > > based upon token reception.  I'm not sure how to ensure
> > > self-delivery of UDP messages without looking at the code in more
> > > detail.  As I recall, the current work closes the bound sockets as
> > > soon as the network interface is detected down.  Closing the fds
> > > would likely result in non-self delivery and also token timeouts
> > > since the token isn't being delivered.  This is the most difficult
> > > issue of the problem.
> > > 
> > > The solution in the previous patch was to still sendmsg the
> > > messages when the interface is down and ignore any errors that
> > > occur.  I am satisfied with this approach, but we must also
> > > continue to ensure the protocol stays in the OPERATIONAL state and
> > > executes the state machines as required.
> > > 
> > > Please note during development we tried not closing the socket.
> > > This works on some linux systems but not on others.  The only safe
> > > solution is to close the socket, then reopen it when the network
> > > interface is ifup'ed.
> > > 
> > > Thanks
> > > -steve
> > > 
> > > On Wed, 2005-03-23 at 15:54, Muni Bajpai wrote:
> > > > Sure, go ahead and send me the tarball, but if the real deal is
> > > > to keep the protocol running, what changes do you think that
> > > > would entail?
> > > > 
> > > > I'm not sure if I have the bandwidth to look at it, but send me
> > > > the code. Could you also cursorily enumerate the items that
> > > > require work for this fix?
> > > > 
> > > > This is a high priority for us to be truly a redundant system.
> > > > 
> > > > Once again thanks
> > > > 
> > > > Muni
> > > > 
> > > > 
> > > > 
> > > > -----Original Message-----
> > > > From: Steven Dake [mailto:sdake@mvista.com]
> > > > Sent: Wednesday, March 23, 2005 4:37 PM
> > > > To: Bajpai, Muni [NGC:B670:EXCH]
> > > > Cc: Smith, Kristen [NGC:B670:EXCH]; Openais List
> > > > Subject: Re: Bug 169. Assert on Ifdown
> > > > 
> > > > 
> > > > I have a patch sitting around on my hard disk somewhere.
> > > > 
> > > > Basically what it does is monitor the state of the network
> > > > device.  When it goes up, the protocol is started and bound to
> > > > the network interfaces.
> > > > 
> > > > When it goes down, the protocol is stopped.
> > > > 
> > > > This allows the interface to be upped or downed on command.
> > > > 
> > > > What I really want is a way for the protocol to operate when the
> > > > network interface is downed.  But then when it comes back up, it
> > > > will try to form a configuration.
> > > > 
> > > > If you're interested in resurrecting this patch, I can probably
> > > > send you a tarball to review the techniques for monitoring the
> > > > up/down state of the network device.
> > > > 
> > > > If not, I'll get to it after 188 and the sq_add assert is
> > > > resolved.
> > > > 
> > > > Regards
> > > > -steve

[Attachment #5 (text/html)]

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
<HTML>
<HEAD>
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=us-ascii">
<META NAME="Generator" CONTENT="MS Exchange Server version 5.5.2658.2">
<TITLE>Bug 169. Assert on Ifdown</TITLE>
</HEAD>
<BODY>

<P><FONT SIZE=2>On Mon, 2005-03-28 at 10:37, Muni Bajpai wrote:</FONT>
<BR><FONT SIZE=2>&gt; &gt; </FONT>
<BR><FONT SIZE=2>&gt; </FONT>
<BR><FONT SIZE=2>&gt; Muni</FONT>
<BR><FONT SIZE=2>&gt; </FONT>
<BR><FONT SIZE=2>&gt; I think the unicasts to self are delivered, just not the \
multicasts. </FONT> <BR><FONT SIZE=2>&gt; There is a flag which is specified in the \
bind command which tells the</FONT> <BR><FONT SIZE=2>&gt; stack to self-deliver \
multicast messages.&nbsp; It appears for some reason</FONT> <BR><FONT SIZE=2>&gt; \
this is not occurring.&nbsp; Maybe the socket couldn't be bound, or some</FONT> \
<BR><FONT SIZE=2>&gt; other</FONT> <BR><FONT SIZE=2>&gt; issue occurs.</FONT>
<BR><FONT SIZE=2>&gt; </FONT>
<BR><FONT SIZE=2>&gt; We need to have a way to ensure self delivery even after the \
interface</FONT> <BR><FONT SIZE=2>&gt; is downed.&nbsp; I'm not sure if there is some \
form of network trick we</FONT> <BR><FONT SIZE=2>&gt; could</FONT>
<BR><FONT SIZE=2>&gt; use..&nbsp; maybe open a localhost socket and use that for \
mcast.&nbsp; Or maybe</FONT> <BR><FONT SIZE=2>&gt; set the mcast transmit socket to \
the unicast socket and unicast</FONT> <BR><FONT SIZE=2>&gt; transmit</FONT>
<BR><FONT SIZE=2>&gt; multicast and membership messages.&nbsp; This last approach is \
probably the</FONT> <BR><FONT SIZE=2>&gt; easiest, requiring some minor changes \
to:</FONT> <BR><FONT SIZE=2>&gt; set the multicast fd to the unicast fd and</FONT>
<BR><FONT SIZE=2>&gt; set the multicast target address to the unicast address</FONT>
<BR><FONT SIZE=2>&gt; when:</FONT>
<BR><FONT SIZE=2>&gt; the ring has 1 or less members and</FONT>
<BR><FONT SIZE=2>&gt; the interface is detected as down</FONT>
<BR><FONT SIZE=2>&gt; </FONT>
<BR><FONT SIZE=2>&gt; &gt; Also while debugging I saw an anomaly where totemsrp was \
polling</FONT> <BR><FONT SIZE=2>&gt; &gt; The libais fd in addition to main.c. I \
don't know how that happened</FONT> <BR><FONT SIZE=2>&gt; &gt; But that was an easy \
fix where the totemsrp dispatch is started only</FONT> <BR><FONT SIZE=2>&gt; &gt; \
when the network is up.</FONT> <BR><FONT SIZE=2>&gt; </FONT>
<BR><FONT SIZE=2>&gt; Ok i think this is the root of our problems!&nbsp; This could \
indicate one</FONT> <BR><FONT SIZE=2>&gt; of</FONT>
<BR><FONT SIZE=2>&gt; the socket/bind calls didn't complete successfully during the \
binding</FONT> <BR><FONT SIZE=2>&gt; process.&nbsp; If this is the case, then later \
that same fd is added by the</FONT> <BR><FONT SIZE=2>&gt; libais \
dispatch.&nbsp;&nbsp; We should track down the source of this failure and</FONT> \
<BR><FONT SIZE=2>&gt; fix the error return...</FONT> <BR><FONT SIZE=2>&gt; </FONT>
<BR><FONT SIZE=2>&gt; We really want totemsrp to dispatch multicasts and unicasts \
even when</FONT> <BR><FONT SIZE=2>&gt; the network interface is down.&nbsp; We just \
have to figure out how this</FONT> <BR><FONT SIZE=2>&gt; should be done.</FONT>
<BR><FONT SIZE=2>&gt; </FONT>
<BR><FONT SIZE=2>&gt; Another possible option when the interface is down is to create \
a pipe</FONT> <BR><FONT SIZE=2>&gt; using the pipe system call and using one end for \
the send, and one end</FONT> <BR><FONT SIZE=2>&gt; for the receive/delivery.&nbsp; \
This way we can avoid the bind process</FONT> <BR><FONT SIZE=2>&gt; entirely.&nbsp; \
I'd rather not make major changes to any of the sendmsg</FONT> <BR><FONT SIZE=2>&gt; \
calls</FONT> <BR><FONT SIZE=2>&gt; in totem, but rather, would have this \
functionality in the initialize</FONT> <BR><FONT SIZE=2>&gt; phase or during network \
up/down detection.&nbsp; I'd like to keep the rest</FONT> <BR><FONT SIZE=2>&gt; \
of</FONT> <BR><FONT SIZE=2>&gt; the code as minimally impacted as possible.</FONT>
<BR><FONT SIZE=2>&gt; </FONT>
<BR><FONT SIZE=2>&gt; </FONT>
<BR><FONT SIZE=2>&gt; &gt; </FONT>
<BR><FONT SIZE=2>&gt; &gt; In rethinking this issue I feel that the best way to \
handle the</FONT> <BR><FONT SIZE=2>&gt; ifdown</FONT>
<BR><FONT SIZE=2>&gt; &gt; Is a graceful shutdown (This being different from network \
isolation</FONT> <BR><FONT SIZE=2>&gt; &gt; where</FONT>
<BR><FONT SIZE=2>&gt; &gt; A cluster of one is formed). Otherwise we have to \
guarantee all</FONT> <BR><FONT SIZE=2>&gt; &gt; outstanding</FONT>
<BR><FONT SIZE=2>&gt; &gt; Transactions requests be queued for retransmits.</FONT>
<BR><FONT SIZE=2>&gt; &gt; </FONT>
<BR><FONT SIZE=2>&gt; </FONT>
<BR><FONT SIZE=2>&gt; It is suboptimal to absolutely require a network interface for \
openais</FONT> <BR><FONT SIZE=2>&gt; to operate.&nbsp; I can think of situations \
where applications may not want</FONT> <BR><FONT SIZE=2>&gt; the network interface \
up, but still want to provide ais style</FONT> <BR><FONT SIZE=2>&gt; services. \
</FONT> <BR><FONT SIZE=2>&gt; The graceful shutdown approach would mean that ais \
would be unable to</FONT> <BR><FONT SIZE=2>&gt; provide any services unless a network \
interface is available.&nbsp; I am</FONT> <BR><FONT SIZE=2>&gt; not</FONT>
<BR><FONT SIZE=2>&gt; fond of this approach..</FONT>
<BR><FONT SIZE=2>&gt; </FONT>
<BR><FONT SIZE=2>&gt; &gt; What do you think the expected behavior should be on a nic \
failure</FONT> <BR><FONT SIZE=2>&gt; ??</FONT>
<BR><FONT SIZE=2>&gt; &gt; </FONT>
<BR><FONT SIZE=2>&gt; </FONT>
<BR><FONT SIZE=2>&gt; Nic failure (meaning that the nic is totally unable to \
transmit or</FONT> <BR><FONT SIZE=2>&gt; receive any data) should cause two \
configurations 1 with the processor</FONT> <BR><FONT SIZE=2>&gt; containing the \
defective nic but still be able to provide services to</FONT> <BR><FONT SIZE=2>&gt; \
applications.&nbsp; The remaining processors should be in the second</FONT> <BR><FONT \
SIZE=2>&gt; configuration.</FONT> <BR><FONT SIZE=2>&gt; </FONT>
<BR><FONT SIZE=2>&gt; What the applications do in this scenario should be \
application</FONT> <BR><FONT SIZE=2>&gt; specific.&nbsp; They could stonith the \
failed processor, or they could</FONT> <BR><FONT SIZE=2>&gt; continue to allow it to \
operate as a separate partition.&nbsp; At the</FONT> <BR><FONT SIZE=2>&gt; \
moment</FONT> <BR><FONT SIZE=2>&gt; it is not our concern to solve this particular \
application specific</FONT> <BR><FONT SIZE=2>&gt; issue..</FONT>
<BR><FONT SIZE=2>&gt; </FONT>
<BR><FONT SIZE=2>&gt; &gt; Thanks</FONT>
<BR><FONT SIZE=2>&gt; &gt; </FONT>
<BR><FONT SIZE=2>&gt; &gt; Muni</FONT>
<BR><FONT SIZE=2>&gt; &gt; -----Original Message-----</FONT>
<BR><FONT SIZE=2>&gt; &gt; From: Steven Dake [<A \
HREF="mailto:sdake@mvista.com">mailto:sdake@mvista.com</A>] </FONT> <BR><FONT \
SIZE=2>&gt; &gt; Sent: Thursday, March 24, 2005 5:58 PM</FONT> <BR><FONT SIZE=2>&gt; \
&gt; To: Bajpai, Muni [NGC:B670:EXCH]</FONT> <BR><FONT SIZE=2>&gt; &gt; Cc: Smith, \
Kristen [NGC:B670:EXCH]; Openais List</FONT> <BR><FONT SIZE=2>&gt; &gt; Subject: RE: \
Bug 169. Assert on Ifdown</FONT> <BR><FONT SIZE=2>&gt; &gt; </FONT>
<BR><FONT SIZE=2>&gt; &gt; Muni</FONT>
<BR><FONT SIZE=2>&gt; &gt; </FONT>
<BR><FONT SIZE=2>&gt; &gt; the patch is a good start but has some problems</FONT>
<BR><FONT SIZE=2>&gt; &gt; </FONT>
<BR><FONT SIZE=2>&gt; &gt; I spent 30 mins or so debugging and have attached a \
patch.</FONT> <BR><FONT SIZE=2>&gt; &gt; </FONT>
<BR><FONT SIZE=2>&gt; &gt; Find attached my patch (to your patch) which fixes most of \
the</FONT> <BR><FONT SIZE=2>&gt; &gt; problems.&nbsp; Please note the ioctl to get \
the IFF flags is required. </FONT> <BR><FONT SIZE=2>&gt; &gt; The</FONT>
<BR><FONT SIZE=2>&gt; &gt; linux kernel doesn't copy them on the other ioctl \
operation (even</FONT> <BR><FONT SIZE=2>&gt; &gt; though</FONT>
<BR><FONT SIZE=2>&gt; &gt; I think it should but oh well).</FONT>
<BR><FONT SIZE=2>&gt; &gt; </FONT>
<BR><FONT SIZE=2>&gt; &gt; I also sanitized the up down state detection.</FONT>
<BR><FONT SIZE=2>&gt; &gt; </FONT>
<BR><FONT SIZE=2>&gt; &gt; One problem still remains, however.&nbsp; Perhaps you \
could look at this</FONT> <BR><FONT SIZE=2>&gt; &gt; test</FONT>
<BR><FONT SIZE=2>&gt; &gt; case and get it to work.</FONT>
<BR><FONT SIZE=2>&gt; &gt; </FONT>
<BR><FONT SIZE=2>&gt; &gt; processor A, processor B within a configuration</FONT>
<BR><FONT SIZE=2>&gt; &gt; ifdown the interface on processor A</FONT>
<BR><FONT SIZE=2>&gt; &gt; * run testevs on processor A which generates no output \
(and doesn't</FONT> <BR><FONT SIZE=2>&gt; &gt; order messages)</FONT>
<BR><FONT SIZE=2>&gt; &gt; ifup the interface on processor A</FONT>
<BR><FONT SIZE=2>&gt; &gt; run testevs on processor A which generates correct output \
(and</FONT> <BR><FONT SIZE=2>&gt; orders</FONT>
<BR><FONT SIZE=2>&gt; &gt; messages).</FONT>
<BR><FONT SIZE=2>&gt; &gt; </FONT>
<BR><FONT SIZE=2>&gt; &gt; fix the * </FONT>
<BR><FONT SIZE=2>&gt; &gt; </FONT>
<BR><FONT SIZE=2>&gt; &gt; it appears the processor doesn't order messages (maybe \
they are not</FONT> <BR><FONT SIZE=2>&gt; &gt; being self-delivered since the \
interface is now down)..&nbsp; We need to</FONT> <BR><FONT SIZE=2>&gt; &gt; \
get</FONT> <BR><FONT SIZE=2>&gt; &gt; this to work for the case where the interface \
is downed, but we</FONT> <BR><FONT SIZE=2>&gt; still</FONT>
<BR><FONT SIZE=2>&gt; &gt; want operational applications on the processor.</FONT>
<BR><FONT SIZE=2>&gt; &gt; </FONT>
<BR><FONT SIZE=2>&gt; &gt; Regards</FONT>
<BR><FONT SIZE=2>&gt; &gt; -steve</FONT>
<BR><FONT SIZE=2>&gt; &gt; </FONT>
<BR><FONT SIZE=2>&gt; &gt; On Thu, 2005-03-24 at 15:02, Muni Bajpai wrote:</FONT>
<BR><FONT SIZE=2>&gt; &gt; &gt; Hey Steve,</FONT>
<BR><FONT SIZE=2>&gt; &gt; &gt; </FONT>
<BR><FONT SIZE=2>&gt; &gt; &gt; Please review the patch for defect 169. One issue is \
still</FONT> <BR><FONT SIZE=2>&gt; &gt; outstanding</FONT>
<BR><FONT SIZE=2>&gt; &gt; &gt; on this patch though.</FONT>
<BR><FONT SIZE=2>&gt; &gt; &gt; Here is the scenario</FONT>
<BR><FONT SIZE=2>&gt; &gt; &gt; </FONT>
<BR><FONT SIZE=2>&gt; &gt; &gt; 1.) Node 1 and Node 2 are up. Node 1 has the ckpt-wr \
invoked</FONT> <BR><FONT SIZE=2>&gt; &gt; &gt; 2.) Node2 ifdown</FONT>
<BR><FONT SIZE=2>&gt; &gt; &gt; 3.) Node2 ifup and the cluster is reformed.</FONT>
<BR><FONT SIZE=2>&gt; &gt; &gt; </FONT>
<BR><FONT SIZE=2>&gt; &gt; &gt; If the ckpt-wr is not running the step 3 above \
results in 2</FONT> <BR><FONT SIZE=2>&gt; &gt; individual</FONT>
<BR><FONT SIZE=2>&gt; &gt; &gt; clusters as the JOIN from node2 is never delivered to \
openais</FONT> <BR><FONT SIZE=2>&gt; after</FONT>
<BR><FONT SIZE=2>&gt; &gt; it</FONT>
<BR><FONT SIZE=2>&gt; &gt; &gt; is taken off the wire by Node1. I can't seem to figure \
out why that</FONT> <BR><FONT SIZE=2>&gt; &gt; is</FONT>
<BR><FONT SIZE=2>&gt; &gt; &gt; happening ?</FONT>
<BR><FONT SIZE=2>&gt; &gt; &gt; </FONT>
<BR><FONT SIZE=2>&gt; &gt; &gt; Thanks</FONT>
<BR><FONT SIZE=2>&gt; &gt; &gt; </FONT>
<BR><FONT SIZE=2>&gt; &gt; &gt; Muni</FONT>
<BR><FONT SIZE=2>&gt; &gt; &gt; </FONT>
<BR><FONT SIZE=2>&gt; &gt; &gt; -----Original Message-----</FONT>
<BR><FONT SIZE=2>&gt; &gt; &gt; From: Steven Dake [<A \
HREF="mailto:sdake@mvista.com">mailto:sdake@mvista.com</A>] </FONT> <BR><FONT \
SIZE=2>&gt; &gt; &gt; Sent: Wednesday, March 23, 2005 6:18 PM</FONT> <BR><FONT \
SIZE=2>&gt; &gt; &gt; To: Bajpai, Muni [NGC:B670:EXCH]</FONT> <BR><FONT SIZE=2>&gt; \
&gt; &gt; Cc: Smith, Kristen [NGC:B670:EXCH]; Openais List</FONT> <BR><FONT \
SIZE=2>&gt; &gt; &gt; Subject: RE: Bug 169. Assert on Ifdown</FONT> <BR><FONT \
SIZE=2>&gt; &gt; &gt; </FONT> <BR><FONT SIZE=2>&gt; &gt; &gt; </FONT>
<BR><FONT SIZE=2>&gt; &gt; &gt; Muni</FONT>
<BR><FONT SIZE=2>&gt; &gt; &gt; Thanks for considering resurrecting the patch for \
this defect. </FONT> <BR><FONT SIZE=2>&gt; All</FONT>
<BR><FONT SIZE=2>&gt; &gt; &gt; the work I did on defect 169 is in the attached \
tarball.</FONT> <BR><FONT SIZE=2>&gt; &gt; &gt; </FONT>
<BR><FONT SIZE=2>&gt; &gt; &gt; If you diff defect-169.orig and defect-169 you should \
be able to</FONT> <BR><FONT SIZE=2>&gt; see</FONT>
<BR><FONT SIZE=2>&gt; &gt; &gt; the changes I made to support this.</FONT>
<BR><FONT SIZE=2>&gt; &gt; &gt; </FONT>
<BR><FONT SIZE=2>&gt; &gt; &gt; This tarball also seems to include a bunch of other \
interim</FONT> <BR><FONT SIZE=2>&gt; patches</FONT>
<BR><FONT SIZE=2>&gt; &gt; &gt; which you should not include.&nbsp; I suggest making \
a patch against</FONT> <BR><FONT SIZE=2>&gt; the</FONT>
<BR><FONT SIZE=2>&gt; &gt; &gt; new totemsrp code (keep in mind the defect was worked \
against gmi,</FONT> <BR><FONT SIZE=2>&gt; &gt; the</FONT>
<BR><FONT SIZE=2>&gt; &gt; &gt; old ring protocol).&nbsp; Most or all of the changes \
are in or related</FONT> <BR><FONT SIZE=2>&gt; to</FONT>
<BR><FONT SIZE=2>&gt; &gt; &gt; the following symbols:</FONT>
<BR><FONT SIZE=2>&gt; &gt; &gt; </FONT>
<BR><FONT SIZE=2>&gt; &gt; &gt; netif_down_report_down</FONT>
<BR><FONT SIZE=2>&gt; &gt; &gt; netif_down_check</FONT>
<BR><FONT SIZE=2>&gt; &gt; &gt; netif_determine</FONT>
<BR><FONT SIZE=2>&gt; &gt; &gt; build_sockets</FONT>
<BR><FONT SIZE=2>&gt; &gt; &gt; timer_function_netif_check_timeout</FONT>
<BR><FONT SIZE=2>&gt; &gt; &gt; </FONT>
<BR><FONT SIZE=2>&gt; &gt; &gt; The asserts should be removed.</FONT>
<BR><FONT SIZE=2>&gt; &gt; &gt; </FONT>
<BR><FONT SIZE=2>&gt; &gt; &gt; You will probably not be able to cleanly apply this \
patch but</FONT> <BR><FONT SIZE=2>&gt; &gt; instead</FONT>
<BR><FONT SIZE=2>&gt; &gt; &gt; have to hand apply the changes and then do some debug \
work to</FONT> <BR><FONT SIZE=2>&gt; clean</FONT>
<BR><FONT SIZE=2>&gt; &gt; it</FONT>
<BR><FONT SIZE=2>&gt; &gt; &gt; up.</FONT>
<BR><FONT SIZE=2>&gt; &gt; &gt; </FONT>
<BR><FONT SIZE=2>&gt; &gt; &gt; The test case is to ifdown the interface in use while \
messages are</FONT> <BR><FONT SIZE=2>&gt; &gt; &gt; being transmitted (one of the \
benchmark programs is a good message</FONT> <BR><FONT SIZE=2>&gt; &gt; &gt; \
generator), then ifup the interface and make sure the ring reforms</FONT> <BR><FONT \
SIZE=2>&gt; &gt; &gt; when the interface comes back up.</FONT> <BR><FONT SIZE=2>&gt; \
&gt; &gt; </FONT> <BR><FONT SIZE=2>&gt; &gt; &gt; To keep the protocol running while \
the interface is down, you will</FONT> <BR><FONT SIZE=2>&gt; &gt; &gt; need UDP \
messages to be self-delivered on the multicast port and</FONT> <BR><FONT SIZE=2>&gt; \
&gt; &gt; unicast port.&nbsp; This is required so that the membership protocol</FONT> \
<BR><FONT SIZE=2>&gt; can</FONT> <BR><FONT SIZE=2>&gt; &gt; &gt; form an OPERATIONAL \
configuration and then order messages based</FONT> <BR><FONT SIZE=2>&gt; upon</FONT>
<BR><FONT SIZE=2>&gt; &gt; &gt; token reception.&nbsp; I'm not sure how to ensure \
self-delivery of UDP</FONT> <BR><FONT SIZE=2>&gt; &gt; &gt; messages without looking \
at the code in more detail.&nbsp; As I recall,</FONT> <BR><FONT SIZE=2>&gt; &gt; \
the</FONT> <BR><FONT SIZE=2>&gt; &gt; &gt; current work closes the bound sockets as \
soon as the network</FONT> <BR><FONT SIZE=2>&gt; &gt; interface</FONT>
<BR><FONT SIZE=2>&gt; &gt; &gt; is detected down.&nbsp; Closing the fds would likely \
result in non-self</FONT> <BR><FONT SIZE=2>&gt; &gt; &gt; delivery and also token \
timeouts since the token isn't being</FONT> <BR><FONT SIZE=2>&gt; &gt; &gt; \
delivered.&nbsp; This is the most difficult issue of the problem.</FONT> <BR><FONT \
SIZE=2>&gt; &gt; &gt; </FONT> <BR><FONT SIZE=2>&gt; &gt; &gt; The solution in the \
previous patch was to still sendmsg the</FONT> <BR><FONT SIZE=2>&gt; messages</FONT>
<BR><FONT SIZE=2>&gt; &gt; &gt; when the interface is down and ignore any errors that \
occur.&nbsp; I am</FONT> <BR><FONT SIZE=2>&gt; &gt; &gt; satisfied with this \
approach, but we must also continue to ensure</FONT> <BR><FONT SIZE=2>&gt; &gt; \
the</FONT> <BR><FONT SIZE=2>&gt; &gt; &gt; protocol stays in the OPERATIONAL state \
and executes the state</FONT> <BR><FONT SIZE=2>&gt; &gt; &gt; machines as \
required.</FONT> <BR><FONT SIZE=2>&gt; &gt; &gt; </FONT>
<BR><FONT SIZE=2>&gt; &gt; &gt; Please note during development we tried not closing \
the socket. </FONT> <BR><FONT SIZE=2>&gt; &gt; This</FONT>
<BR><FONT SIZE=2>&gt; &gt; &gt; works on some linux systems but not on others.&nbsp; \
The only safe</FONT> <BR><FONT SIZE=2>&gt; &gt; solution</FONT>
<BR><FONT SIZE=2>&gt; &gt; &gt; is to close the socket, then reopen it when the \
network interface</FONT> <BR><FONT SIZE=2>&gt; is</FONT>
<BR><FONT SIZE=2>&gt; &gt; &gt; ifup'ed.</FONT>
<BR><FONT SIZE=2>&gt; &gt; &gt; </FONT>
<BR><FONT SIZE=2>&gt; &gt; &gt; Thanks</FONT>
<BR><FONT SIZE=2>&gt; &gt; &gt; -steve</FONT>
<BR><FONT SIZE=2>&gt; &gt; &gt; </FONT>
<BR><FONT SIZE=2>&gt; &gt; &gt; On Wed, 2005-03-23 at 15:54, Muni Bajpai \
wrote:</FONT> <BR><FONT SIZE=2>&gt; &gt; &gt; &gt; Sure go ahead and send me the \
tarball, but if the real deal is</FONT> <BR><FONT SIZE=2>&gt; to</FONT>
<BR><FONT SIZE=2>&gt; &gt; &gt; let </FONT>
<BR><FONT SIZE=2>&gt; &gt; &gt; &gt; the protocol running what changes do you think \
would entail that</FONT> <BR><FONT SIZE=2>&gt; ?</FONT>
<BR><FONT SIZE=2>&gt; &gt; &gt; &gt; </FONT>
<BR><FONT SIZE=2>&gt; &gt; &gt; &gt; I'm not sure if I have the bandwidth to look at \
it but send me</FONT> <BR><FONT SIZE=2>&gt; the</FONT>
<BR><FONT SIZE=2>&gt; &gt; &gt; &gt; code. Could you also cursorily enumerate the \
items that require</FONT> <BR><FONT SIZE=2>&gt; &gt; work</FONT>
<BR><FONT SIZE=2>&gt; &gt; &gt; &gt; for this fix</FONT>
<BR><FONT SIZE=2>&gt; &gt; &gt; &gt; </FONT>
<BR><FONT SIZE=2>&gt; &gt; &gt; &gt; This is a high priority for us to be truly a \
redundant system.</FONT> <BR><FONT SIZE=2>&gt; &gt; &gt; &gt; </FONT>
<BR><FONT SIZE=2>&gt; &gt; &gt; &gt; Once again thanks</FONT>
<BR><FONT SIZE=2>&gt; &gt; &gt; &gt; </FONT>
<BR><FONT SIZE=2>&gt; &gt; &gt; &gt; Muni</FONT>
<BR><FONT SIZE=2>&gt; &gt; &gt; &gt; </FONT>
<BR><FONT SIZE=2>&gt; &gt; &gt; &gt; </FONT>
<BR><FONT SIZE=2>&gt; &gt; &gt; &gt; </FONT>
<BR><FONT SIZE=2>&gt; &gt; &gt; &gt; -----Original Message-----</FONT>
<BR><FONT SIZE=2>&gt; &gt; &gt; &gt; From: Steven Dake [<A \
HREF="mailto:sdake@mvista.com">mailto:sdake@mvista.com</A>]</FONT> <BR><FONT \
SIZE=2>&gt; &gt; &gt; &gt; Sent: Wednesday, March 23, 2005 4:37 PM</FONT> <BR><FONT \
SIZE=2>&gt; &gt; &gt; &gt; To: Bajpai, Muni [NGC:B670:EXCH]</FONT> <BR><FONT \
SIZE=2>&gt; &gt; &gt; &gt; Cc: Smith, Kristen [NGC:B670:EXCH]; Openais List</FONT> \
<BR><FONT SIZE=2>&gt; &gt; &gt; &gt; Subject: Re: Bug 169. Assert on Ifdown</FONT> \
<BR><FONT SIZE=2>&gt; &gt; &gt; &gt; </FONT> <BR><FONT SIZE=2>&gt; &gt; &gt; &gt; \
</FONT> <BR><FONT SIZE=2>&gt; &gt; &gt; &gt; I have a patch sitting around on my \
harddisk somewhere</FONT> <BR><FONT SIZE=2>&gt; &gt; &gt; &gt; </FONT>
<BR><FONT SIZE=2>&gt; &gt; &gt; &gt; Basically what it does is monitor the state of \
the network</FONT> <BR><FONT SIZE=2>&gt; device.</FONT>
<BR><FONT SIZE=2>&gt; &gt; &gt; &gt; When it goes up, the protocol is started and \
bound to the</FONT> <BR><FONT SIZE=2>&gt; network</FONT>
<BR><FONT SIZE=2>&gt; &gt; &gt; &gt; interfaces.</FONT>
<BR><FONT SIZE=2>&gt; &gt; &gt; &gt; </FONT>
<BR><FONT SIZE=2>&gt; &gt; &gt; &gt; When it goes down, the protocol is \
stopped.</FONT> <BR><FONT SIZE=2>&gt; &gt; &gt; &gt; </FONT>
<BR><FONT SIZE=2>&gt; &gt; &gt; &gt; This allows the interface to be upped or downed \
on command.</FONT> <BR><FONT SIZE=2>&gt; &gt; &gt; &gt; </FONT>
<BR><FONT SIZE=2>&gt; &gt; &gt; &gt; What I really want is a way for the protocol to \
operate when the</FONT> <BR><FONT SIZE=2>&gt; &gt; &gt; &gt; network interface is \
downed.&nbsp; But then when it comes back up, it</FONT> <BR><FONT SIZE=2>&gt; &gt; \
&gt; will </FONT> <BR><FONT SIZE=2>&gt; &gt; &gt; &gt; try to form a \
configuration.</FONT> <BR><FONT SIZE=2>&gt; &gt; &gt; &gt; </FONT>
<BR><FONT SIZE=2>&gt; &gt; &gt; &gt; If you're interested in resurrecting this patch, I \
can probably</FONT> <BR><FONT SIZE=2>&gt; send</FONT>
<BR><FONT SIZE=2>&gt; &gt; &gt; you </FONT>
<BR><FONT SIZE=2>&gt; &gt; &gt; &gt; a tarball to review the techniques for \
monitoring the up/down</FONT> <BR><FONT SIZE=2>&gt; &gt; state</FONT>
<BR><FONT SIZE=2>&gt; &gt; &gt; of </FONT>
<BR><FONT SIZE=2>&gt; &gt; &gt; &gt; the network device...</FONT>
<BR><FONT SIZE=2>&gt; &gt; &gt; &gt; </FONT>
<BR><FONT SIZE=2>&gt; &gt; &gt; &gt; If not, I'll get to it after 188 and the sq_add \
assert is</FONT> <BR><FONT SIZE=2>&gt; &gt; resolved.</FONT>
<BR><FONT SIZE=2>&gt; &gt; &gt; &gt; </FONT>
<BR><FONT SIZE=2>&gt; &gt; &gt; &gt; Regards</FONT>
<BR><FONT SIZE=2>&gt; &gt; &gt; &gt; -steve</FONT>
<BR><FONT SIZE=2>&gt; &gt; &gt; &gt; </FONT>
<BR><FONT SIZE=2>&gt; &gt; &gt; </FONT>
<BR><FONT SIZE=2>&gt; &gt; &gt;&nbsp; </FONT>
<BR><FONT SIZE=2>&gt; &gt; </FONT>
<BR><FONT SIZE=2>&gt; </FONT>
<BR><FONT SIZE=2>&gt; </FONT>
<BR><FONT SIZE=2>&gt;&nbsp; </FONT>
</P>
<BR>

<P><FONT FACE="Arial" SIZE=2 COLOR="#000000"></FONT>&nbsp;

</BODY>
</HTML>
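
One option raised in the quoted thread is to create a pipe when the interface is down, using one end for the send and the other for receive/delivery, so the bind process is avoided. A minimal sketch of that idea follows. The `loop_channel` type and its functions are hypothetical names for illustration and are not part of totemsrp. Note that a pipe is a byte stream, so unlike UDP it does not preserve message boundaries; real code would need its own framing.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical loopback channel: one pipe end sends, the other delivers. */
struct loop_channel {
	int recv_fd;	/* read end, would be handed to the poll loop */
	int send_fd;	/* write end, used in place of the network socket */
};

/* Create the pipe; returns 0 on success, -1 on failure (errno is set). */
int loop_channel_open (struct loop_channel *ch)
{
	int fds[2];

	assert (ch != NULL);
	if (pipe (fds) == -1) {
		return (-1);
	}
	ch->recv_fd = fds[0];
	ch->send_fd = fds[1];
	return (0);
}

/* Queue a message for self-delivery; returns bytes written or -1. */
ssize_t loop_channel_send (struct loop_channel *ch, const void *msg, size_t len)
{
	return (write (ch->send_fd, msg, len));
}

/* Read whatever has been queued; returns bytes read or -1. */
ssize_t loop_channel_recv (struct loop_channel *ch, void *buf, size_t len)
{
	return (read (ch->recv_fd, buf, len));
}

/* Release both ends, e.g. when the interface comes back up. */
void loop_channel_close (struct loop_channel *ch)
{
	close (ch->recv_fd);
	close (ch->send_fd);
}
```

In this sketch the swap would happen only at initialization or on up/down detection, which matches the stated wish to leave the sendmsg call sites largely untouched.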


["defect-169_rev5.patch" (application/octet-stream)]
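
The patch below mechanically replaces `this_ip.sin_addr` with `this_ip->sin_addr`. The sketch here illustrates why that matters, and why a `memcpy` taken at init time (the `clm_init` issue mentioned in the cover note) still goes stale. `bound_interface`, `service_init`, `interface_rebind`, `source_is_local` and `cached_node_id` are hypothetical stand-ins, not openais symbols.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <string.h>

/* Hypothetical stand-in for the interface record owned by the transport. */
struct sockaddr_in bound_interface;

/* As in the patch: this_ip is a pointer, here aimed at that record. */
struct sockaddr_in *this_ip = &bound_interface;

/* Hypothetical service state that snapshots the address once at init. */
struct in_addr cached_node_id;

/* Copies the current address; the copy does not follow later rebinds. */
void service_init (void)
{
	memcpy (&cached_node_id, &this_ip->sin_addr, sizeof (struct in_addr));
}

/* Simulates the bind code updating the interface record in place. */
void interface_rebind (const char *dotted_quad)
{
	int res;

	bound_interface.sin_family = AF_INET;
	res = inet_pton (AF_INET, dotted_quad, &bound_interface.sin_addr);
	assert (res == 1);
}

/* The comparison used throughout the patch, reading through the pointer. */
int source_is_local (struct in_addr source)
{
	return (source.s_addr == this_ip->sin_addr.s_addr);
}
```

After a rebind from 10.1.1.1 to 127.0.0.1, `source_is_local` sees the new address because it dereferences `this_ip` on every call, while `cached_node_id` still holds 10.1.1.1.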

diff -uNr --exclude=SCCS --exclude=BitKeeper --exclude=ChangeSet --exclude=init \
--exclude=LICENSE --exclude=Makefile --exclude=man --exclude=README.devmap \
--exclude=SECURITY --exclude=TODO --exclude=CHANGELOG --exclude=conf --exclude=loc \
--exclude=Makefile.samples --exclude=QUICKSTART --exclude=test --exclude=.cdtproject \
                --exclude=.project latest/exec/amf.c openais/exec/amf.c
--- latest/exec/amf.c	2005-03-28 17:38:29.000000000 -0600
+++ openais/exec/amf.c	2005-03-28 18:22:25.000000000 -0600
@@ -1847,7 +1847,7 @@
 
 void amf_confchg_njoin (struct saAmfComponent *component ,void *data)
 {
-	if (component->source_addr.s_addr != this_ip.sin_addr.s_addr) {
+	if (component->source_addr.s_addr != this_ip->sin_addr.s_addr) {
 		return;
 	}
 
@@ -2054,7 +2054,7 @@
 	 * If this node originated the request to the cluster, respond back
 	 * to the AMF library
 	 */
-	if (req_exec_amf_componentregister->source.in_addr.s_addr == this_ip.sin_addr.s_addr) {
+	if (req_exec_amf_componentregister->source.in_addr.s_addr == this_ip->sin_addr.s_addr) {
 		if (error == SA_OK) {
 			component->local = 1;
 			req_exec_amf_componentregister->source.conn_info->component = component;
@@ -2096,7 +2096,7 @@
 	amfProxyComponent = findComponent (&req_exec_amf_componentregister->req_lib_amf_componentregister.proxyCompName);
 
 	/* If this node is Component onwer */
-	if (component->source_addr.s_addr == this_ip.sin_addr.s_addr) {
+	if (component->source_addr.s_addr == this_ip->sin_addr.s_addr) {
 
 		/* No Operation */
 		return;
@@ -2191,7 +2191,7 @@
 	 * If this node originated the request to the cluster, respond back
 	 * to the AMF library
 	 */
-	if (req_exec_amf_componentunregister->source.in_addr.s_addr == this_ip.sin_addr.s_addr) {
+	if (req_exec_amf_componentunregister->source.in_addr.s_addr == this_ip->sin_addr.s_addr) {
 		log_printf (LOG_LEVEL_DEBUG, "sending component unregister response to fd %d\n",
 			req_exec_amf_componentunregister->source.conn_info->fd);
 
@@ -2232,7 +2232,7 @@
 	 * If this node originated the request to the cluster, respond back
 	 * to the AMF library
 	 */
-	if (req_exec_amf_errorreport->source.in_addr.s_addr == this_ip.sin_addr.s_addr) {
+	if (req_exec_amf_errorreport->source.in_addr.s_addr == this_ip->sin_addr.s_addr) {
 		log_printf (LOG_LEVEL_DEBUG, "sending error report response to fd %d\n",
 			req_exec_amf_errorreport->source.conn_info->fd);
 
@@ -2275,7 +2275,7 @@
 	 * If this node originated the request to the cluster, respond back
 	 * to the AMF library
 	 */
-	if (req_exec_amf_errorcancelall->source.in_addr.s_addr == this_ip.sin_addr.s_addr) {
+	if (req_exec_amf_errorcancelall->source.in_addr.s_addr == this_ip->sin_addr.s_addr) {
 		log_printf (LOG_LEVEL_DEBUG, "sending error report response to fd %d\n",
 			req_exec_amf_errorcancelall->source.conn_info->fd);
 
@@ -2410,7 +2410,7 @@
 	req_exec_amf_componentregister.header.id = MESSAGE_REQ_EXEC_AMF_COMPONENTREGISTER;
 
 	req_exec_amf_componentregister.source.conn_info = conn_info;
-	req_exec_amf_componentregister.source.in_addr.s_addr = this_ip.sin_addr.s_addr;
+	req_exec_amf_componentregister.source.in_addr.s_addr = this_ip->sin_addr.s_addr;
 	memcpy (&req_exec_amf_componentregister.req_lib_amf_componentregister,
 		req_lib_amf_componentregister,
 		sizeof (struct req_lib_amf_componentregister));
@@ -2435,7 +2435,7 @@
 	req_exec_amf_componentunregister.header.id = MESSAGE_REQ_EXEC_AMF_COMPONENTUNREGISTER;
 
 	req_exec_amf_componentunregister.source.conn_info = conn_info;
-	req_exec_amf_componentunregister.source.in_addr.s_addr = this_ip.sin_addr.s_addr;
+	req_exec_amf_componentunregister.source.in_addr.s_addr = this_ip->sin_addr.s_addr;
 	memcpy (&req_exec_amf_componentunregister.req_lib_amf_componentunregister,
 		req_lib_amf_componentunregister,
 		sizeof (struct req_lib_amf_componentunregister));
@@ -2616,7 +2616,7 @@
 	req_exec_amf_errorreport.header.id = MESSAGE_REQ_EXEC_AMF_ERRORREPORT;
 
 	req_exec_amf_errorreport.source.conn_info = conn_info;
-	req_exec_amf_errorreport.source.in_addr.s_addr = this_ip.sin_addr.s_addr;
+	req_exec_amf_errorreport.source.in_addr.s_addr = this_ip->sin_addr.s_addr;
 	memcpy (&req_exec_amf_errorreport.req_lib_amf_errorreport,
 		req_lib_amf_errorreport,
 		sizeof (struct req_lib_amf_errorreport));
@@ -2646,7 +2646,7 @@
 	req_exec_amf_errorcancelall.header.id = MESSAGE_REQ_EXEC_AMF_ERRORCANCELALL;
 
 	req_exec_amf_errorcancelall.source.conn_info = conn_info;
-	req_exec_amf_errorcancelall.source.in_addr.s_addr = this_ip.sin_addr.s_addr;
+	req_exec_amf_errorcancelall.source.in_addr.s_addr = this_ip->sin_addr.s_addr;
 	memcpy (&req_exec_amf_errorcancelall.req_lib_amf_errorcancelall,
 		req_lib_amf_errorcancelall,
 		sizeof (struct req_lib_amf_errorcancelall));
diff -uNr --exclude=SCCS --exclude=BitKeeper --exclude=ChangeSet --exclude=init \
--exclude=LICENSE --exclude=Makefile --exclude=man --exclude=README.devmap \
--exclude=SECURITY --exclude=TODO --exclude=CHANGELOG --exclude=conf --exclude=loc \
--exclude=Makefile.samples --exclude=QUICKSTART --exclude=test --exclude=.cdtproject \
                --exclude=.project latest/exec/ckpt.c openais/exec/ckpt.c
--- latest/exec/ckpt.c	2005-03-28 17:38:29.000000000 -0600
+++ openais/exec/ckpt.c	2005-03-28 18:24:46.000000000 -0600
@@ -516,7 +516,7 @@
 					memcpy(&request_exec_sync_state.sectionDescriptor,
 							&ckptCheckpointSection->sectionDescriptor,
 							sizeof(SaCkptSectionDescriptorT));						
-					memcpy(&request_exec_sync_state.source_addr, &this_ip.sin_addr, sizeof(struct in_addr));
+					memcpy(&request_exec_sync_state.source_addr, &this_ip->sin_addr, sizeof(struct in_addr));
 
 					memcpy(request_exec_sync_state.ckpt_refcount,
 							checkpoint->ckpt_refcount,
@@ -938,7 +938,7 @@
 	 *  Initialize the saved ring ID.
 	 */
 	saved_ring_id.seq = 0;
-	saved_ring_id.rep.s_addr = this_ip.sin_addr.s_addr;		
+	saved_ring_id.rep.s_addr = this_ip->sin_addr.s_addr;		
 	
 #ifdef TODO
 	int res;
@@ -1120,7 +1120,7 @@
 	/*
 	 * If this node was the source of the message, respond to this node
 	 */
-	if (req_exec_ckpt_checkpointopen->source.in_addr.s_addr == this_ip.sin_addr.s_addr) {
+	if (req_exec_ckpt_checkpointopen->source.in_addr.s_addr == this_ip->sin_addr.s_addr) {
 		res_lib_ckpt_checkpointopen.header.size = sizeof (struct res_lib_ckpt_checkpointopen);
 		res_lib_ckpt_checkpointopen.header.id = MESSAGE_RES_CKPT_CHECKPOINT_CHECKPOINTOPEN;
 		res_lib_ckpt_checkpointopen.header.error = error;
@@ -1399,7 +1399,7 @@
 	}
 	
 error_exit:
-	if (req_exec_ckpt_checkpointclose->source.in_addr.s_addr == this_ip.sin_addr.s_addr) {
+	if (req_exec_ckpt_checkpointclose->source.in_addr.s_addr == this_ip->sin_addr.s_addr) {
 		ckpt_checkpoint_remove_cleanup (req_exec_ckpt_checkpointclose->source.conn_info,
 			checkpoint);
 
@@ -1447,7 +1447,7 @@
 	/*
 	 * If this node was the source of the message, respond to this node
 	 */
-	if (req_exec_ckpt_checkpointunlink->source.in_addr.s_addr == this_ip.sin_addr.s_addr) {
+	if (req_exec_ckpt_checkpointunlink->source.in_addr.s_addr == this_ip->sin_addr.s_addr) {
 		res_lib_ckpt_checkpointunlink.header.size = sizeof (struct res_lib_ckpt_checkpointunlink);
 		res_lib_ckpt_checkpointunlink.header.id = MESSAGE_RES_CKPT_CHECKPOINT_CHECKPOINTUNLINK;
 		res_lib_ckpt_checkpointunlink.header.error = error;
@@ -1483,7 +1483,7 @@
 	/*
 	 * Respond to library if this processor sent the duration set request
 	 */
-	if (req_exec_ckpt_checkpointretentiondurationset->source.in_addr.s_addr == this_ip.sin_addr.s_addr) {
+	if (req_exec_ckpt_checkpointretentiondurationset->source.in_addr.s_addr == this_ip->sin_addr.s_addr) {
 		res_lib_ckpt_checkpointretentiondurationset.header.size = sizeof (struct res_lib_ckpt_checkpointretentiondurationset);
 		res_lib_ckpt_checkpointretentiondurationset.header.id = MESSAGE_RES_CKPT_CHECKPOINT_CHECKPOINTRETENTIONDURATIONSET;
 		res_lib_ckpt_checkpointretentiondurationset.header.error = SA_AIS_OK;
@@ -1717,7 +1717,7 @@
 		&ckptCheckpoint->checkpointSectionsListHead);
 
 error_exit:
-	if (req_exec_ckpt_sectioncreate->source.in_addr.s_addr == this_ip.sin_addr.s_addr) {
+	if (req_exec_ckpt_sectioncreate->source.in_addr.s_addr == this_ip->sin_addr.s_addr) {
 		res_lib_ckpt_sectioncreate.header.size = sizeof (struct res_lib_ckpt_sectioncreate);
 		res_lib_ckpt_sectioncreate.header.id = MESSAGE_RES_CKPT_CHECKPOINT_SECTIONCREATE;
 		res_lib_ckpt_sectioncreate.header.error = error;
@@ -1771,7 +1771,7 @@
 	 * return result to CKPT library
 	 */
 error_exit:
-	if (req_exec_ckpt_sectiondelete->source.in_addr.s_addr == this_ip.sin_addr.s_addr) {
+	if (req_exec_ckpt_sectiondelete->source.in_addr.s_addr == this_ip->sin_addr.s_addr) {
 		res_lib_ckpt_sectiondelete.header.size = sizeof (struct res_lib_ckpt_sectiondelete);
 		res_lib_ckpt_sectiondelete.header.id = MESSAGE_RES_CKPT_CHECKPOINT_SECTIONDELETE;
 		res_lib_ckpt_sectiondelete.header.error = error;
@@ -1833,7 +1833,7 @@
 	}
 
 error_exit:
-	if (req_exec_ckpt_sectionexpirationtimeset->source.in_addr.s_addr == this_ip.sin_addr.s_addr) {
+	if (req_exec_ckpt_sectionexpirationtimeset->source.in_addr.s_addr == this_ip->sin_addr.s_addr) {
 		res_lib_ckpt_sectionexpirationtimeset.header.size = sizeof (struct res_lib_ckpt_sectionexpirationtimeset);
 		res_lib_ckpt_sectionexpirationtimeset.header.id = MESSAGE_RES_CKPT_CHECKPOINT_SECTIONEXPIRATIONTIMESET;
 		res_lib_ckpt_sectionexpirationtimeset.header.error = error;
@@ -1967,7 +1967,7 @@
 	 * Write write response to CKPT library
 	 */
 error_exit:
-	if (req_exec_ckpt_sectionwrite->source.in_addr.s_addr == this_ip.sin_addr.s_addr) {
+	if (req_exec_ckpt_sectionwrite->source.in_addr.s_addr == this_ip->sin_addr.s_addr) {
 		res_lib_ckpt_sectionwrite.header.size = sizeof (struct res_lib_ckpt_sectionwrite);
 		res_lib_ckpt_sectionwrite.header.id = MESSAGE_RES_CKPT_CHECKPOINT_SECTIONWRITE;
 		res_lib_ckpt_sectionwrite.header.error = error;
@@ -2040,7 +2040,7 @@
 	 * return result to CKPT library
 	 */
 error_exit:
-	if (req_exec_ckpt_sectionoverwrite->source.in_addr.s_addr == this_ip.sin_addr.s_addr) {
+	if (req_exec_ckpt_sectionoverwrite->source.in_addr.s_addr == this_ip->sin_addr.s_addr) {
 		res_lib_ckpt_sectionoverwrite.header.size = sizeof (struct res_lib_ckpt_sectionoverwrite);
 		res_lib_ckpt_sectionoverwrite.header.id = MESSAGE_RES_CKPT_CHECKPOINT_SECTIONOVERWRITE;
 		res_lib_ckpt_sectionoverwrite.header.error = error;
@@ -2107,7 +2107,7 @@
 	 * Write read response to CKPT library
 	 */
 error_exit:
-	if (req_exec_ckpt_sectionread->source.in_addr.s_addr == this_ip.sin_addr.s_addr) {
+	if (req_exec_ckpt_sectionread->source.in_addr.s_addr == this_ip->sin_addr.s_addr) {
 		res_lib_ckpt_sectionread.header.size = sizeof (struct res_lib_ckpt_sectionread) + sectionSize;
 		res_lib_ckpt_sectionread.header.id = MESSAGE_RES_CKPT_CHECKPOINT_SECTIONREAD;
 		res_lib_ckpt_sectionread.header.error = error;
@@ -2173,7 +2173,7 @@
 	req_exec_ckpt_checkpointopen.header.id = MESSAGE_REQ_EXEC_CKPT_CHECKPOINTOPEN;
 
 	req_exec_ckpt_checkpointopen.source.conn_info = conn_info;
-	req_exec_ckpt_checkpointopen.source.in_addr.s_addr = this_ip.sin_addr.s_addr;
+	req_exec_ckpt_checkpointopen.source.in_addr.s_addr = this_ip->sin_addr.s_addr;
 
 	memcpy (&req_exec_ckpt_checkpointopen.req_lib_ckpt_checkpointopen,
 		req_lib_ckpt_checkpointopen,
@@ -2209,7 +2209,7 @@
 	req_exec_ckpt_checkpointclose.header.id = MESSAGE_REQ_EXEC_CKPT_CHECKPOINTCLOSE;
 
 	req_exec_ckpt_checkpointclose.source.conn_info = conn_info;
-	req_exec_ckpt_checkpointclose.source.in_addr.s_addr = this_ip.sin_addr.s_addr;
+	req_exec_ckpt_checkpointclose.source.in_addr.s_addr = this_ip->sin_addr.s_addr;
 
 	memcpy (&req_exec_ckpt_checkpointclose.checkpointName,
 		&checkpoint->name, sizeof (SaNameT));
@@ -2235,7 +2235,7 @@
 	req_exec_ckpt_checkpointunlink.header.id = MESSAGE_REQ_EXEC_CKPT_CHECKPOINTUNLINK;
 
 	req_exec_ckpt_checkpointunlink.source.conn_info = conn_info;
-	req_exec_ckpt_checkpointunlink.source.in_addr.s_addr = this_ip.sin_addr.s_addr;
+	req_exec_ckpt_checkpointunlink.source.in_addr.s_addr = this_ip->sin_addr.s_addr;
 
 	memcpy (&req_exec_ckpt_checkpointunlink.req_lib_ckpt_checkpointunlink,
 		req_lib_ckpt_checkpointunlink,
@@ -2260,7 +2260,7 @@
 	req_exec_ckpt_checkpointretentiondurationset.header.size = sizeof (struct req_exec_ckpt_checkpointretentiondurationset);
 
 	req_exec_ckpt_checkpointretentiondurationset.source.conn_info = conn_info;
-	req_exec_ckpt_checkpointretentiondurationset.source.in_addr.s_addr = this_ip.sin_addr.s_addr;
+	req_exec_ckpt_checkpointretentiondurationset.source.in_addr.s_addr = this_ip->sin_addr.s_addr;
 
 	memcpy (&req_exec_ckpt_checkpointretentiondurationset.checkpointName,
 		&req_lib_ckpt_checkpointretentiondurationset->checkpointName,
@@ -2352,7 +2352,7 @@
 		sizeof (SaNameT));
 
 	req_exec_ckpt_sectioncreate.source.conn_info = conn_info;
-	req_exec_ckpt_sectioncreate.source.in_addr.s_addr = this_ip.sin_addr.s_addr;
+	req_exec_ckpt_sectioncreate.source.in_addr.s_addr = this_ip->sin_addr.s_addr;
 
 	iovecs[0].iov_base = (char *)&req_exec_ckpt_sectioncreate;
 	iovecs[0].iov_len = sizeof (req_exec_ckpt_sectioncreate);
@@ -2405,7 +2405,7 @@
 		sizeof (struct req_lib_ckpt_sectiondelete));
 
 	req_exec_ckpt_sectiondelete.source.conn_info = conn_info;
-	req_exec_ckpt_sectiondelete.source.in_addr.s_addr = this_ip.sin_addr.s_addr;
+	req_exec_ckpt_sectiondelete.source.in_addr.s_addr = this_ip->sin_addr.s_addr;
 
 	iovecs[0].iov_base = (char *)&req_exec_ckpt_sectiondelete;
 	iovecs[0].iov_len = sizeof (req_exec_ckpt_sectiondelete);
@@ -2444,7 +2444,7 @@
 		sizeof (struct req_lib_ckpt_sectionexpirationtimeset));
 
 	req_exec_ckpt_sectionexpirationtimeset.source.conn_info = conn_info;
-	req_exec_ckpt_sectionexpirationtimeset.source.in_addr.s_addr = this_ip.sin_addr.s_addr;
+	req_exec_ckpt_sectionexpirationtimeset.source.in_addr.s_addr = this_ip->sin_addr.s_addr;
 
 	iovecs[0].iov_base = (char *)&req_exec_ckpt_sectionexpirationtimeset;
 	iovecs[0].iov_len = sizeof (req_exec_ckpt_sectionexpirationtimeset);
@@ -2491,7 +2491,7 @@
 		sizeof (SaNameT));
 
 	req_exec_ckpt_sectionwrite.source.conn_info = conn_info;
-	req_exec_ckpt_sectionwrite.source.in_addr.s_addr = this_ip.sin_addr.s_addr;
+	req_exec_ckpt_sectionwrite.source.in_addr.s_addr = this_ip->sin_addr.s_addr;
 
 	iovecs[0].iov_base = (char *)&req_exec_ckpt_sectionwrite;
 	iovecs[0].iov_len = sizeof (req_exec_ckpt_sectionwrite);
@@ -2538,7 +2538,7 @@
 		sizeof (SaNameT));
 
 	req_exec_ckpt_sectionoverwrite.source.conn_info = conn_info;
-	req_exec_ckpt_sectionoverwrite.source.in_addr.s_addr = this_ip.sin_addr.s_addr;
+	req_exec_ckpt_sectionoverwrite.source.in_addr.s_addr = this_ip->sin_addr.s_addr;
 
 	iovecs[0].iov_base = (char *)&req_exec_ckpt_sectionoverwrite;
 	iovecs[0].iov_len = sizeof (req_exec_ckpt_sectionoverwrite);
@@ -2582,7 +2582,7 @@
 		sizeof (SaNameT));
 
 	req_exec_ckpt_sectionread.source.conn_info = conn_info;
-	req_exec_ckpt_sectionread.source.in_addr.s_addr = this_ip.sin_addr.s_addr;
+	req_exec_ckpt_sectionread.source.in_addr.s_addr = this_ip->sin_addr.s_addr;
 
 	iovecs[0].iov_base = (char *)&req_exec_ckpt_sectionread;
 	iovecs[0].iov_len = sizeof (req_exec_ckpt_sectionread);
diff -uNr --exclude=SCCS --exclude=BitKeeper --exclude=ChangeSet --exclude=init --exclude=LICENSE --exclude=Makefile --exclude=man --exclude=README.devmap --exclude=SECURITY --exclude=TODO --exclude=CHANGELOG --exclude=conf --exclude=loc --exclude=Makefile.samples --exclude=QUICKSTART --exclude=test --exclude=.cdtproject --exclude=.project latest/exec/clm.c openais/exec/clm.c
--- latest/exec/clm.c	2005-03-28 17:38:29.000000000 -0600
+++ openais/exec/clm.c	2005-03-28 18:26:32.000000000 -0600
@@ -199,11 +199,11 @@
 	/*
 	 * Build local cluster node data structure
 	 */
-	thisClusterNode.nodeId = this_ip.sin_addr.s_addr;
-	memcpy (&thisClusterNode.nodeAddress.value, &this_ip.sin_addr,
+	thisClusterNode.nodeId = this_ip->sin_addr.s_addr;
+	memcpy (&thisClusterNode.nodeAddress.value, &this_ip->sin_addr,
 		sizeof (struct in_addr));
 	thisClusterNode.nodeAddress.length = sizeof (struct in_addr);
-	strcpy (thisClusterNode.nodeName.value, (char *)inet_ntoa (this_ip.sin_addr));
+	strcpy (thisClusterNode.nodeName.value, (char *)inet_ntoa (this_ip->sin_addr));
 	thisClusterNode.nodeName.length = strlen (thisClusterNode.nodeName.value);
 	thisClusterNode.member = 1;
 	{
diff -uNr --exclude=SCCS --exclude=BitKeeper --exclude=ChangeSet --exclude=init --exclude=LICENSE --exclude=Makefile --exclude=man --exclude=README.devmap --exclude=SECURITY --exclude=TODO --exclude=CHANGELOG --exclude=conf --exclude=loc --exclude=Makefile.samples --exclude=QUICKSTART --exclude=test --exclude=.cdtproject --exclude=.project latest/exec/main.c openais/exec/main.c
--- latest/exec/main.c	2005-03-28 17:38:29.000000000 -0600
+++ openais/exec/main.c	2005-03-28 18:17:29.000000000 -0600
@@ -163,7 +163,7 @@
 	return;
 }
 
-struct sockaddr_in this_ip;
+struct sockaddr_in *this_ip;
 #define LOCALHOST_IP inet_addr("127.0.0.1")
 
 char *socketname = "libais.socket";
@@ -1003,8 +1003,9 @@
 		0,
 		deliver_fn, confchg_fn);
 	
-	memcpy (&this_ip, &openais_config.interfaces[0].boundto,
-		sizeof (struct sockaddr_in));
+//	memcpy (&this_ip, &openais_config.interfaces[0].boundto,
+//		sizeof (struct sockaddr_in));
+	this_ip = &openais_config.interfaces[0].boundto;
 
 	/*
 	 * Drop root privleges to user 'ais'
diff -uNr --exclude=SCCS --exclude=BitKeeper --exclude=ChangeSet --exclude=init --exclude=LICENSE --exclude=Makefile --exclude=man --exclude=README.devmap --exclude=SECURITY --exclude=TODO --exclude=CHANGELOG --exclude=conf --exclude=loc --exclude=Makefile.samples --exclude=QUICKSTART --exclude=test --exclude=.cdtproject --exclude=.project latest/exec/main.h openais/exec/main.h
--- latest/exec/main.h	2005-03-28 17:38:29.000000000 -0600
+++ openais/exec/main.h	2005-03-28 18:15:42.000000000 -0600
@@ -113,7 +113,7 @@
 	struct ais_ci ais_ci;	/* libais connection information */
 };
 
-extern struct sockaddr_in this_ip;
+extern struct sockaddr_in *this_ip;
 
 poll_handle aisexec_poll_handle;
 
diff -uNr --exclude=SCCS --exclude=BitKeeper --exclude=ChangeSet --exclude=init --exclude=LICENSE --exclude=Makefile --exclude=man --exclude=README.devmap --exclude=SECURITY --exclude=TODO --exclude=CHANGELOG --exclude=conf --exclude=loc --exclude=Makefile.samples --exclude=QUICKSTART --exclude=test --exclude=.cdtproject --exclude=.project latest/exec/totemsrp.c openais/exec/totemsrp.c
--- latest/exec/totemsrp.c	2005-03-28 17:38:29.000000000 -0600
+++ openais/exec/totemsrp.c	2005-03-28 18:53:41.000000000 -0600
@@ -4,8 +4,7 @@
 int log_digest = 0;
 int last_released = 0;
 int set_aru = -1;
-int totemsrp_brake;
-		
+int totemsrp_brake;	
 /*
  * Copyright (c) 2003-2004 MontaVista Software, Inc.
  *
@@ -105,6 +104,8 @@
 #define PACKET_SIZE_MAX					2000
 #define FAIL_TO_RECV_CONST				250
 #define SEQNO_UNCHANGED_CONST			20
+#define TIMEOUT_DOWNCHECK           	1000
+
 
 /*
  * we compare incoming messages to determine if their endian is
@@ -230,6 +231,7 @@
  * Multicast address
  */
 struct sockaddr_in sockaddr_in_mcast;
+struct sockaddr_in sockaddr_in_mcast_startup;
 
 struct totemsrp_socket {
 	int mcast;
@@ -272,6 +274,9 @@
 
 poll_timer_handle memb_timer_state_commit_timeout = 0;
 
+poll_timer_handle timer_netif_check_timeout = 0;
+
+
 /*
  * Function called when new message received
  */
@@ -431,6 +436,10 @@
 		int joined_list_entries,
 	struct memb_ring_id *ring_id) = 0;
 
+static struct totem_interface *totemsrp_interfaces;
+static int totemsrp_interface_count;
+
+
 /*
  * forward decls
  */
@@ -447,10 +456,31 @@
 static void memb_ring_id_create_or_load (struct memb_ring_id *);
 static int recv_handler (poll_handle handle, int fd, int revents, void *data, unsigned int *prio);
 static int netif_determine (struct sockaddr_in *bindnet, struct sockaddr_in *bound_to);
+static int loopback_determine (struct sockaddr_in *bound_to);
+static void netif_down_check (void);
+static int interface_up;
+
+#define NETIF_STATE_REPORT_UP		1	
+#define NETIF_STATE_REPORT_DOWN		2
+
+#define BIND_STATE_UNBOUND	0
+#define BIND_STATE_REGULAR	1
+#define BIND_STATE_LOOPBACK	2
+
+int netif_state_report = NETIF_STATE_REPORT_UP | NETIF_STATE_REPORT_DOWN;
+
+int netif_bind_state = BIND_STATE_UNBOUND;
+
 static int totemsrp_build_sockets (struct sockaddr_in *sockaddr_mcast,
 	struct sockaddr_in *sockaddr_bindnet,
 	struct totemsrp_socket *sockets,
 	struct sockaddr_in *bound_to);
+
+static int totemsrp_build_sockets_loopback (struct sockaddr_in *sockaddr_mcast,
+	struct sockaddr_in *sockaddr_bindnet,
+	struct totemsrp_socket *sockets,
+	struct sockaddr_in *bound_to);
+
 static void memb_state_gather_enter (void);
 static void messages_deliver_to_app (int skip, int *start_point, int end_point);
 static int orf_token_mcast (struct orf_token *oken,
@@ -550,9 +580,6 @@
 		struct memb_ring_id *ring_id))
 {
 
-	int res;
-	int interface_no;
-
 	/*
 	 * Initialize random number generator for later use to generate salt
 	 */
@@ -566,6 +593,7 @@
 	 * Initialize local variables for totemsrp
 	 */
 	memcpy (&sockaddr_in_mcast, sockaddr_mcast, sizeof (struct sockaddr_in));
+	memcpy (&sockaddr_in_mcast_startup, &sockaddr_in_mcast, sizeof (struct sockaddr_in));
 	memset (&next_memb, 0, sizeof (struct sockaddr_in));
 	memset (iov_buffer, 0, PACKET_SIZE_MAX);
 
@@ -581,40 +609,11 @@
 	sq_init (&recovery_sort_queue,
 		QUEUE_RTR_ITEMS_SIZE_MAX, sizeof (struct sort_queue_item), 0);
 
-	/*
-	 * Build sockets for every interface
-	 */
-	for (interface_no = 0; interface_no < interface_count; interface_no++) {
-		/*
-		 * Create and bind the multicast and unicast sockets
-		 */
-		res = totemsrp_build_sockets (sockaddr_mcast,
-			&interfaces[interface_no].bindnet,
-			&totemsrp_sockets[interface_no],
-			&interfaces[interface_no].boundto);
-
-		if (res == -1) {
-			return (res);
-		}
-		totemsrp_poll_handle = poll_handle;
-
-		poll_dispatch_add (*totemsrp_poll_handle, totemsrp_sockets[interface_no].mcast,
-			POLLIN, 0, recv_handler, UINT_MAX);
+	totemsrp_interfaces = interfaces;
+	totemsrp_interface_count = interface_count;
+	totemsrp_poll_handle = poll_handle;
 
-		poll_dispatch_add (*totemsrp_poll_handle, totemsrp_sockets[interface_no].token,
-			POLLIN, 0, recv_handler, UINT_MAX);
-	}
-
-	memcpy (&my_id, &interfaces->boundto, sizeof (struct sockaddr_in));
-
-	/*
-	 * This stuff depends on totemsrp_build_sockets
-	 */
-	my_memb_list[0].s_addr = interfaces->boundto.sin_addr.s_addr;
-
-	memb_ring_id_create_or_load (&my_ring_id);
-	totemsrp_log_printf (totemsrp_log_level_notice, "Created or loaded sequence id %lld.%s for this ring.\n",
-		my_ring_id.seq, inet_ntoa (my_ring_id.rep));
+	netif_down_check();
 
 	memb_state_gather_enter ();
 
@@ -919,6 +918,7 @@
 		"The token was lost in state %d from timer %x\n", memb_state, data);
 	switch (memb_state) {
 		case MEMB_STATE_OPERATIONAL:
+			netif_down_check();	
 			memb_state_gather_enter ();
 			break;
 
@@ -1594,6 +1594,8 @@
 	int i;
 	in_addr_t mask_addr;
 
+	interface_up = 0;
+
 	/*
 	 * Generate list of local interfaces in ifc.ifc_req structure
 	 */
@@ -1601,7 +1603,7 @@
 	ifc.ifc_buf = 0;
 	do {
 		numreqs += 32;
-		ifc.ifc_len = sizeof (struct ifreq) * numreqs;
+		ifc.ifc_len = sizeof (struct ifreq) * numreqs * 10;
 		ifc.ifc_buf = (void *)realloc(ifc.ifc_buf, ifc.ifc_len);
 		res = ioctl (id_fd, SIOCGIFCONF, &ifc);
 		if (res < 0) {
@@ -1624,6 +1626,12 @@
 
 			bound_to->sin_addr.s_addr = sockaddr_in->sin_addr.s_addr;
 			res = i;
+
+			if (ioctl(id_fd, SIOCGIFFLAGS, &ifc.ifc_ifcu.ifcu_req[i]) < 0) {
+				printf ("couldn't do ioctl\n");
+			}
+
+			interface_up = ifc.ifc_ifcu.ifcu_req[i].ifr_ifru.ifru_flags & IFF_UP;
 			break; /* for */
 		}
 	}
@@ -1633,6 +1641,197 @@
 	return (res);
 }
 
+static int loopback_determine (struct sockaddr_in *bound_to)
+{
+
+	bound_to->sin_addr.s_addr = LOCALHOST_IP;
+	if (bound_to->sin_addr.s_addr == 0) {
+		return -1;
+	}
+	return 1;
+}
+
+
+int firstrun = 0;
+/*
+ * If the interface is up, the sockets for gmi are built.  If the interface is down
+ * this function is requeued in the timer list to retry building the sockets later.
+ */
+static void timer_function_netif_check_timeout ()
+{
+	int res;
+	int interface_no;
+
+	/*
+	* Build sockets for every interface
+	*/
+	for (interface_no = 0; interface_no < totemsrp_interface_count; interface_no++) {
+
+		netif_determine(&totemsrp_interfaces[interface_no].bindnet,
+						&totemsrp_interfaces[interface_no].boundto);
+
+		if ((netif_bind_state & BIND_STATE_LOOPBACK) && (!interface_up)) {
+			break;
+		}
+	
+		if (totemsrp_sockets[interface_no].mcast > 0) {
+			close (totemsrp_sockets[interface_no].mcast);
+		 	poll_dispatch_delete (*totemsrp_poll_handle,
+			totemsrp_sockets[interface_no].mcast);
+		}
+		if (totemsrp_sockets[interface_no].token > 0) {
+			close (totemsrp_sockets[interface_no].token);
+			poll_dispatch_delete (*totemsrp_poll_handle,
+			totemsrp_sockets[interface_no].token);
+		}
+
+		if (!interface_up) {
+			totemsrp_log_printf (totemsrp_log_level_notice,"Interface is down binding to LOOPBACK addr.\n");
+			netif_bind_state = BIND_STATE_LOOPBACK;
+			res = totemsrp_build_sockets_loopback(&sockaddr_in_mcast,
+					&totemsrp_interfaces[interface_no].bindnet,
+					&totemsrp_sockets[interface_no],
+					&totemsrp_interfaces[interface_no].boundto);
+
+			poll_dispatch_add (*totemsrp_poll_handle, totemsrp_sockets[interface_no].token,
+					POLLIN, 0, recv_handler, UINT_MAX);
+
+			continue;
+		}
+
+		netif_bind_state = BIND_STATE_REGULAR;
+		memcpy(&sockaddr_in_mcast,&sockaddr_in_mcast_startup, sizeof (struct sockaddr_in));
+
+		/*
+		* Create and bind the multicast and unicast sockets
+		*/
+		res = totemsrp_build_sockets (&sockaddr_in_mcast,
+			&totemsrp_interfaces[interface_no].bindnet,
+			&totemsrp_sockets[interface_no],
+			&totemsrp_interfaces[interface_no].boundto);
+
+		poll_dispatch_add (*totemsrp_poll_handle, totemsrp_sockets[interface_no].mcast,
+			POLLIN, 0, recv_handler, UINT_MAX);
+
+		poll_dispatch_add (*totemsrp_poll_handle, totemsrp_sockets[interface_no].token,
+			POLLIN, 0, recv_handler, UINT_MAX);
+	}
+
+	memcpy (&my_id, &totemsrp_interfaces->boundto, sizeof (struct sockaddr_in));	
+	/*
+	* This stuff depends on totemsrp_build_sockets
+	*/
+	if (firstrun == 0) {
+		firstrun += 1;
+		my_memb_list[0].s_addr =
+			totemsrp_interfaces->boundto.sin_addr.s_addr;
+		memb_ring_id_create_or_load (&my_ring_id);
+		totemsrp_log_printf (totemsrp_log_level_notice, "Created or loaded sequence id %lld.%s for this ring.\n",
+		my_ring_id.seq, inet_ntoa (my_ring_id.rep));
+	}
+
+	if (interface_up) {
+		if (netif_state_report & NETIF_STATE_REPORT_UP) {
+			totemsrp_log_printf (totemsrp_log_level_notice,
+				" The network interface is now up.\n");		
+			netif_state_report = NETIF_STATE_REPORT_DOWN;
+			memb_state_gather_enter ();
+		}
+		/*
+		 * If this is a single processor, detect downs which may not 
+		 * be detected by token loss when the interface is downed
+		 */
+		if (my_memb_entries <= 1) {
+			poll_timer_add (*totemsrp_poll_handle, TIMEOUT_DOWNCHECK, (void *)1,
+				timer_function_netif_check_timeout,
+				&timer_netif_check_timeout);
+		}
+	} else {		
+		if (netif_state_report & NETIF_STATE_REPORT_DOWN) {
+			totemsrp_log_printf (totemsrp_log_level_notice,
+				"The network interface is down.\n");
+			memb_state_gather_enter ();
+		}
+		netif_state_report = NETIF_STATE_REPORT_UP;
+
+		/*
+		* Add a timer to retry building interfaces and request memb_gather_enter
+		*/
+		cancel_token_retransmit_timeout ();
+		cancel_token_timeout ();
+		poll_timer_add (*totemsrp_poll_handle, TIMEOUT_DOWNCHECK, (void *)1,
+			timer_function_netif_check_timeout,
+			&timer_netif_check_timeout);
+	}
+}
+
+
+/*
+ * Check if an interface is down and reconfigure
+ * totemsrp waiting for it to come back up
+ */
+static void netif_down_check (void)
+{
+	timer_function_netif_check_timeout ();
+}
+
+static int totemsrp_build_sockets_loopback (struct sockaddr_in *sockaddr_mcast,
+    struct sockaddr_in *sockaddr_bindnet,
+    struct totemsrp_socket *sockets,
+    struct sockaddr_in *bound_to)
+{
+	struct ip_mreq mreq;
+	struct sockaddr_in sockaddr_in;
+	int res;
+
+	memset (&mreq, 0, sizeof (struct ip_mreq));
+
+	/*
+	 * Determine the ip address bound to and the interface name
+	 */
+	res = loopback_determine (bound_to);
+
+	if (res == -1) {
+		return (-1);
+	}
+
+	/* TODO this should be somewhere else */
+	memb_local_sockaddr_in.sin_addr.s_addr = bound_to->sin_addr.s_addr;
+	memb_local_sockaddr_in.sin_family = AF_INET;
+	memb_local_sockaddr_in.sin_port = sockaddr_mcast->sin_port;
+
+	sockaddr_in.sin_family = AF_INET;
+	sockaddr_in.sin_port = sockaddr_mcast->sin_port;
+
+	 /*
+	 * Setup unicast socket
+	 */
+	sockets->token = socket (AF_INET, SOCK_DGRAM, 0);
+	if (sockets->token == -1) {
+		perror ("socket2");
+		return (-1);
+	}
+
+	/*
+	 * Bind to unicast socket used for token send/receives	
+	 * This has the side effect of binding to the correct interface
+	 */
+	sockaddr_in.sin_addr.s_addr = bound_to->sin_addr.s_addr;
+	res = bind (sockets->token, (struct sockaddr *)&sockaddr_in,
+				sizeof (struct sockaddr_in));
+	if (res == -1) {
+		perror ("bind2 failed");
+		return (-1);
+	}
+
+	//memcpy(sockaddr_mcast, &sockaddr_in, sizeof(struct sockaddr_in));
+	memcpy(&sockaddr_in_mcast, &sockaddr_in, sizeof(struct sockaddr_in));
+	sockets->mcast = sockets->token;
+
+	return (0);
+}
+
+
 static int totemsrp_build_sockets (struct sockaddr_in *sockaddr_mcast,
 	struct sockaddr_in *sockaddr_bindnet,
 	struct totemsrp_socket *sockets,
@@ -1804,6 +2003,7 @@
 	 * Multicast message
 	 */
 	res = sendmsg (totemsrp_sockets[0].mcast, &msg_mcast, MSG_NOSIGNAL | MSG_DONTWAIT);
+	
 	if (res == -1) {
 		return (-1);
 	}
@@ -2123,8 +2323,6 @@
 	msg_orf_token.msg_flags = 0;
 	
 	res = sendmsg (totemsrp_sockets[0].token, &msg_orf_token, MSG_NOSIGNAL);
-	assert (res != -1);
-	assert (res == orf_token_retransmit_size);
 }
 
 /*
@@ -2212,8 +2410,6 @@
 			inet_ntoa (next_memb.sin_addr), 
 			strerror (errno), totemsrp_sockets[0].token);
 	}
-	assert (res != -1);
-	assert (res == iov_encrypted.iov_len);
 	
 	/*
 	 * res not used here errors are handled by algorithm
@@ -2302,7 +2498,6 @@
 	msghdr.msg_flags = 0;
 
 	res = sendmsg (totemsrp_sockets[0].token, &msghdr, MSG_NOSIGNAL | MSG_DONTWAIT);
-	assert (res != -1);
 	return (res);
 }
 
@@ -2395,7 +2590,6 @@
 	msghdr.msg_flags = 0;
 
 	res = sendmsg (totemsrp_sockets[0].mcast, &msghdr, MSG_NOSIGNAL | MSG_DONTWAIT);
-
 	return (res);
 }
 
@@ -2458,7 +2652,6 @@
 	}
 	
 	memb_ring_id->rep.s_addr = my_id.sin_addr.s_addr;
-	assert (memb_ring_id->rep.s_addr);
 	token_ring_id_seq = memb_ring_id->seq;
 }
 

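For reference, here is a minimal sketch of why evt can still trip over a stale node id after this_ip becomes a pointer. It is not code from the tree: boundto, cluster_node, clm_init_sketch and clm_refresh_sketch are hypothetical stand-ins for openais_config.interfaces[0].boundto, thisClusterNode and clm_init. The point is only that a value copied once at init time does not follow later updates made through the pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <arpa/inet.h>
#include <netinet/in.h>

/* Hypothetical stand-in for openais_config.interfaces[0].boundto. */
struct sockaddr_in boundto;

/* After the patch, this_ip points at the interface data instead of copying it. */
struct sockaddr_in *this_ip = &boundto;

/* Hypothetical stand-in for the CLM node record; node_id is a snapshot. */
struct cluster_node {
	in_addr_t node_id;
};

struct cluster_node this_cluster_node;

/* Mirrors what clm_init does in the patch: the address is copied once. */
void clm_init_sketch (void)
{
	assert (this_ip != NULL);
	this_cluster_node.node_id = this_ip->sin_addr.s_addr;
}

/* Hypothetical refresh hook: re-read the address after a rebind. */
void clm_refresh_sketch (void)
{
	assert (this_ip != NULL);
	this_cluster_node.node_id = this_ip->sin_addr.s_addr;
}

/* Returns 1 when the stored node id no longer matches the bound address. */
int snapshot_is_stale (void)
{
	return (this_cluster_node.node_id != this_ip->sin_addr.s_addr);
}
```

If boundto is set to 192.168.1.5 before clm_init_sketch and then rewritten to 127.0.0.1 by the rebind, snapshot_is_stale returns 1 until something like clm_refresh_sketch runs. Whether the real fix is a refresh hook or reading through this_ip on every lookup is a design choice this sketch does not settle.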


_______________________________________________
Openais mailing list
Openais@lists.osdl.org
http://lists.osdl.org/mailman/listinfo/openais

