
List:       openais
Subject:    RE: [Openais] 932 - take 10
From:       "Muni Bajpai" <muniba () nortel ! com>
Date:       2005-11-23 21:46:01
Message-ID: CFCE7C3BDB79204092974B5B50AD719401FC45B5 () zrc2hxm0 ! corp ! nortel ! com

Running a 48-hour traffic test now. It has been going for an hour, so I
will let you know how it turns out.

The patch looks good to me, and again, thanks for doing this on your
vacation.

Muni

-----Original Message-----
From: openais-bounces@lists.osdl.org
[mailto:openais-bounces@lists.osdl.org] On Behalf Of Steven Dake
Sent: Wednesday, November 23, 2005 12:19 PM
To: Bajpai, Muni [RICH1:B670:EXCH]
Cc: openais@lists.osdl.org; Smith, Kristen [RICH1:B670:EXCH]
Subject: [Openais] 932 - take 10


Muni
I noticed all comparisons are either less-than or less-than-or-equal, so
I wrote two functions to handle the comparison operations.  This
radically simplifies the patch and makes it look a lot cleaner.  It may
also fix the 4-processor bug you noticed below.  I was never happy with
the previous comparison junk that was scattered all over the patch.

I fixed the messages_free assert.

I also did a final audit of the comparison operations and ensured that
all comparison operations use the lt/lte functions when comparing any
kind of sequence number.  This added a few comparisons in the recovery
phase.  I also double- and triple-checked that I didn't mess up the
operations while adding the function calls.  A quadruple check from you
would be much appreciated :)

Please report if you still hit the 4-processor bug.  Also, please
report if one of the processors that "locks up" prints "zero" very
rapidly.  You can test this by pressing a key on that node's console:
if the output scrolls too fast to read (on the order of "zero" printed
several hundred times a second or faster), then zero is being printed
very fast :)

I don't have 4 machines to test with as I am on vacation and at home.

I have run this patch all last night and this morning without any
problems.

When testing the 4-processor operation, make sure the random drop
settings are the same as in the version that exhibits the bug, and that
the start points are the same.

Also, what is your ring identifier?  If it is very large there may be
rollover bugs with the ring identifier...  But this value is 64 bits, so
a rollover is unlikely.

Regards
-steve


On Wed, 2005-11-23 at 09:52, Muni Bajpai wrote:
> Steve,
> 
> This patch doesn't seem right when applied to picacho.
> 2 Issues
> 
> 1.) Bring up 4 nodes with RANDOM_DROP in Makefile
> 2.) Start ckptbench on node 1. After some time, nodes 3 and 4
> partition and become 1-node clusters. Even if you ctrl-c and restart
> aisexec, they still remain 1-node clusters. The only way to re-form is
> to kill all 4 and restart.
> 
> The 2nd issue is a segfault that I can't seem to reproduce on demand
> 
> aisexec: totemsrp.c:1921: messages_free: Assertion `range < 1024' failed.
> (gdb) bt
> #0  0xb74c3cdf in raise () from /lib/tls/libc.so.6
> #1  0xb74c54e5 in abort () from /lib/tls/libc.so.6
> #2  0xb74bd609 in __assert_fail () from /lib/tls/libc.so.6
> #3  0x080604c1 in messages_free (instance=0x80bbd48, token_aru=6)
>     at sq.h:235
> #4  0x0805ca79 in message_handler_orf_token (instance=0x80bbd48,
>     system_from=0xbfffe0f4, msg=0x80d587c, msg_len=42,
>     endian_conversion_needed=0) at totemsrp.c:2812
> #5  0x0805f802 in main_deliver_fn (context=0x80bbd48,
>     system_from=0xbfffe0f4, msg=0x80d587c, msg_len=5728)
>     at totemsrp.c:3662
> #6  0x08066215 in active_token_recv (instance=0x80bb158, interface_no=0,
>     context=0x80bbd48, system_from=0xbfffe0f4, msg=0x80d587c,
>     msg_len=42, token_seqid=0) at totemrrp.c:482
> #7  0x08066344 in rrp_deliver_fn (context=0x80bb220,
>     system_from=0xbfffe0f4, msg=0x80d587c, msg_len=42) at totemrrp.c:542
> #8  0x08067ed0 in net_deliver_fn (handle=0, fd=4, revents=1,
>     data=0x80d5250, prio=0x0) at totemnet.c:688
> #9  0x0805be85 in poll_run (handle=0) at aispoll.c:433
> #10 0x0804a240 in main (argc=1, argv=0xbfffe374) at main.c:1198
> 
> Thanks
> 
> Muni
> 
> -----Original Message-----
> From: Steven Dake [mailto:scd@broked.org]
> Sent: Tuesday, November 22, 2005 12:44 PM
> To: Bajpai, Muni [RICH1:B670:EXCH]
> Cc: sdake@mvista.com; openais@lists.osdl.org; Smith, Kristen
> [RICH1:B670:EXCH]
> Subject: RE: [Openais] muni try this 932 take 8
> 
> Muni,
> 
> Indeed, I think you're right: it must be release_checkpoint which is
> corrupting the list.  The iteration used is not safe against deletion
> of entries.
> 
> Another option, which keeps the code as is, is to use a deletion-safe
> iteration.
> 
> I think we can go ahead with this patch as is, though.
> Thanks
> -steve
>    
> On Tue, 2005-11-22 at 09:56, Muni Bajpai wrote:
> > Hey Steve,
> > 
> > So after some intense thinking :) I don't think it is possible for
> > the index to be out of bounds.
> > 
> > I think the real issue here is that it is possible to remove an
> > element from the list via checkpoint_release while iterating through
> > the list.
> > 
> > So I think we should separate the cleanup from this iteration.  It
> > adds one more pass over the list but is safer.
> > 
> > Please review the patch
> > 
> > Thanks
> > 
> > Muni
> > 
> > -----Original Message-----
> > From: Steven Dake [mailto:sdake@mvista.com]
> > Sent: Monday, November 21, 2005 12:57 PM
> > To: Bajpai, Muni [RICH1:B670:EXCH]
> > Cc: Smith, Kristen [RICH1:B670:EXCH]
> > Subject: RE: [Openais] muni try this 932 take 8
> > 
> > Muni,
> > ckpt_confchg_fn is indeed called with TOTEM_CONFIGURATION_TRANSITIONAL.
> > Look at the previous call traces: they all call it, and the only way
> > to get into ckpt_recovery_process_members_exit is with a transitional
> > configuration.
> > 
> > Then note that ckpt_recovery_process_members_exit is called.  I
> > suspect that this function is in some way corrupting the stack,
> > including left_list and configuration_type in the previous stack
> > frame.
> > 
> > Is it possible the call
> >        memset((char*)&checkpoint->ckpt_refcount[index].addr, 0,
> >               sizeof(struct in_addr));
> > 
> > is called with an index either negative or greater than the size of
> > ckpt_refcount?  That would explain the other refcounting segfault
> > bug I found.  I suggest putting an assert before that memset to see
> > if the code is behaving outside your expectations.
> > 
> > Were you running with RANDOM_DROP set?
> > 
> > Regards
> > -steve
> > 
> > On Mon, 2005-11-21 at 10:05 -0600, Muni Bajpai wrote:
> > > Steve,
> > > 
> > > This just doesn't make sense.  A segfault happened in the ckpt
> > > service after about 6 hours of traffic.
> > > 
> > > Things to note in the following trace:
> > > 1.) ckpt_recovery_process_members_exit is called from
> > > ckpt_confchg_fn ONLY if configuration_type ==
> > > TOTEM_CONFIGURATION_TRANSITIONAL, which according to the trace it
> > > is NOT ??????????????? It's a simple if check.
> > > 2.) The left_list pointer has changed from #1 to #0 ?????
> > > 
> > > See, the left_list pointer is clobbered even though there is no
> > > manipulation of that pointer in ckpt_confchg_fn.
> > > 
> > > The left_list array is initialized in memb_state_operational_enter
> > > and has function scope, hence will not change until the call
> > > returns.
> > > 
> > > I don't know how this is possible?
> > > 
> > > Thanks
> > > 
> > > Muni
> > > 
> > > #0  ckpt_recovery_process_members_exit (left_list=0x8394,
> > >     left_list_entries=3) at ckpt.c:566
> > > #1  0x08053795 in ckpt_confchg_fn
> > >     (configuration_type=TOTEM_CONFIGURATION_REGULAR,
> > >     member_list=0x80bdea0, member_list_entries=1,
> > >     left_list=0xbfffd360, left_list_entries=3, joined_list=0x0,
> > >     joined_list_entries=0, ring_id=0xd9bf682f) at ckpt.c:1127
> > > #2  0x0804ab2e in confchg_fn
> > >     (configuration_type=TOTEM_CONFIGURATION_TRANSITIONAL,
> > >     member_list=0x80bdea0, member_list_entries=1,
> > >     left_list=0xbfffd360, left_list_entries=3, joined_list=0x0,
> > >     joined_list_entries=0, ring_id=0x80bdf78) at main.c:903
> > > #3  0x08066397 in totempg_confchg_fn
> > >     (configuration_type=TOTEM_CONFIGURATION_TRANSITIONAL,
> > >     member_list=0x80bdea0, member_list_entries=1,
> > >     left_list=0xbfffd360, left_list_entries=3, joined_list=0x0,
> > >     joined_list_entries=0, ring_id=0x80bdf78) at totempg.c:239
> > > #4  0x0806856d in totemmrp_confchg_fn
> > >     (configuration_type=TOTEM_CONFIGURATION_TRANSITIONAL,
> > >     member_list=0x80bdea0, member_list_entries=1,
> > >     left_list=0xbfffd360, left_list_entries=3, joined_list=0x0,
> > >     joined_list_entries=0, ring_id=0x80bdf78) at totemmrp.c:94
> > > #5  0x08063b40 in memb_state_operational_enter (instance=0x80bdd48)
> > >     at totemsrp.c:1392
> > > #6  0x0805ef7f in message_handler_orf_token (instance=0x80bdd48,
> > >     system_from=0xbfffe174, msg=0x80d787c, msg_len=42,
> > >     endian_conversion_needed=0) at totemsrp.c:2971
> > > #7  0x0806144a in main_deliver_fn (context=0x80bdd48,
> > >     system_from=0xbfffe174, msg=0x80d787c, msg_len=34024)
> > >     at totemsrp.c:3653
> > > #8  0x08067e49 in active_token_recv (instance=0x80bd158,
> > >     interface_no=0, context=0x80bdd48, system_from=0xbfffe174,
> > >     msg=0x80d787c, msg_len=42, token_seqid=0) at totemrrp.c:482
> > > #9  0x08067f78 in rrp_deliver_fn (context=0x80bd220,
> > >     system_from=0xbfffe174, msg=0x80d787c, msg_len=42)
> > >     at totemrrp.c:542
> > > #10 0x08069b04 in net_deliver_fn (handle=0, fd=4, revents=1,
> > >     data=0x80d7250, prio=0x0) at totemnet.c:688
> > > #11 0x0805de25 in poll_run (handle=0) at aispoll.c:433
> > > #12 0x0804a243 in main (argc=1, argv=0xbfffe3f4) at main.c:1200
> > > 
> > > -----Original Message-----
> > > From: Steven Dake [mailto:sdake@mvista.com]
> > > Sent: Sunday, November 20, 2005 9:48 PM
> > > To: Bajpai, Muni [RICH1:B670:EXCH]
> > > Cc: Smith, Kristen [RICH1:B670:EXCH]
> > > Subject: RE: [Openais] muni try this 932 take 6
> > > 
> > > The protocol code in picacho and trunk is the same, so testing
> > > with either should be fine.  The reason I couldn't reproduce your
> > > issues is that I wasn't trying hard enough :)
> > > 
> > > Regards
> > > -steve
> > > On Sun, 2005-11-20 at 01:24 -0600, Muni Bajpai wrote:
> > > > Steve, thanks for the effort.
> > > > 
> > > > The reason I brought up the picacho vs. trunk issue is so that
> > > > we are on the same page; as we saw last week, there were some
> > > > asserts you couldn't reproduce readily with you being on trunk
> > > > and me being on picacho.
> > > > 
> > > > The problem is that I can't start testing with one and switch
> > > > to another, as time unfortunately is getting thinner with our
> > > > release deadline approaching.
> > > > 
> > > > So that's the spiel.
> > > > 
> > > > I just saw that you posted release 8.  Does this include the
> > > > patch for 969?
> > > > 
> > > > P.S. Will update the man pages further for defect 968.
> > > > 
> > > > Thanks
> > > > 
> > > > Muni
> > > > -----Original Message-----
> > > > From: Steven Dake [mailto:sdake@mvista.com]
> > > > Sent: Saturday, November 19, 2005 1:14 PM
> > > > To: Bajpai, Muni [RICH1:B670:EXCH]
> > > > Cc: Smith, Kristen [RICH1:B670:EXCH]
> > > > Subject: RE: [Openais] muni try this 932 take 6
> > > > 
> > > > Muni
> > > > I should have a new patch coming soon.  The one I sent has a
> > > > couple of bugs I found last night.
> > > > 
> > > > All patches are against trunk, but should apply to picacho
> > > > without much trouble.
> > > > 
> > > > I've been running for about 4 hours now without an assertion,
> > > > with random drop on.
> > > > 
> > > > Seems positive, at least :)
> > > > 
> > > > On Sat, 2005-11-19 at 12:38 -0600, Muni Bajpai wrote:
> > > > > Steve,
> > > > > 
> > > > > So should I apply/test this patch on trunk or picacho?  (I
> > > > > need to know which one to start testing with.)
> > > > > Also, should I apply 969 as well, since that seems relevant?
> > > > > 
> > > > > Thanks
> > > > > 
> > > > > Muni
> > > > > 
> > > > > -----Original Message-----
> > > > > From: openais-bounces@lists.osdl.org 
> > > > > [mailto:openais-bounces@lists.osdl.org] On Behalf Of Steven 
> > > > > Dake
> > > > > Sent: Friday, November 18, 2005 2:48 PM
> > > > > To: openais@lists.osdl.org
> > > > > Subject: [Openais] muni try this 932 take 6
> > > > > 
> > > > > 
> > > > > Muni
> > > > > I've worked on the 932 patch and found another bug.  It seems
> > > > > to be working better now for 3 nodes.
> > > > > 
> > > > > I have noticed one bug which I'm not sure how I will fix.
> > > > > Basically, during the RECOVERY phase, too many messages are in
> > > > > the recovery queue, overflowing it.  These messages should be
> > > > > processed at one time.
> > > > > 
> > > > > Please let me know if you hit any asserts, and which asserts
> > > > > you see.
> > > > > 
> > > > > I'll run this all weekend and work to get any problems sorted
> > > > > out over the weekend.
> > > > > 
> > > > > Regards
> > > > > -steve
> > > > 
> > > > 
> > > 
> > > 
> > 
> > 
> > 
> 
> 
> 
> 



_______________________________________________
Openais mailing list
Openais@lists.osdl.org
https://lists.osdl.org/mailman/listinfo/openais


