List: openais
Subject: RE: [Openais] muni try this 932 take 8
From: Steven Dake <scd () broked ! org>
Date: 2005-11-23 22:32:33
Message-ID: 1132785153.13091.40.camel () slickdeal ! broked ! org
Muni
This patch has been committed at revision 850 and ported to picacho at
revision 851.
On Tue, 2005-11-22 at 09:56, Muni Bajpai wrote:
> Hey Steve,
>
> So after some intense thinking :) I don't think it is possible that the
> index can be out of bounds.
>
> I think the real issue here is that an element can be removed from the
> list via checkpoint_release while the list is being iterated.
>
> So I think we should separate the cleanup from this iteration. It adds
> one more cycle of iteration but is safer.
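A minimal sketch of the two-pass approach proposed above, assuming a simplified singly linked checkpoint list. Only the name `checkpoint_release` comes from the thread; the struct fields, `process_members_exit` and the helpers `checkpoint_add`/`checkpoint_count` are hypothetical stand-ins, not the openais definitions. The first loop only updates reference counts and flags checkpoints; the second loop does the unlinking and reads `next` before each possible free.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical, simplified checkpoint list; not the openais definitions. */
struct checkpoint {
	struct checkpoint *next;
	int refcount;      /* references still held by cluster members */
	int departed_refs; /* references held by nodes that just left */
	int release_flag;  /* set in pass 1, acted on in pass 2 */
};

static struct checkpoint *checkpoint_list;

/* Hypothetical helper: pushes a new checkpoint on the front of the list. */
static struct checkpoint *checkpoint_add (int refcount, int departed_refs)
{
	struct checkpoint *ckpt = calloc (1, sizeof (*ckpt));

	assert (ckpt != NULL);
	ckpt->refcount = refcount;
	ckpt->departed_refs = departed_refs;
	ckpt->next = checkpoint_list;
	checkpoint_list = ckpt;
	return ckpt;
}

/* Hypothetical helper: counts the checkpoints still on the list. */
static int checkpoint_count (void)
{
	int n = 0;

	for (struct checkpoint *c = checkpoint_list; c != NULL; c = c->next) {
		n++;
	}
	return n;
}

/* Unlinks and frees one checkpoint: the step that invalidates an
 * iterator currently pointing at it. */
static void checkpoint_release (struct checkpoint *ckpt)
{
	struct checkpoint **link = &checkpoint_list;

	while (*link != NULL && *link != ckpt) {
		link = &(*link)->next;
	}
	if (*link == ckpt) {
		*link = ckpt->next;
		free (ckpt);
	}
}

static void process_members_exit (void)
{
	struct checkpoint *ckpt;
	struct checkpoint *next;

	/* Pass 1: adjust reference counts only; nothing is unlinked here. */
	for (ckpt = checkpoint_list; ckpt != NULL; ckpt = ckpt->next) {
		ckpt->refcount -= ckpt->departed_refs;
		ckpt->departed_refs = 0;
		if (ckpt->refcount <= 0) {
			ckpt->release_flag = 1;
		}
	}

	/* Pass 2: release flagged entries, saving next before ckpt is freed. */
	for (ckpt = checkpoint_list; ckpt != NULL; ckpt = next) {
		next = ckpt->next;
		if (ckpt->release_flag) {
			checkpoint_release (ckpt);
		}
	}
}
```

Saving `next` before the release is what keeps the second loop from touching freed memory; the separate first pass means the reference-count bookkeeping never walks a list that is being modified underneath it.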
>
> Please review the patch
>
> Thanks
>
> Muni
>
> -----Original Message-----
> From: Steven Dake [mailto:sdake@mvista.com]
> Sent: Monday, November 21, 2005 12:57 PM
> To: Bajpai, Muni [RICH1:B670:EXCH]
> Cc: Smith, Kristen [RICH1:B670:EXCH]
> Subject: RE: [Openais] muni try this 932 take 8
>
> Muni,
> ckpt_confchg_fn is indeed called with TOTEM_CONFIGURATION_TRANSITIONAL.
> Look at the earlier frames in the call trace: they all pass
> TOTEM_CONFIGURATION_TRANSITIONAL, and the only way to get into
> ckpt_recovery_process_members_exit is with a transitional configuration.
>
> Then note that ckpt_recovery_process_members_exit is called. I suspect
> that this function is in some way corrupting the stack including
> left_list and configuration_type from the previous stack frame.
>
> Is it possible the call
> memset((char*)&checkpoint->ckpt_refcount[index].addr, 0,
> sizeof(struct in_addr));
>
> is called with an index either negative or greater than the size of
> ckpt_refcount? This would explain that other refcounting segfault bug I
> found. I suggest putting an assert before that memset to see if the
> code is behaving out of your expectations.
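The suggested assert might look like the following sketch. The `ckpt_refcount` array, its `addr` field and the `memset` call come from the snippet above; the `struct ckpt_refcnt` layout, the array size `PROCESSOR_COUNT_MAX` and the wrapper `ckpt_refcount_clear` are hypothetical. The bound is computed from the array itself, so the check stays correct if the size changes.

```c
#include <assert.h>
#include <netinet/in.h>
#include <string.h>

#define PROCESSOR_COUNT_MAX 16 /* hypothetical array size */

/* Hypothetical layout; only ckpt_refcount[].addr appears in the thread. */
struct ckpt_refcnt {
	struct in_addr addr;
	int count;
};

struct checkpoint {
	struct ckpt_refcnt ckpt_refcount[PROCESSOR_COUNT_MAX];
};

static void ckpt_refcount_clear (struct checkpoint *checkpoint, int index)
{
	/* Trap a negative or too-large index before the memset writes
	 * outside the array and corrupts neighboring memory. */
	assert (index >= 0 &&
		(size_t)index < sizeof (checkpoint->ckpt_refcount) /
			sizeof (checkpoint->ckpt_refcount[0]));

	memset ((char *)&checkpoint->ckpt_refcount[index].addr, 0,
		sizeof (struct in_addr));
}
```

Because `assert` compiles away when `NDEBUG` is defined, this only catches the problem in builds that keep assertions enabled.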
>
> Were you running with RANDOM_DROP set?
>
> Regards
> -steve
>
> On Mon, 2005-11-21 at 10:05 -0600, Muni Bajpai wrote:
> > Steve,
> >
> > This just doesn't make sense. A segfault happened in the ckpt service
> > after about 6 hours of traffic.
> >
> > Things to note in the following trace
> > 1.) ckpt_recovery_process_members_exit is called from ckpt_confchg_fn
> > ONLY if configuration_type==TOTEM_CONFIGURATION_TRANSITIONAL which
> > according to the trace is NOT ??????????????? It's a simple if check.
> > 2.) The left_list pointer has changed from #1 to #0 ?????
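To make point 1 concrete, here is a hypothetical, reduced version of the guard being described; the enum ordering, the parameter list and the `members_exit_calls` counter are assumptions for illustration, not the ckpt.c code. Under such a guard, frame #0 can only be reached if `configuration_type` held TOTEM_CONFIGURATION_TRANSITIONAL when the check ran, so gdb later showing REGULAR in frame #1 suggests the value read from that frame no longer matches what was checked, which fits the stack-corruption suspicion in Steve's reply.

```c
#include <assert.h>

/* Reduced, hypothetical stand-ins for the names in the backtrace. */
enum totem_configuration_type {
	TOTEM_CONFIGURATION_REGULAR,
	TOTEM_CONFIGURATION_TRANSITIONAL
};

static int members_exit_calls; /* counts calls, for illustration only */

static void ckpt_recovery_process_members_exit (const unsigned int *left_list,
	int left_list_entries)
{
	(void)left_list;
	(void)left_list_entries;
	members_exit_calls++;
}

/* The "simple if check": only a transitional configuration reaches
 * the members-exit recovery path. */
static void ckpt_confchg_fn (enum totem_configuration_type configuration_type,
	const unsigned int *left_list, int left_list_entries)
{
	if (configuration_type == TOTEM_CONFIGURATION_TRANSITIONAL) {
		ckpt_recovery_process_members_exit (left_list, left_list_entries);
	}
}
```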
> >
> > See the left_list pointer is clobbered even though there is no
> > manipulation of that pointer in ckpt_confchg_fn.
> >
> > The left_list array is initialized in memb_state_operational_enter and
> > has function scope, hence it will not change until the call returns.
> >
> > I don't know how this is possible.
> >
> > Thanks
> >
> > Muni
> >
> > #0 ckpt_recovery_process_members_exit (left_list=0x8394,
> > left_list_entries=3) at ckpt.c:566
> > #1 0x08053795 in ckpt_confchg_fn
> > (configuration_type=TOTEM_CONFIGURATION_REGULAR,
> > member_list=0x80bdea0, member_list_entries=1, left_list=0xbfffd360,
> > left_list_entries=3, joined_list=0x0, joined_list_entries=0,
> > ring_id=0xd9bf682f) at ckpt.c:1127
> > #2 0x0804ab2e in confchg_fn
> > (configuration_type=TOTEM_CONFIGURATION_TRANSITIONAL,
> > member_list=0x80bdea0, member_list_entries=1, left_list=0xbfffd360,
> > left_list_entries=3, joined_list=0x0, joined_list_entries=0,
> > ring_id=0x80bdf78) at main.c:903
> > #3 0x08066397 in totempg_confchg_fn
> > (configuration_type=TOTEM_CONFIGURATION_TRANSITIONAL,
> > member_list=0x80bdea0, member_list_entries=1,
> > left_list=0xbfffd360, left_list_entries=3, joined_list=0x0,
> > joined_list_entries=0, ring_id=0x80bdf78) at totempg.c:239
> > #4 0x0806856d in totemmrp_confchg_fn
> > (configuration_type=TOTEM_CONFIGURATION_TRANSITIONAL,
> > member_list=0x80bdea0, member_list_entries=1,
> > left_list=0xbfffd360, left_list_entries=3, joined_list=0x0,
> > joined_list_entries=0, ring_id=0x80bdf78) at totemmrp.c:94
> > #5 0x08063b40 in memb_state_operational_enter (instance=0x80bdd48) at
> > totemsrp.c:1392
> > #6 0x0805ef7f in message_handler_orf_token (instance=0x80bdd48,
> > system_from=0xbfffe174, msg=0x80d787c, msg_len=42,
> > endian_conversion_needed=0)
> > at totemsrp.c:2971
> > #7 0x0806144a in main_deliver_fn (context=0x80bdd48,
> > system_from=0xbfffe174, msg=0x80d787c, msg_len=34024) at totemsrp.c:3653
> > #8 0x08067e49 in active_token_recv (instance=0x80bd158, interface_no=0,
> > context=0x80bdd48, system_from=0xbfffe174, msg=0x80d787c, msg_len=42,
> > token_seqid=0) at totemrrp.c:482
> > #9 0x08067f78 in rrp_deliver_fn (context=0x80bd220,
> > system_from=0xbfffe174, msg=0x80d787c, msg_len=42) at totemrrp.c:542
> > #10 0x08069b04 in net_deliver_fn (handle=0, fd=4, revents=1,
> > data=0x80d7250, prio=0x0) at totemnet.c:688
> > #11 0x0805de25 in poll_run (handle=0) at aispoll.c:433
> > #12 0x0804a243 in main (argc=1, argv=0xbfffe3f4) at main.c:1200
> >
> > -----Original Message-----
> > From: Steven Dake [mailto:sdake@mvista.com]
> > Sent: Sunday, November 20, 2005 9:48 PM
> > To: Bajpai, Muni [RICH1:B670:EXCH]
> > Cc: Smith, Kristen [RICH1:B670:EXCH]
> > Subject: RE: [Openais] muni try this 932 take 6
> >
> > The protocol code in picacho and trunk are the same so testing with
> > either should be fine. The reason I couldn't reproduce your issues is
> > that I wasn't trying hard enough :)
> >
> > Regards
> > -steve
> > On Sun, 2005-11-20 at 01:24 -0600, Muni Bajpai wrote:
> > > Steve Thanks for the effort
> > >
> > > The reason I brought up the picacho vs. trunk issue is so that we are on
> > > the same page; last week we saw that there were some asserts you
> > > couldn't readily reproduce, with you being on trunk and me being on
> > > picacho.
> > >
> > > The problem is that I can't afford to start testing with one and then
> > > switch to another, as time is unfortunately getting thinner with our
> > > release deadline approaching.
> > >
> > > So that's the spill.
> > >
> > > I just saw that you posted release 8. Does this include the patch for
> > > 969?
> > >
> > > P.S. I will update the man pages further for defect 968.
> > >
> > > Thanks
> > >
> > > Muni
> > > -----Original Message-----
> > > From: Steven Dake [mailto:sdake@mvista.com]
> > > Sent: Saturday, November 19, 2005 1:14 PM
> > > To: Bajpai, Muni [RICH1:B670:EXCH]
> > > Cc: Smith, Kristen [RICH1:B670:EXCH]
> > > Subject: RE: [Openais] muni try this 932 take 6
> > >
> > > Muni
> > > I should have a new patch coming soon. The one I sent has a couple of
> > > bugs I found last night.
> > >
> > > All patches are against trunk, but should apply to picacho without much
> > > trouble.
> > >
> > > I've been running for about 4 hours now without an assertion with random
> > > drop on.
> > >
> > > Seems positive at least :)
> > >
> > > On Sat, 2005-11-19 at 12:38 -0600, Muni Bajpai wrote:
> > > > Steve,
> > > >
> > > > So should I apply/test this patch on trunk or picacho? (I need to know
> > > > which one to start testing with.)
> > > > Also, should I apply 969 as well, as that seems relevant?
> > > >
> > > > Thanks
> > > >
> > > > Muni
> > > >
> > > > -----Original Message-----
> > > > From: openais-bounces@lists.osdl.org
> > > > [mailto:openais-bounces@lists.osdl.org] On Behalf Of Steven Dake
> > > > Sent: Friday, November 18, 2005 2:48 PM
> > > > To: openais@lists.osdl.org
> > > > Subject: [Openais] muni try this 932 take 6
> > > >
> > > >
> > > > Muni
> > > > I've worked on the 932 patch and found another bug. I seem to be
> > > > working better now for 3 nodes.
> > > >
> > > > I have noticed one bug which I'm not sure how I will fix. Basically,
> > > > during the RECOVERY phase, too many messages are in the recovery queue,
> > > > overflowing it. These messages should be processed at one time.
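One possible reading of "processed at one time" is a bounded recovery queue that refuses new entries when full instead of overflowing, and that is drained completely in a single pass. The sketch below assumes integer message IDs, a capacity of 4 and the names `recovery_queue_add`/`recovery_queue_drain`, all hypothetical; it is not the totemsrp recovery queue.

```c
#include <assert.h>

#define RECOVERY_QUEUE_SIZE 4 /* hypothetical capacity */

/* Hypothetical fixed-size FIFO of message IDs. */
struct recovery_queue {
	int msgs[RECOVERY_QUEUE_SIZE];
	unsigned int head;  /* slot holding the oldest message */
	unsigned int count; /* number of queued messages */
};

/* Appends msg; returns -1 instead of overwriting when the queue is full. */
static int recovery_queue_add (struct recovery_queue *q, int msg)
{
	if (q->count == RECOVERY_QUEUE_SIZE) {
		return -1;
	}
	q->msgs[(q->head + q->count) % RECOVERY_QUEUE_SIZE] = msg;
	q->count++;
	return 0;
}

/* Copies every queued message to out in FIFO order, empties the queue,
 * and returns how many were delivered. out must hold
 * RECOVERY_QUEUE_SIZE entries. */
static unsigned int recovery_queue_drain (struct recovery_queue *q,
	int out[RECOVERY_QUEUE_SIZE])
{
	unsigned int delivered = 0;

	while (q->count > 0) {
		out[delivered++] = q->msgs[q->head];
		q->head = (q->head + 1) % RECOVERY_QUEUE_SIZE;
		q->count--;
	}
	return delivered;
}
```

Returning an error on a full queue turns a silent overflow into something the caller must handle, for example by draining first.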
> > > >
> > > > Please let me know if you have asserts and the asserts you see.
> > > >
> > > > I'll run this all weekend and work to get any problems sorted out.
> > > >
> > > > Regards
> > > > -steve
> > >
> > >
> >
> >
>
>
>
> ______________________________________________________________________
> _______________________________________________
> Openais mailing list
> Openais@lists.osdl.org
> https://lists.osdl.org/mailman/listinfo/openais