List:       openais
Subject:    [Openais] infinite loop bug in ckpt_recovery_process_members_exit.
From:       "Muni Bajpai" <muniba () nortel ! com>
Date:       2005-11-30 0:39:45
Message-ID: CFCE7C3BDB79204092974B5B50AD719401FC45C2 () zrc2hxm0 ! corp ! nortel ! com

Steve,

Dunno how I missed this one ... anyways have attached the fix too

Thanks

Muni
-----Original Message-----
From: Steven Dake [mailto:scd@broked.org] 
Sent: Wednesday, November 23, 2005 4:33 PM
To: Bajpai, Muni [RICH1:B670:EXCH]
Cc: sdake@mvista.com; openais@lists.osdl.org; Smith, Kristen
[RICH1:B670:EXCH]
Subject: RE: [Openais] muni try this 932 take 8

Muni
This patch has been committed at revision 850 and ported to picacho at
revision 851.

On Tue, 2005-11-22 at 09:56, Muni Bajpai wrote:
> Hey Steve,
> 
> So after some intense thinking :) I don't think it is possible that the
> index can be out of bounds.
> 
> I think the real issue here is that it is possible to remove an element
> from the list via checkpoint_release while iterating through the list.
> 
> So I think we should separate the cleanup from this iteration. It adds
> one more cycle of iteration but is safer.
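
The idea of separating cleanup from iteration can be sketched roughly as below. This is only an illustration under assumptions: `struct node`, `drop_references`, `release_unreferenced`, and the list helpers are hypothetical stand-ins, not the actual openais structures or the code in ckpt.c. The first pass only touches counters; the second pass is the only place entries are unlinked, and it saves the next pointer before freeing the current entry.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a checkpoint on a circular doubly linked list. */
struct node {
	struct node *next;
	struct node *prev;
	int refcount;
};

static void list_init(struct node *head)
{
	head->next = head;
	head->prev = head;
}

static void list_add_tail(struct node *head, struct node *n)
{
	n->prev = head->prev;
	n->next = head;
	head->prev->next = n;
	head->prev = n;
}

static void list_del(struct node *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
}

/* Pass 1: adjust counters only; the list itself is never modified here. */
static void drop_references(struct node *head)
{
	struct node *n;

	for (n = head->next; n != head; n = n->next) {
		if (n->refcount > 0) {
			n->refcount--;
		}
	}
}

/* Pass 2: unlink and free entries whose count reached zero. */
static int release_unreferenced(struct node *head)
{
	struct node *n = head->next;
	int released = 0;

	while (n != head) {
		struct node *next = n->next; /* saved before n may be freed */

		if (n->refcount == 0) {
			list_del(n);
			free(n);
			released++;
		}
		n = next;
	}
	return released;
}
```

The extra pass costs one more walk over the list, but no loop ever advances through a pointer that belongs to an entry already released.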
> 
> Please review the patch
> 
> Thanks
> 
> Muni
> 
> -----Original Message-----
> From: Steven Dake [mailto:sdake@mvista.com] 
> Sent: Monday, November 21, 2005 12:57 PM
> To: Bajpai, Muni [RICH1:B670:EXCH]
> Cc: Smith, Kristen [RICH1:B670:EXCH]
> Subject: RE: [Openais] muni try this 932 take 8
> 
> Muni,
> ckpt_confchg_fn is indeed called with TOTEM_CONFIGURATION_TRANSITIONAL.
> Look at the previous call traces: they all do, and the only way to get
> into ckpt_recovery_process_members_exit is with a transitional
> configuration.
> 
> Then note that ckpt_recovery_process_members_exit is called.  I suspect
> that this function is in some way corrupting the stack including
> left_list and configuration_type from the previous stack frame.
> 
> Is it possible the call
>        memset((char*)&checkpoint->ckpt_refcount[index].addr, 0,
>               sizeof(struct in_addr));
> 
> is made with an index either negative or greater than the size of
> ckpt_refcount?  This would explain that other refcounting segfault bug I
> found.  I suggest putting an assert before that memset to see if the
> code is behaving outside your expectations.
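
A guard of the kind suggested here might look like the following sketch. The names `REFCOUNT_SLOTS`, `struct refcount_entry`, and `clear_slot` are assumptions made for illustration and are not taken from ckpt.c; only the memset call mirrors the one quoted above.

```c
#include <assert.h>
#include <string.h>
#include <netinet/in.h>

/* Hypothetical array size; stands in for the real bound of ckpt_refcount. */
#define REFCOUNT_SLOTS 16

/* Hypothetical stand-in for one ckpt_refcount element. */
struct refcount_entry {
	struct in_addr addr;
	int count;
};

static void clear_slot(struct refcount_entry *table, int index)
{
	/*
	 * Test both bounds explicitly and join them with &&.
	 * A chained form such as (0 <= index < N) parses in C as
	 * ((0 <= index) < N), which is true for every index once N > 1,
	 * so it would not catch an out-of-range value.
	 */
	assert(index >= 0 && index < REFCOUNT_SLOTS);

	table[index].count = 0;
	memset(&table[index].addr, 0, sizeof(struct in_addr));
}
```

With asserts enabled, an out-of-range index aborts at the check instead of silently overwriting whatever sits next to the array.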
> 
> Were you running with RANDOM_DROP set?
> 
> Regards
> -steve
> 
> On Mon, 2005-11-21 at 10:05 -0600, Muni Bajpai wrote:
> > Steve,
> > 
> > This just doesn't make sense. So a segfault happened in the ckpt
> > service after about 6 hours of traffic
> > 
> > Things to note in the following trace
> > 1.) ckpt_recovery_process_members_exit is called from ckpt_confchg_fn
> > ONLY if configuration_type==TOTEM_CONFIGURATION_TRANSITIONAL, which
> > according to the trace it is NOT ??? It's a simple if check.
> > 2.) The left_list pointer has changed from #1 to #0 ???
> > 
> > See the left_list pointer is clobbered even though there is no
> > manipulation of that pointer in ckpt_confchg_fn. 
> > 
> > The left_list array is initialized in memb_state_operational_enter and
> > has function scope, hence will not change until the call returns.
> > 
> > I dunno how this is possible  ?
> > 
> > Thanks
> > 
> > Muni
> > 
> > #0  ckpt_recovery_process_members_exit (left_list=0x8394,
> > left_list_entries=3) at ckpt.c:566
> > #1  0x08053795 in ckpt_confchg_fn
> > (configuration_type=TOTEM_CONFIGURATION_REGULAR, member_list=0x80bdea0,
> > member_list_entries=1, left_list=0xbfffd360,
> >     left_list_entries=3, joined_list=0x0, joined_list_entries=0,
> > ring_id=0xd9bf682f) at ckpt.c:1127
> > #2  0x0804ab2e in confchg_fn
> > (configuration_type=TOTEM_CONFIGURATION_TRANSITIONAL,
> > member_list=0x80bdea0, member_list_entries=1, left_list=0xbfffd360,
> >     left_list_entries=3, joined_list=0x0, joined_list_entries=0,
> > ring_id=0x80bdf78) at main.c:903
> > #3  0x08066397 in totempg_confchg_fn
> > (configuration_type=TOTEM_CONFIGURATION_TRANSITIONAL,
> > member_list=0x80bdea0, member_list_entries=1,
> >     left_list=0xbfffd360, left_list_entries=3, joined_list=0x0,
> > joined_list_entries=0, ring_id=0x80bdf78) at totempg.c:239
> > #4  0x0806856d in totemmrp_confchg_fn
> > (configuration_type=TOTEM_CONFIGURATION_TRANSITIONAL,
> > member_list=0x80bdea0, member_list_entries=1,
> >     left_list=0xbfffd360, left_list_entries=3, joined_list=0x0,
> > joined_list_entries=0, ring_id=0x80bdf78) at totemmrp.c:94
> > #5  0x08063b40 in memb_state_operational_enter (instance=0x80bdd48)
> > at totemsrp.c:1392
> > #6  0x0805ef7f in message_handler_orf_token (instance=0x80bdd48,
> > system_from=0xbfffe174, msg=0x80d787c, msg_len=42,
> > endian_conversion_needed=0)
> >     at totemsrp.c:2971
> > #7  0x0806144a in main_deliver_fn (context=0x80bdd48,
> > system_from=0xbfffe174, msg=0x80d787c, msg_len=34024)
> > at totemsrp.c:3653
> > #8  0x08067e49 in active_token_recv (instance=0x80bd158,
> > interface_no=0, context=0x80bdd48, system_from=0xbfffe174,
> > msg=0x80d787c, msg_len=42,
> >     token_seqid=0) at totemrrp.c:482
> > #9  0x08067f78 in rrp_deliver_fn (context=0x80bd220,
> > system_from=0xbfffe174, msg=0x80d787c, msg_len=42) at totemrrp.c:542
> > #10 0x08069b04 in net_deliver_fn (handle=0, fd=4, revents=1,
> > data=0x80d7250, prio=0x0) at totemnet.c:688
> > #11 0x0805de25 in poll_run (handle=0) at aispoll.c:433
> > #12 0x0804a243 in main (argc=1, argv=0xbfffe3f4) at main.c:1200
> > 
> > -----Original Message-----
> > From: Steven Dake [mailto:sdake@mvista.com] 
> > Sent: Sunday, November 20, 2005 9:48 PM
> > To: Bajpai, Muni [RICH1:B670:EXCH]
> > Cc: Smith, Kristen [RICH1:B670:EXCH]
> > Subject: RE: [Openais] muni try this 932 take 6
> > 
> > The protocol code in picacho and trunk are the same so testing with
> > either should be fine.  The reason I couldn't reproduce your issues
> > is that I wasn't trying hard enough :)
> > 
> > Regards
> > -steve
> > On Sun, 2005-11-20 at 01:24 -0600, Muni Bajpai wrote:
> > > Steve Thanks for the effort
> > > 
> > > The reason I brought up the picacho vs trunk issue is so that we
> > > are on the same page, as we saw last week that there were some
> > > asserts you couldn't reproduce readily with you being on trunk and
> > > me being on picacho.
> > > 
> > > The problem is that I can't start testing with one and switch to
> > > another, as time unfortunately is getting thinner with our release
> > > deadline approaching.
> > > 
> > > So that's the spiel.
> > > 
> > > I just saw that you posted release 8. Does this include the patch
> > > for 969?
> > > 
> > > P.S will update the man pages further for defect 968.
> > > 
> > > Thanks
> > > 
> > > Muni
> > > -----Original Message-----
> > > From: Steven Dake [mailto:sdake@mvista.com] 
> > > Sent: Saturday, November 19, 2005 1:14 PM
> > > To: Bajpai, Muni [RICH1:B670:EXCH]
> > > Cc: Smith, Kristen [RICH1:B670:EXCH]
> > > Subject: RE: [Openais] muni try this 932 take 6
> > > 
> > > Muni
> > > I should have a new patch coming soon.  The one I sent has a couple
> > > of bugs I found last night.
> > > 
> > > All patches are against trunk, but should apply to picacho without
> > > much trouble.
> > > 
> > > I've been running for about 4 hours now without an assertion with
> > > random drop on.
> > > 
> > > Seems positive at least :)
> > > 
> > > On Sat, 2005-11-19 at 12:38 -0600, Muni Bajpai wrote:
> > > > Steve,
> > > > 
> > > > So should I apply/test this patch to trunk or picacho? (Need to
> > > > know which one to start testing with)
> > > > Also should I apply 969, as that seems relevant?
> > > > 
> > > > Thanks
> > > > 
> > > > Muni
> > > > 
> > > > -----Original Message-----
> > > > From: openais-bounces@lists.osdl.org
> > > > [mailto:openais-bounces@lists.osdl.org] On Behalf Of Steven Dake
> > > > Sent: Friday, November 18, 2005 2:48 PM
> > > > To: openais@lists.osdl.org
> > > > Subject: [Openais] muni try this 932 take 6
> > > > 
> > > > 
> > > > Muni
> > > > I've worked on the 932 patch and found another bug.  It seems to
> > > > be working better now for 3 nodes.
> > > > 
> > > > I have noticed one bug which I'm not sure how I will fix.
> > > > Basically during the RECOVERY phase, too many messages are in the
> > > > recovery queue, overflowing it.  These messages should be
> > > > processed at one time.
> > > > 
> > > > Please let me know if you have asserts and the asserts you see.
> > > > 
> > > > I'll run this all weekend and work to get any problems sorted out
> > > > over the weekend.
> > > > 
> > > > Regards
> > > > -steve
> > > 
> > > 
> > 
> > 
> 
> 
> 
> ______________________________________________________________________
> _______________________________________________
> Openais mailing list
> Openais@lists.osdl.org
> https://lists.osdl.org/mailman/listinfo/openais



["ckpt.patch" (application/octet-stream)]

diff -uNr --exclude=svn --exclude=.svn --exclude=SCCS --exclude=BitKeeper \
--exclude=ChangeSet --exclude=init --exclude=LICENSE --exclude=Makefile --exclude=man \
--exclude=README.devmap --exclude=SECURITY --exclude=TODO --exclude=CHANGELOG \
--exclude=conf --exclude=loc --exclude=Makefile.samples --exclude=QUICKSTART \
--exclude=.cdtproject --exclude=.project --exclude=nortel.patch \
                openais/branches/picacho/exec/ckpt.c picacho/exec/ckpt.c
--- openais/branches/picacho/exec/ckpt.c	2005-11-29 17:20:22 -06:00
+++ picacho/exec/ckpt.c	2005-11-29 18:37:20 -06:00
@@ -1022,19 +1022,19 @@
 	 */
 	member = left_list;
 	for (i = 0; i < left_list_entries; i++) {		
-		for (checkpoint_list = checkpoint_list_head.next;
-			checkpoint_list != &checkpoint_list_head;
-			checkpoint_list = checkpoint_list->next) {
+		checkpoint_list = checkpoint_list_head.next;
 
+iterate_while_loop:
+		while (checkpoint_list != &checkpoint_list_head) {
 			checkpoint = list_entry (checkpoint_list,
 				struct saCkptCheckpoint, list);			
 			assert (checkpoint > 0);
-			index = processor_index_find(member, checkpoint->ckpt_refcount);			
+			index = processor_index_find(member, checkpoint->ckpt_refcount);
+			assert (index >= -1 && index < PROCESSOR_COUNT_MAX);			
 			if (index < 0) {
 				checkpoint_list = checkpoint_list->next;
-				continue;
+				goto iterate_while_loop;
 			}		
-			assert ( index < PROCESSOR_COUNT_MAX);
 			/*
 			 * Decrement
 			 * 
@@ -1051,6 +1051,7 @@
 			}		
 			checkpoint->ckpt_refcount[index].count = 0;
 			memset((char*)&checkpoint->ckpt_refcount[index].addr, 0, sizeof(struct in_addr));			
+			checkpoint_list = checkpoint_list->next;
 		}
 		member++;
 	}



_______________________________________________
Openais mailing list
Openais@lists.osdl.org
https://lists.osdl.org/mailman/listinfo/openais

