'Re: [Linux-ha-dev] Re: [Linux-ha-cvs] Linux-HA CVS: lrm by zhenh from'

[prev in list] [next in list] [prev in thread] [next in thread] 

List:       linux-ha-dev
Subject:    Re: [Linux-ha-dev] Re: [Linux-ha-cvs] Linux-HA CVS: lrm by zhenh from
From:       Huang Zhen <zhenhltc () cn ! ibm ! com>
Date:       2005-04-29 6:15:07
Message-ID: 4271D0EB.7080705 () cn ! ibm ! com
[Download RAW message or body]

Alan Robertson wrote:
> linux-ha-cvs@lists.linux-ha.org wrote:
> 
>> linux-ha CVS committal
>>
>> Author  : zhenh
>> Host    : Project : linux-ha
>> Module  : lrm
>>
>> Dir     : linux-ha/lrm/lrmd
>>
>>
>> Modified Files:
>>     lrmd.c
>>
>> Log Message:
>> when the resource was deleted during the operation performing, this 
>> would cause core dump.
>> ===================================================================
>> RCS file: /home/cvs/linux-ha/linux-ha/lrm/lrmd/lrmd.c,v
>> retrieving revision 1.102
>> retrieving revision 1.103
>> diff -u -3 -r1.102 -r1.103
>> --- lrmd.c    27 Apr 2005 21:59:37 -0000    1.102
>> +++ lrmd.c    28 Apr 2005 02:57:56 -0000    1.103
>> @@ -1,4 +1,4 @@
>> -/* $Id: lrmd.c,v 1.102 2005/04/27 21:59:37 alan Exp $ */
>> +/* $Id: lrmd.c,v 1.103 2005/04/28 02:57:56 zhenh Exp $ */
>>  /*
>>   * Local Resource Manager Daemon
>>   *
>> @@ -2504,8 +2504,14 @@
>>      }
>>  
>>      op_type = ha_msg_value(op->msg, F_LRM_OP);
>> -
>> -    snprintf(proc_name, MAX_PROC_NAME, "%s:%s", op->rsc->id, op_type);
>> +    if (NULL == g_list_find(rsc_list, op->rsc)) {
>> +        snprintf(proc_name
>> +        , MAX_PROC_NAME
>> +        , "unknown rsc(may deleted):%s"
>> +        , op_type);
>> +    }else {   
>> +        snprintf(proc_name, MAX_PROC_NAME, "%s:%s", op->rsc->id, 
>> op_type);
>> +    }
> 
> 
> 
> This is not a bad idea by itself, but it will not fix your problem, and 
> it comes closer to hiding it than anything.
> 
> Because whenever you delete a resource, you will also delete all the 
> operations on it as well.  So, in addition to the resource being 
> deleted, the operation itself has also been deleted, so you can do 
> NOTHING legal with any remaining operation pointers which might happen 
> to be laying around.
> This is a symptom of not thinking through the lifetimes of your 
> operations and resources well.
> 

How can we assign the op pointer in
NewTrackedProc(pid, 1, PT_LOGNONE, op, &ManagedChildTrackOps);
before the process exiting?

My solution for this case is kill the process and call free_op() again in
the handler of the process die. So in the second call of free_op(),
the memory of op will be freed.
At least this situation had been considered carefully before.
Even using ref count we still need to deal with this condition.

And because the free_op() may be called when deleting resource, any proc track handler
should check the rsc pointer by g_list_lookup(rsc_list).
The core dumps are caused I add more logs to the handlers recently and using the rsc pointer
without check.

> 
> I really believe that a method where things were based on reference 
> counts would be much better - and *much* less likely to cause core dumps 
> - either that or some kind of a well-thought out scheme based on a deep 
> understanding of what you're doing, when, and why.
> 
> Right now, you don't have either.  This explains why you're having 
> trouble, and why you're putting in somewhat kludgy pieces of code like 
> this in, and then closing the ticket that talks about the trouble.
> 
> This problem is NOT gone in your code.  I can still see it there.
> 
> The way reference counts work is that every time you store away a new 
> pointer to a data structure you increment the reference count on the 
> structure, and every time you zap such a pointer (set it to NULL, or 
> free the structure containing it, or whatever), you decrement the 
> reference count.  When the reference counts go to zero, you "really" 
> free the structure.
> 
> The place where this becomes a problem is if you have circular links.
> 
> For example structure "A" points to "B" which in turn points back to 
> "A". If you don't take care, then neither "A" nor "B" will ever be freed.
> 
> This can probably be managed for your case if you're thoughtful.
> 
> If you use reference counts (which I recommend), I would STRONGLY 
> suggest adding this logic to your code:
> 
>        Every time you allocate a new object of type X, you increment
>     some global counter specific for type X.
>        Every time the reference count on an object and it's counter
>         goes to zero, and you free that object, you decrement that
>         counter.
> 
>        Any time your counters differ from the number of accessible "live"
>          objects by more than "n" you print an ERROR: message.  Because
>          that means you aren't handling your reference counts correctly
>          and you have a memory leak - or maybe pointers pointing
>          at dead objects (if the difference is negative).
> 
> BUT, as of now this code STILL exists in your code.  It is not fixed, 
> and a kludge that hides it is NOT reason to close the bugzilla ticket.
> 
Ok, I will add the reference count to control the pointers.
> 

-- 
Best Regards,
Huang Zhen
LTC and Power Linux Testing
IBM China Development Lab, Beijing
Telno: (8610)82782244-2845
_______________________________________________________
Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/
[prev in list] [next in list] [prev in thread] [next in thread]