'Re: [Linux-ha-dev] HA process runs twice since 1.0.3...'

List:       linux-ha-dev
Subject:    Re: [Linux-ha-dev] HA process runs twice since 1.0.3...
From:       "Andrew Beekhof" <Andrew.Beekhof () novell ! com>
Date:       2003-07-29 6:57:38

Hi All,

I'm running 1.0.3 on RH9 and seem to be seeing similar behaviour.  My
problem, though, is that my service hasn't had a chance to finish
starting (as you can see below), so technically I shouldn't be returning
"OK" or "running" yet.

So my question is: is the ps trace below normal?  At one point I had 3
entries for "heartbeat: heartbeat: req_our_resources()".  If it is
normal, is there perhaps a parameter I can change so that heartbeat
doesn't try to (re)start the service so often?

I also noticed the same behaviour for a "default" install of dhcpd
which means it is probably not something in my service that is causing
the behaviour.

As an aside, is it preferred to use /etc/init.d services directly, or
to create service wrappers in /etc/ha.d/resource.d?

Any thoughts appreciated,
Andrew

 1859 ?        SL     0:00 heartbeat: heartbeat: control process
 1883 ?        SL     0:00  \_ [heartbeat]
 1884 ?        SL     0:00  \_ [heartbeat]
 1885 ?        SL     0:00  \_ [heartbeat]
 1886 ?        SL     0:00  \_ [heartbeat]
 1887 ?        SL     0:00  \_ heartbeat: heartbeat: master status process
 2166 ?        S      0:00      \_ heartbeat: heartbeat: req_our_resources()
 2167 ?        Z      0:00      |   \_ [ResourceManager <defunct>]
 2856 ?        S      0:00      |   \_ /bin/sh /usr/lib/heartbeat/req_resource 1
* 2902 ?        S      0:00      |       \_ /bin/sh /usr/lib/heartbeat/ResourceMa
* 3760 ?        S      0:00      |           \_ /bin/sh /etc/init.d/nims start
 3776 ?        S      0:00      |               \_ /bin/sh /etc/init.d/psql star
 3787 ?        S      0:00      |                   \_ initlog -q -c su -s /bin/
 3796 ?        S      0:00      |                       \_ [su]
 3842 ?        R      0:00      |                           \_ /usr/local/psql/b
 2252 ?        S      0:00      \_ heartbeat: heartbeat: req_our_resources()
 2253 ?        Z      0:00          \_ [ResourceManager <defunct>]
 2935 ?        S      0:00          \_ /bin/sh /usr/lib/heartbeat/req_resource 1
* 2996 ?        S      0:00              \_ /bin/sh /usr/lib/heartbeat/ResourceMa
* 3915 ?        S      0:00                  \_ /bin/sh /etc/init.d/nims start
 3926 ?        R      0:00                      \_ ps axf
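
For anyone wanting to check for the same symptom, here is a minimal
sketch that counts the "req_our_resources()" entries in a process
listing.  The function name count_req_our_resources is made up for this
example, and it assumes the process title contains the literal text
"req_our_resources()" as it does in the trace above.

```shell
#!/bin/sh
# Count "req_our_resources()" entries in a process listing read from stdin.
# Assumption: the process title contains the literal text shown in the
# trace above. The bracket expressions keep a live "ps | grep" pipeline
# from matching the grep command line itself.
count_req_our_resources() {
    # grep -c prints 0 but exits non-zero when nothing matches,
    # so mask the exit status.
    grep -c 'req_our_resources[(][)]' || true
}
```

Typical use would be "ps ax | count_req_our_resources"; a result above 1
would correspond to the duplicated entries in the trace.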

haresources:
############
node1 10.0.7.1 dhcpd
node1 10.0.7.2 mnt_services ifolder nims
node1 10.0.7.4 mnt_home httpd

ha.cf:
############
debugfile /var/log/ha-debug
logfile /var/log/ha-log
logfacility     local0
keepalive 1
deadtime 30
warntime 10
initdead 120
nice_failback off
udpport 694
mcast eth0 225.0.0.1 694 1 0
ping 10.0.0.138
node    node1
node    node2

>>> alanr@unix.sh 20/07/2003 10:52:46 pm >>>
Hiromitsu Sakai wrote:
> Hello, Alan,
>
> Thank you for your response.
>
>> I can see one problem with this... You have to support the status
>> operation.  You have to support it correctly.  As far as heartbeat is
>> concerned, status is important, and you have to print "OK" or
>> "running" in the output when it's running correctly, and not print
>> that when it's not running.
>
> I see...
> Which file should I add "status" support to? ha.cf?

No, your resource script.  The one you named
/etc/ha.d/resource.d/service.
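
A minimal sketch of what such a resource script might look like, to
illustrate the "status" requirement described above.  The service name
"myservice", its init script, and the PID file path are all hypothetical,
and checking a PID file is only one possible way to decide whether the
service is up.  The one point taken from this thread is that the script
prints "running" only when the service really is running; the sketch
prints "stopped" otherwise, so the output contains neither "OK" nor
"running".

```shell
#!/bin/sh
# Sketch of a heartbeat resource script with a "status" operation.
# Hypothetical: the service name "myservice" and its PID file location.
PIDFILE="${PIDFILE:-/var/run/myservice.pid}"

# Succeeds only if the PID file names a live process.
is_running() {
    [ -f "$PIDFILE" ] || return 1
    pid=$(cat "$PIDFILE" 2>/dev/null)
    [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null
}

# Print "running" only when the service is up; otherwise print a word
# that contains neither "OK" nor "running".
do_status() {
    if is_running; then
        echo "running"
    else
        echo "stopped"
    fi
}

main() {
    case "$1" in
        start)  /etc/init.d/myservice start ;;   # hypothetical init script
        stop)   /etc/init.d/myservice stop ;;
        status) do_status ;;
        *)      echo "usage: $0 {start|stop|status}" >&2; return 1 ;;
    esac
}

# Dispatch only when an operation is given, so the file can also be sourced.
if [ "$#" -gt 0 ]; then
    main "$@"
fi
```

A live PID does not prove that the service has finished starting, so a
real script may need a stronger check (for example, probing the port the
service listens on) before it reports "running".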

> I have tried editing "/usr/lib/heartbeat/ResourceManager" and
> starting HA.
> According to "/tmp/testout", this script was run TWICE.

Actually, it seems to be called 4 times:
	/usr/lib/heartbeat/ResourceManager: verifyallidle
	/usr/lib/heartbeat/ResourceManager: listkeys kiki
	/usr/lib/heartbeat/ResourceManager: status 192.168.1.3/24
	/usr/lib/heartbeat/ResourceManager: takegroup 192.168.1.3/24

Is that what you see?

The first time it gets called is by the init script to make sure that
all resources are idle when we start up ("verifyallidle").

The next time is to list the set of resource groups associated with the
server kiki ("listkeys").

The next time is to check on the status of the IP address ("status").

The last time is to take over the resources in resource group
192.168.1.3/24.  This time it gets called by the ip-request-resp
script.  When nice_failback is off, we normally send out an ip-request
packet, which causes the other side to issue an ip-request-resp packet.
This packet in turn causes the local side to run the ip-request-resp
script.  This script then runs the ResourceManager takegroup operation.

This is the only one of those operations which actually tries to
acquire resources.
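
Since takegroup is the only acquiring operation in the list above,
counting takegroup calls per resource group is one way to see whether
resources were acquired more than once.  The sketch below assumes a
trace with one line per call in the form shown above
("<path>: <operation> <argument>"); the function name count_takegroup is
made up for this example.

```shell
#!/bin/sh
# Count "takegroup" calls per resource group in a trace read from stdin.
# Assumed line format (as in the listing above):
#   /usr/lib/heartbeat/ResourceManager: takegroup 192.168.1.3/24
# Prints one "<count> <group>" line per group, sorted by group name.
count_takegroup() {
    awk '$2 == "takegroup" { n[$3]++ }
         END { for (g in n) print n[g], g }' | sort -k2
}
```

For example, "count_takegroup < /tmp/testout" would apply if that file
holds the trace in this format.  A count above 1 for a group would
suggest takegroup ran more than once for it.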

The only change I've made to the stable branch of ResourceManager was
over a year ago, and it was made before we made the 1.0.1 version.

Good debug output.  Unfortunately, it doesn't point to the source of
the problem :-(.

I take it that in the logs this time, you saw the processes being taken
over twice?

-- 
     Alan Robertson <alanr@unix.sh>

"Openness is the foundation and preservative of friendship...  Let me
claim from you at all times your undisguised opinions." - William
Wilberforce

_______________________________________________________
Linux-HA-Dev: Linux-HA-Dev@lists.community.tummy.com 
http://lists.community.tummy.com/mailman/listinfo/linux-ha-dev 
Home Page: http://linux-ha.org/

Regards,
Andrew

-----------------------------------------------------------------
Andrew Beekhof, 
Technical Specialist - Australia 
Phone: 61-3-9520-3545 Fax: 61-3-9520-3555 Mobile: 61-419-691-886