'[Linux-ha-dev] Heartbeat hangs'


List:       linux-ha-dev
Subject:    [Linux-ha-dev] Heartbeat hangs
From:       Jos Vos <jos () xos ! nl>
Date:       2004-01-27 10:19:53
Message-ID: 20040127111953.B5475 () xos037 ! xos ! nl

Hi,

At Lars's request, I'm posting my problem with heartbeat
(first on 1.0.3; I have now upgraded to 1.0.4) to this list:

One of my heartbeat nodes blocks at some point (and stops sending
out its packets).  A graceful shutdown doesn't work then either;
only kill -9 helps...

This problem used to occur frequently, but after changing the
deadtime from 30 to 120 seconds (which I consider very large, too
large in fact...) it occurs only rarely.  The systems often have a
very high load, which seems to trigger the problem.

I looked at the processes with strace; this is what I see (we
send heartbeat packets over eth1, no serial cable or the like):

ps output:

=========================================================================
 4715 ?        S      0:00 heartbeat: heartbeat: control process
 4720 ?        SL     0:00 heartbeat: heartbeat: write: bcast eth1
 4721 ?        SL     0:00 heartbeat: heartbeat: read: bcast eth1
 4722 ?        S      0:00 heartbeat: heartbeat: master status process
=========================================================================

strace output for each of these processes (for the customer's
privacy, I changed the hostnames to xxx1 and xxx2 and the domain
part to yyyyy):

=========================================================================
[root@xxx1 root]# strace -p 4715
write(5, ">>>\nt=NS_rexmit\ndest=xxx2.yyyyy"..., 156 <unfinished ...>
=========================================================================

=========================================================================
[root@xxx1 root]# strace -p 4720
recv(6, 0xbffff1b0, 4, MSG_DONTWAIT)    = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=6, events=0}], 1, 0)          = 0
recv(6, 0xbffff1b0, 4, MSG_DONTWAIT)    = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=6, events=0}], 1, 0)          = 0
poll( <unfinished ...>
=========================================================================

=========================================================================
[root@xxx1 root]# strace -p 4721
write(5, "!^!\neth1\n>>>\nt=NS_rexmit\ndest=xx"..., 165 <unfinished ...>
=========================================================================

=========================================================================
[root@xxx1 root]# strace -p 4722
write(10, ">>>\nt=NS_rexmit\ndest=xxx2.yyyyyy"..., 65) = ? ERESTARTSYS (To be restarted)
--- SIGALRM (Alarm clock) @ 0 (0) ---
write(10, ">>>\nt=NS_rexmit\ndest=xxx2.yyyyyy"..., 65) = ? ERESTARTSYS (To be restarted)
--- SIGALRM (Alarm clock) @ 0 (0) ---
write(10, ">>>\nt=NS_rexmit\ndest=xxx2.yyyyyy"..., 65) = ? ERESTARTSYS (To be restarted)
--- SIGALRM (Alarm clock) @ 0 (0) ---
...
=========================================================================

The last process is the only one I see doing something, but it
loops more or less continuously.
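
One possible reading of the traces above, offered only as an
assumption and not as a diagnosis: processes 4715 and 4721 each sit
in a blocking write() on fd 5, which is how a writer looks when the
pipe or FIFO it writes to is full and nothing is draining it.  The
Python sketch below is not heartbeat code; fill_pipe is a
hypothetical helper that fills an unread pipe in non-blocking mode
and reports how many bytes the kernel accepted before a further
write would have blocked.

```python
import os


def fill_pipe(chunk_size=4096):
    """Write to a pipe nobody reads until the kernel refuses more data.

    Returns the number of bytes accepted. A blocking writer would hang
    at exactly this point instead of getting an error back.
    """
    read_fd, write_fd = os.pipe()
    # Non-blocking mode turns "would hang" into BlockingIOError (EAGAIN).
    os.set_blocking(write_fd, False)
    chunk = b"x" * chunk_size
    total = 0
    try:
        while True:
            try:
                total += os.write(write_fd, chunk)
            except BlockingIOError:
                # Pipe buffer is full; no reader has drained it.
                return total
    finally:
        os.close(read_fd)
        os.close(write_fd)


if __name__ == "__main__":
    print("pipe accepted", fill_pipe(), "bytes before a write would block")
```

If that assumption holds, checking whether the reader of fd 5 is
itself blocked would be the next thing to look at; the traces alone
do not show which process that is.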

There was no problem for months, but once the systems came under
heavy use, it occurred multiple times a day with a deadtime of 30,
and it now occurs about once every two weeks with the new deadtime
of 120.

Cheers,

-- 
--    Jos Vos <jos@xos.nl>
--    X/OS Experts in Open Systems BV   |   Phone: +31 20 6938364
--    Amsterdam, The Netherlands        |     Fax: +31 20 6948204
_______________________________________________________
Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/