List: linux-ha-dev
Subject: [Linux-ha-dev] Heartbeat hangs
From: Jos Vos <jos () xos ! nl>
Date: 2004-01-27 10:19:53
Message-ID: 20040127111953.B5475 () xos037 ! xos ! nl
Hi,
On request of Lars I'm hereby posting my problem with heartbeat
(first 1.0.3, I now upgraded to 1.0.4) to this list:
One of my heartbeat nodes blocks at some point (and stops sending
out its packets). A gentle shutdown doesn't work then either; only
kill -9 helps...
This problem used to occur frequently, but after changing the deadtime
from 30 to 120 seconds (which I consider very large, too large in
fact...) it only occurs rarely. The systems often have a very
high load, which seems to trigger the problem.
I looked at the processes with strace; this is what I see (we
send heartbeat packets over eth1, no serial cable or the like):
ps output:
=========================================================================
4715 ? S 0:00 heartbeat: heartbeat: control process
4720 ? SL 0:00 heartbeat: heartbeat: write: bcast eth1
4721 ? SL 0:00 heartbeat: heartbeat: read: bcast eth1
4722 ? S 0:00 heartbeat: heartbeat: master status process
=========================================================================
strace output for all these processes (for the customer's privacy,
I changed the hostnames to xxx1 and xxx2 and the domain part to yyyyy):
=========================================================================
[root@xxx1 root]# strace -p 4715
write(5, ">>>\nt=NS_rexmit\ndest=xxx2.yyyyy"..., 156 <unfinished ...>
=========================================================================
=========================================================================
[root@xxx1 root]# strace -p 4720
recv(6, 0xbffff1b0, 4, MSG_DONTWAIT) = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=6, events=0}], 1, 0) = 0
recv(6, 0xbffff1b0, 4, MSG_DONTWAIT) = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=6, events=0}], 1, 0) = 0
poll( <unfinished ...>
=========================================================================
=========================================================================
[root@xxx1 root]# strace -p 4721
write(5, "!^!\neth1\n>>>\nt=NS_rexmit\ndest=xx"..., 165 <unfinished ...>
=========================================================================
=========================================================================
[root@xxx1 root]# strace -p 4722
write(10, ">>>\nt=NS_rexmit\ndest=xxx2.yyyyyy"..., 65) = ? ERESTARTSYS (To be restarted)
--- SIGALRM (Alarm clock) @ 0 (0) ---
write(10, ">>>\nt=NS_rexmit\ndest=xxx2.yyyyyy"..., 65) = ? ERESTARTSYS (To be restarted)
--- SIGALRM (Alarm clock) @ 0 (0) ---
write(10, ">>>\nt=NS_rexmit\ndest=xxx2.yyyyyy"..., 65) = ? ERESTARTSYS (To be restarted)
--- SIGALRM (Alarm clock) @ 0 (0) ---
...
=========================================================================
The last process is the only one I see doing something, but it
loops more or less continuously.
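
The pattern in the last trace (a write that never completes and is interrupted by SIGALRM each time) can be reproduced in isolation. What follows is a minimal sketch, not heartbeat code. It assumes that fd 10 is a pipe or FIFO whose reader has stopped draining it; the traces above do not show what the descriptor actually is. All function names are hypothetical. Note also that in C with restartable signal handlers the kernel restarts the write (which strace reports as ERESTARTSYS), whereas this Python sketch breaks out of the write by raising from the handler.

```python
import os
import select
import signal


class AlarmFired(Exception):
    """Raised by the SIGALRM handler to break out of a blocked write."""


def fill_pipe(write_fd):
    """Fill the pipe with non-blocking writes; return the bytes accepted."""
    os.set_blocking(write_fd, False)
    total = 0
    # Large chunks first, then single bytes, so no free space is left over.
    for chunk in (b"x" * 4096, b"x"):
        try:
            while True:
                total += os.write(write_fd, chunk)
        except BlockingIOError:
            pass  # EAGAIN: the pipe accepts nothing more at this chunk size
    os.set_blocking(write_fd, True)
    return total


def blocked_write_is_interrupted(write_fd, timeout=0.2):
    """Try a blocking write on a full pipe; report whether SIGALRM cut it short."""
    def on_alarm(signum, frame):
        raise AlarmFired

    old_handler = signal.signal(signal.SIGALRM, on_alarm)
    signal.setitimer(signal.ITIMER_REAL, timeout)
    try:
        os.write(write_fd, b"y" * 65)  # blocks: nobody reads the other end
        return False
    except AlarmFired:
        return True
    finally:
        signal.setitimer(signal.ITIMER_REAL, 0)  # cancel any pending alarm
        signal.signal(signal.SIGALRM, old_handler)


def demo():
    """Return (bytes_buffered, still_writable, write_was_interrupted)."""
    read_fd, write_fd = os.pipe()
    try:
        filled = fill_pipe(write_fd)
        _, writable, _ = select.select([], [write_fd], [], 0)
        interrupted = blocked_write_is_interrupted(write_fd)
        return filled, bool(writable), interrupted
    finally:
        os.close(read_fd)
        os.close(write_fd)


if __name__ == "__main__":
    print(demo())
```

If the assumption holds, a process in this state makes no progress until something reads from the other end of the pipe, which would fit a picture where the other heartbeat processes are themselves stuck in writes.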
There was no problem for months, but since the systems came under
heavy use, it occurred multiple times a day with a deadtime of 30
and now about once every two weeks with the new deadtime of 120.
Cheers,
--
-- Jos Vos <jos@xos.nl>
-- X/OS Experts in Open Systems BV | Phone: +31 20 6938364
-- Amsterdam, The Netherlands | Fax: +31 20 6948204
_______________________________________________________
Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/