[prev in list] [next in list] [prev in thread] [next in thread]
List: linux-ha-dev
Subject: [Linux-ha-dev] Bugzilla Bug 835: 1.2.4 CTS fails on ppc64 -
From: Horms <horms () verge ! net ! au>
Date: 2005-08-29 8:26:05
Message-ID: 20050829082603.GA22298 () verge ! net ! au
[Download RAW message or body]
Hi,
in my quest to get 1.2.4 out the door I have been examining
Bug 835: 1.2.4 CTS fails on ppc64 - SplitBrain error. One
of two bugs logged against 1.2.4 in the BTS (The other is Bug 559
1.2.4 CTS fails on ppc64 - node not stopping)
See:
http://www.osdl.org/developer_bugzilla/show_bug.cgi?id=835
http://www.osdl.org/developer_bugzilla/show_bug.cgi?id=559
And in particularl I have been examining the log posted by David.
I will be refering to line numbers in that log, you can get them
by dowloading the file and running it through cat -n.
http://www.osdl.org/developer_bugzilla/attachment.cgi?id=464&action=view
I believe that the problem is as follows:
1. ipfail detects that the otherside is down and issues a
T_ASKRESOURCES message. This takes place in giveup()
of contrib/ipfail/ipfail.c and is evident in the following
log lines:
Line 82 Aug 18 18:49:05 halp12 ipfail[9681]: info: giveup() called (timeout \
worked) Line 84 Aug 18 18:49:05 halp12 ipfail[9681]: debug: Message [ask_resources] \
sent. Line 85 Aug 18 18:49:05 halp12 ipfail[9681]: debug: giveup timeout has been \
destroyed.
2. This is consumbed by the local hartbeat process, and subsequently
local (none) and foreign (9.3.189.190) resources are aquired by
halp12.
The local reception of this message can be seen from line 86
Line 86 Aug 18 18:49:05 halp12 heartbeat[9669]: debug: Received standby message me \
from halp12 in state 0 Line 87 Aug 18 18:49:05 halp12 heartbeat[9669]: debug: \
ask_for_resources: other now unstable Line 88 Aug 18 18:49:05 halp12 \
heartbeat[9669]: info: halp12 wants to go standby [foreign]
3. However this message is not recived by halp11 until it is
retransmitted by halp12 around line 330
Line 330 Aug 18 18:49:16 halp11 heartbeat[9535]: info: Retransmitting pkt 100
Line 331 Aug 18 18:49:16 halp11 heartbeat[9535]: info: Retransmitting pkt 101
...
Line 336 Aug 18 18:49:16 halp11 heartbeat[9535]: debug: Received standby message \
me from halp12 in state 0 Line 337 Aug 18 18:49:16 halp11 heartbeat[9535]: debug: \
ask_for_resources: other now unstable Line 338 Aug 18 18:49:16 halp11 \
heartbeat[9535]: info: halp12 wants to go standby [foreign]
4. At this point halp11 has going_standby=NOT and executes the
folloing code around line 1579 of heartbeat/hb_resource.c
}else{
if (ANYDEBUG) {
cl_log(LOG_INFO
, "standby"
": other_holds_resources: %d"
, other_holds_resources);
}
/* Other node wants to go standby */
going_standby = OTHER;
send_standby_msg(going_standby);
standby_running = add_longclock(now
, standby_rsc_to);
}
Which cases a T_ASKRESOURCES other message to be sent, and
going_standby to be set to other. We can see this on lines 341, 343
and 344 of the log.
Line 341 Aug 18 18:49:16 halp11 heartbeat[9535]: info: standby: \
other_holds_resources: 0
...
Line 343 Aug 18 18:49:16 halp11 heartbeat[9535]: debug: Sending standby [other] \
msg Line 344 Aug 18 18:49:16 halp11 heartbeat[9535]: info: New standby state: 2
5. This T_ASKRESOURCES is recived by halp12 which flags an error
because getting a T_ASKRESOURCES other while going_standby
is NOT, is bad. This code is also in ask_for_resources().
Line 1592-1596 of heartbeat/hb_resource.c
}else{
message_ignored = 1;
}
break;
...
Line 1690-1694 of heartbeat/hb_resource.c
if (message_ignored){
cl_log(LOG_ERR
, "Ignored standby message '%s' from %s in state %d"
, info, from, orig_standby);
}
And the relevant log is line 348
Line 348 Aug 18 18:49:16 halp12 heartbeat[9669]: ERROR: Ignored standby message \
'other' from halp11 in state 0
So in a nutshell halp12 sends a message to halp11 which gets delayed.
Once the two nodes can communicate again, halp11 finally gets the
message and responds. But by that time the response to the message is
invalid and halp12 flags an error. However, it seems that the cluster is
actually in a valid state the entire time. And it would be more
appropriate for halp12 to just ignore this message - which essentially
is a delayed ack for a state it has already transitioned to.
Perhaps halp12 should keep track of outstanding T_ASKRESOURCES sent
(by ipfail), that have not yet been acked by the other end. Perhaps
that is too tedious. I'm not sure which is why I am posting here.
--
Horms
_______________________________________________________
Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/
[prev in list] [next in list] [prev in thread] [next in thread]
Configure |
About |
News |
Add a list |
Sponsored by KoreLogic