'[Linux-ha-dev] Bugzilla Bug 835: 1.2.4 CTS fails on ppc64 -'


List:       linux-ha-dev
Subject:    [Linux-ha-dev] Bugzilla Bug 835: 1.2.4 CTS fails on ppc64 -
From:       Horms <horms () verge ! net ! au>
Date:       2005-08-29 8:26:05
Message-ID: 20050829082603.GA22298 () verge ! net ! au

Hi,

in my quest to get 1.2.4 out the door I have been examining
Bug 835: 1.2.4 CTS fails on ppc64 - SplitBrain error, one
of two bugs logged against 1.2.4 in the BTS. (The other is Bug 559:
1.2.4 CTS fails on ppc64 - node not stopping.)

See:
  http://www.osdl.org/developer_bugzilla/show_bug.cgi?id=835
  http://www.osdl.org/developer_bugzilla/show_bug.cgi?id=559

And in particular I have been examining the log posted by David.
I will be referring to line numbers in that log; you can get them
by downloading the file and running it through cat -n.
  http://www.osdl.org/developer_bugzilla/attachment.cgi?id=464&action=view

I believe that the problem is as follows:

  1. ipfail detects that the other side is down and issues a
     T_ASKRESOURCES message. This takes place in giveup()
     of contrib/ipfail/ipfail.c and is evident in the following
     log lines:

  Line 82  Aug 18 18:49:05 halp12 ipfail[9681]: info: giveup() called (timeout worked)
  Line 84  Aug 18 18:49:05 halp12 ipfail[9681]: debug: Message [ask_resources] sent.
  Line 85  Aug 18 18:49:05 halp12 ipfail[9681]: debug: giveup timeout has been destroyed.

  2. This is consumed by the local heartbeat process, and subsequently
     local (none) and foreign (9.3.189.190) resources are acquired by
     halp12.

     The local reception of this message can be seen from line 86

  Line 86  Aug 18 18:49:05 halp12 heartbeat[9669]: debug: Received standby message me from halp12 in state 0
  Line 87  Aug 18 18:49:05 halp12 heartbeat[9669]: debug: ask_for_resources: other now unstable
  Line 88  Aug 18 18:49:05 halp12 heartbeat[9669]: info: halp12 wants to go standby [foreign]

  3. However this message is not received by halp11 until it is
     retransmitted by halp12 around line 330

  Line 330  Aug 18 18:49:16 halp11 heartbeat[9535]: info: Retransmitting pkt 100 
  Line 331  Aug 18 18:49:16 halp11 heartbeat[9535]: info: Retransmitting pkt 101 
  ...
  Line 336  Aug 18 18:49:16 halp11 heartbeat[9535]: debug: Received standby message me from halp12 in state 0
  Line 337  Aug 18 18:49:16 halp11 heartbeat[9535]: debug: ask_for_resources: other now unstable
  Line 338  Aug 18 18:49:16 halp11 heartbeat[9535]: info: halp12 wants to go standby [foreign]

  4. At this point halp11 has going_standby=NOT and executes the
     following code around line 1579 of heartbeat/hb_resource.c

                        }else{
                                if (ANYDEBUG) {
                                        cl_log(LOG_INFO
                                        ,       "standby"
                                        ": other_holds_resources: %d"
                                        ,       other_holds_resources);
                                }
                                /* Other node wants to go standby */
                                going_standby = OTHER;
                                send_standby_msg(going_standby);
                                standby_running = add_longclock(now
                                ,       standby_rsc_to);
                        }

  Which causes a T_ASKRESOURCES other message to be sent, and
  going_standby to be set to OTHER.  We can see this on lines 341, 343
  and 344 of the log.

  Line 341  Aug 18 18:49:16 halp11 heartbeat[9535]: info: standby: other_holds_resources: 0
  ...
  Line 343  Aug 18 18:49:16 halp11 heartbeat[9535]: debug: Sending standby [other] msg
  Line 344  Aug 18 18:49:16 halp11 heartbeat[9535]: info: New standby state: 2

  5. This T_ASKRESOURCES is received by halp12, which flags an error
     because receiving a T_ASKRESOURCES other while going_standby
     is NOT is treated as invalid. This code is also in ask_for_resources().

     Lines 1592-1596 of heartbeat/hb_resource.c
                }else{          
                        message_ignored = 1;
                }       
                break;
                ...
     Lines 1690-1694 of heartbeat/hb_resource.c
        if (message_ignored){
                cl_log(LOG_ERR
                ,       "Ignored standby message '%s' from %s in state %d"
                ,       info, from, orig_standby);
        }

  And the relevant log is line 348

  Line 348  Aug 18 18:49:16 halp12 heartbeat[9669]: ERROR: Ignored standby message 'other' from halp11 in state 0

So in a nutshell halp12 sends a message to halp11 which gets delayed.
Once the two nodes can communicate again, halp11 finally gets the
message and responds.  But by that time the response to the message is
invalid and halp12 flags an error. However, it seems that the cluster is
actually in a valid state the entire time, and it would be more
appropriate for halp12 to just ignore this message, which is essentially
a delayed ack for a state it has already transitioned to.
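
To make the sequence concrete, here is a rough toy model of it in C. This
is not the code from heartbeat/hb_resource.c: the struct, the function
names and the simplified transitions are all made up for illustration.
The values NOT = 0 and OTHER = 2 match the log ("in state 0", "New
standby state: 2"); ME = 1 is my assumption. The model also assumes that
halp12 is already back in state NOT by the time halp11 replies, which is
what line 348 of the log suggests.

```c
#include <assert.h>
#include <string.h>

/* 0 and 2 are taken from the log; ME = 1 is an assumption. */
enum standby_state { NOT = 0, ME = 1, OTHER = 2 };

/* Hypothetical, heavily simplified view of one node. */
struct toy_node {
    enum standby_state going_standby;
    int ignored_errors;   /* "Ignored standby message" errors logged */
};

/*
 * Toy stand-in for the standby message handling.
 * Returns 1 if an "other" reply should be sent to the peer, else 0.
 */
int toy_handle_standby(struct toy_node *n, const char *info)
{
    assert(n != NULL && info != NULL);

    if (strcmp(info, "me") == 0 && n->going_standby == NOT) {
        /* Peer wants to go standby: step 4 in the analysis. */
        n->going_standby = OTHER;
        return 1;
    }
    if (strcmp(info, "other") == 0 && n->going_standby == ME) {
        /* The ack arrived while we were still waiting for it. */
        return 0;
    }
    /* Anything else is flagged, as in step 5. */
    n->ignored_errors++;
    return 0;
}

/* Replays steps 1-5 and returns the number of errors halp12 logs. */
int toy_replay_delayed_ack(void)
{
    struct toy_node halp11 = { NOT, 0 };
    struct toy_node halp12 = { ME, 0 };      /* step 1: halp12 asks */

    /* Steps 2-3: halp12 completes locally while its message is delayed. */
    halp12.going_standby = NOT;

    if (toy_handle_standby(&halp11, "me")) {     /* step 4: late delivery */
        toy_handle_standby(&halp12, "other");    /* step 5: late reply */
    }
    return halp12.ignored_errors;
}
```

In this model toy_replay_delayed_ack() returns 1: the only thing that
goes wrong is the error logged for the late reply, while both nodes end
up in sensible states.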

Perhaps halp12 should keep track of outstanding T_ASKRESOURCES sent
(by ipfail), that have not yet been acked by the other end. Perhaps
that is too tedious. I'm not sure, which is why I am posting here.

-- 
Horms
_______________________________________________________
Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/
