
List:       keepalived-devel
Subject:    Re: [Keepalived-devel] Bug in vrrp.c for configurations with the same priority on every node?
From:       "Ryan O'Hara" <rohara@redhat.com>
Date:       2012-12-06 16:38:49
Message-ID: 50C0CA19.8030800@redhat.com

On 12/05/2012 04:06 PM, Boon Ang wrote:
> Hello Ryan,
> 
> Thanks very much for looking into this and verifying the fix.  Appreciate
> it.
> 
> Yes, it will be great if you submit the fix into the development tree.
> Will the fix only go into the next, newest (and subsequent) release
> version?

I would assume that it will go into the next release, but I can't say 
for sure. I don't maintain keepalived; I simply submit patches. There 
are several patches in my devel branch and a pull request. Hopefully 
they will all be in the next release.

One thing I noticed while experimenting with your patch is that failover 
is not as deterministic as I would have expected. For example, say my four 
nodes have the same priority and their IP addresses are as follows:

node #1: 192.168.1.1
node #2: 192.168.1.2
node #3: 192.168.1.3
node #4: 192.168.1.4

Assume that node #1 is in master state and all the others are in backup 
state. If you kill keepalived on node #1, we expect node #4 to become 
master since it has the "greatest" IP address. This is not always the 
case. More than once I have seen node #2 become the new master. Have you 
noticed this?

Ryan


> On Wed, Dec 5, 2012 at 1:05 PM, Ryan O'Hara <rohara@redhat.com> wrote:
> 
> > Boon,
> > 
> > I ran the test again, this time using physical machines. Your patch is
> > correct. Evidently the keepalived configuration I have running on virtual
> > machines is experiencing odd multicast behavior once again. Apologies for
> > any confusion.
> > 
> > If you like, I'd be happy to include your patch in my development git tree
> > and submit a pull request. Let me know.
> > 
> > Thanks for reporting this bug and providing code.
> > 
> > Ryan
> > 
> > 
> > 
> > On 12/05/2012 12:18 PM, Boon Ang wrote:
> > 
> > > Hello Ryan,
> > > 
> > > I was using a 3-node cluster, but I don't think 3 versus 4 nodes really
> > > matters here.
> > > 
> > > In my setup,
> > > * One node is configured to start as master, while the other 2 nodes
> > > start as backup.
> > > * My setup uses a health-check script for vrrp, and in my recent tests,
> > > I artificially trigger a health-check failure on the master.  (Earlier on,
> > > I had tried killing the entire master node/VM and saw the same problem of
> > > multiple masters.)
> > > * With my patch, there will still be a transient window when the other
> > > two nodes become master.  But that transient window ends when one master
> > > sees that its IP address has lower priority.  The following are excerpts
> > > from the syslogs of the 3 nodes.  This is consistent with my reading of
> > > RFC 2338.
> > > 
> > > (1) Initially, controller2-41 was the master.  I artificially triggered
> > > a health_check failure.  Here's its relevant syslog.  Note that it enters
> > > the fault state as a result.  Subsequently, health_check succeeds and it
> > > enters the backup state.
> > > 
> > > 2012-12-05T09:52:14.424800-08:00 controller2-41 Keepalived_vrrp:
> > > VRRP_Script(health_check) failed
> > > 2012-12-05T09:52:15.379385-08:00 controller2-41 Keepalived_vrrp:
> > > VRRP_Instance(VI_1) Entering FAULT STATE
> > > ...
> > > 2012-12-05T09:52:15.379956-08:00 controller2-41 Keepalived_vrrp:
> > > VRRP_Instance(VI_1) Now in FAULT state
> > > 2012-12-05T09:52:15.382356-08:00 controller2-41 Keepalived_vrrp:
> > > VRRP_Script(health_check) succeeded
> > > 2012-12-05T09:52:17.990549-08:00 controller2-41 Keepalived_vrrp:
> > > VRRP_Instance(VI_1) Entering BACKUP STATE
> > > ...
> > > 
> > > (2) As a result of this, the other two nodes respond.  First, let's look
> > > at controller3-41.  Here's its syslog.  Note that it transitions to master
> > > state and stays there.  There is an additional log line that I added for
> > > debugging purposes.  It logs the priority and addresses at the point where
> > > my patch changes the address comparison.  Without my patch, the logged
> > > saddr values from hd and vrrp showed quite clearly that there is a
> > > byte-ordering issue in the saddr comparison.
> > > 
> > > 2012-12-05T09:52:17.986025-08:00 controller3-41 Keepalived_vrrp:
> > > VRRP_Instance(VI_1) Transition to MASTER STATE
> > > 2012-12-05T09:52:17.986181-08:00 controller3-41 Keepalived_vrrp:
> > > VRRP_Instance(VI_1); priority [hd:100; vrrp: 100]; saddr [hd:0xac00529;
> > > vrrp 0xac0052b]
> > > 2012-12-05T09:52:18.986905-08:00 controller3-41 Keepalived_vrrp:
> > > VRRP_Instance(VI_1) Entering MASTER STATE
> > > ...
> > > 
> > > 
> > > (3) Finally, let's look at the 3rd node, controller1-41.  Here's the
> > > syslog.  Note that it initially transitions into the master state because
> > > there was no longer any master advertisement, just like on controller3-41.
> > > But it then receives the advertisement from controller3-41, and upon
> > > comparing the priorities and saddr's, it sees that controller3-41's
> > > advertisement has higher priority (taking into account their respective
> > > saddr's).  As a result, it enters backup state.
> > > 
> > > 2012-12-05T09:52:17.988221-08:00 controller1-41 Keepalived_vrrp:
> > > VRRP_Instance(VI_1) Transition to MASTER STATE
> > > 2012-12-05T09:52:17.988927-08:00 controller1-41 Keepalived_vrrp:
> > > VRRP_Instance(VI_1); priority [hd:100; vrrp: 100]; saddr [hd:0xac0052b;
> > > vrrp 0xac00529]
> > > 2012-12-05T09:52:17.988978-08:00 controller1-41 Keepalived_vrrp:
> > > VRRP_Instance(VI_1) Received higher prio advert
> > > 2012-12-05T09:52:17.989023-08:00 controller1-41 Keepalived_vrrp:
> > > VRRP_Instance(VI_1) Entering BACKUP STATE
> > > ...
> > > 
> > > Here's the patch I used for the above experiment:
> > > 
> > > 831,835c831
> > > <         } else {
> > > <           log_message(LOG_INFO, "VRRP_Instance(%s); priority [hd:%d; vrrp: %d]; saddr [hd:0x%x; vrrp 0x%x]",
> > > <                       vrrp->iname, hd->priority, vrrp->effective_priority,
> > > <                       ntohl(iph->saddr), ntohl(VRRP_PKT_SADDR(vrrp)));
> > > <           if (hd->priority > vrrp->effective_priority ||
> > > ---
> > > >         } else if (hd->priority > vrrp->effective_priority ||
> > > 837c833
> > > <                     ntohl(iph->saddr) > ntohl(VRRP_PKT_SADDR(vrrp)))) {
> > > ---
> > > >                     ntohl(iph->saddr) > VRRP_PKT_SADDR(vrrp))) {
> > > 854d849
> > > <           }
> > > 
> > > I guess the question is whether in your test the other 3 nodes stayed
> > > in master state, or were only in master state for a transient moment?
> > > 
> > > Thanks.
> > > Boon
> > > 
> > > 
> > > On Wed, Dec 5, 2012 at 8:55 AM, Ryan O'Hara <rohara@redhat.com> wrote:
> > > 
> > > > On 12/03/2012 12:27 PM, Boon Ang wrote:
> > > > 
> > > > > Hello,
> > > > > 
> > > > > I'm seeing what I believe is a bug in vrrp in keepalived.  I have 3
> > > > > nodes
> > > > > configured with the same priority.  When a failover occurs, I often see
> > > > > both remaining nodes becoming masters.  According to rfc2338, when the
> > > > > priority is the same, the IP addresses of the nodes should be used as
> > > > > tie
> > > > > breaker.
> > > > > 
> > > > > The version I've been experimenting with is 1.1.20.  I believe the bug
> > > > > is due to the saddr comparison in vrrp_state_master_rx, where the byte
> > > > > ordering used in the comparison is network ordering in one operand and
> > > > > host ordering in the other.  If the hosts have addresses such as
> > > > > 10.10.0.80, 10.10.0.81, and 10.10.0.82, on x86 machines (little
> > > > > endian) the comparison ends up comparing things like 10.10.0.80 vs.
> > > > > say 81.0.10.10, and each host can end up thinking its own address is
> > > > > greater.
> > > > > The following diff fixes the problem.
> > > > > 
> > > > > 833c833
> > > > > <         ntohl(iph->saddr) > ntohl(VRRP_PKT_SADDR(vrrp)))) {
> > > > > ---
> > > > > >         ntohl(iph->saddr) > VRRP_PKT_SADDR(vrrp))) {
> > > > > 
> > > > > From a quick inspection of the code, it appears to me that this
> > > > > problem is still present in 1.2.7.
> > > > > 
> > > > > Can someone please comment on this?
> > > > > 
> > > > 
> > > > I've configured four directors, all with the same priority. When I kill
> > > > keepalived on the master, the other three directors all transition to
> > > > master state. This happens even with your code change. Your code might
> > > > be correct, but it doesn't appear to be a complete solution. I'm still
> > > > investigating, so it is too soon to say what is causing this behavior.
> > > > 
> > > > How are you triggering failover? Are all of your directors configured
> > > > with "state BACKUP"?
> > > > 
> > > > Ryan
> > > > 
> > > > 
> > > > 
> > > 
> > 
> 


------------------------------------------------------------------------------
LogMeIn Rescue: Anywhere, Anytime Remote support for IT. Free Trial
Remotely access PCs and mobile devices and provide instant support
Improve your efficiency, and focus on delivering more value-add services
Discover what IT Professionals Know. Rescue delivers
http://p.sf.net/sfu/logmein_12329d2d
_______________________________________________
Keepalived-devel mailing list
Keepalived-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/keepalived-devel


