
List:       linux-ha-dev
Subject:    [Linux-ha-dev] Re: Heartbeat Problem
From:       Bill Bacher <bill_bacher () inlet ! com>
Date:       1999-10-27 14:15:24

Alan,

Thanks for the quick response. Here's some more detail:

Quoting Alan Robertson <alanr@bell-labs.com>:

> Hi Bill,
> 
> Thanks for giving heartbeat a try!
> 
> > Subject: Heartbeat Problem
> > Date: Tue, 26 Oct 1999 15:15:29 -0500
> > From: Bill Bacher <bill_bacher@inlet.com>
> > 
> > Alan,
> > 
> > We're testing Heartbeat on two machines. Both run Apache Web Servers, and
> > the intent is they will offer up the same content, load sharing by IP
> > address only. We're using heartbeat to have one carry the full load if the
> > other should die for some reason. Both are configured to take over for the
> > other under heartbeat control.
> 
> OK.  This should work out fine, as long as you have your own mechanism for
> synchronizing your web servers data.

We think we have that covered ;-)
> 
> 
> > We've tried heartbeat 0.4.5 and 0.4.5a and have seen a strange problem
> > with both versions. We're seeing the IP address being re-assigned to the
> > other machine without being under heartbeat control. With version 0.4.5,
> > we stopped heartbeat altogether and still saw the IP addresses shifting
> > between the two machines. Running 0.4.5a today, we're seeing IP transfers
> > but nothing is showing up in the heartbeat logs indicating it is
> > controlling things. It's almost as if the fake part is running on its own.
> 
> What do you mean by shifting on its own?  Do you mean it's showing up in
> ifconfig?  Can you do an ifconfig, save the output, and do another one
> later, and it has changed without showing up in the heartbeat
> (ha-log/ha-debug) logs?

What appears to be happening is that the machine that takes over the IP address
when the first one fails never actually gives it back. I had a script on
both machines that ran ifconfig -a every 30 seconds and dumped the output to a
log. When one machine went down, the ha-log showed the second machine taking
over, and the ifconfig on that machine showed the same thing. When the down
machine came back up 2 1/2 minutes later, the ha-log on the second machine
indicated it saw the first one back, but ifconfig still showed it servicing the
other IP address. Evidently, in this condition, when you hit the 1st machine
through either ssh or a web browser, it's more or less random which machine
actually services the request, which is what I was interpreting as random IP
address capture. At the risk of making this too long, I've included the logs
and the results of ifconfig at the end of this message. The clocks on the two
machines are off by a minute or two, if you're cross-referencing time stamps.
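
For anyone who wants to compare the 30-second snapshots without reading every
stanza, here is a minimal Python sketch that reduces one ifconfig -a dump to
the addresses each interface holds. It assumes the old net-tools output format
shown in this message ("Link encap:" and "inet addr:" lines); the function
names and the SAMPLE text are illustrative, not part of heartbeat.

```python
import re

# Start of an interface stanza, e.g. "eth0:0    Link encap:Ethernet ..."
IFACE_RE = re.compile(r"^(\S+)\s+Link encap:")
# Address line in the old net-tools format, e.g. "inet addr:10.3.67.220"
ADDR_RE = re.compile(r"inet addr:(\d+\.\d+\.\d+\.\d+)")


def addresses_by_interface(ifconfig_text):
    """Return {interface: ip} for every stanza that carries an inet address."""
    result = {}
    current = None
    for line in ifconfig_text.splitlines():
        stanza = IFACE_RE.match(line)
        if stanza:
            current = stanza.group(1)
            continue
        addr = ADDR_RE.search(line)
        if addr and current is not None:
            result[current] = addr.group(1)
    return result


def holds_address(ifconfig_text, ip):
    """True if any interface in this snapshot is configured with ip."""
    return ip in addresses_by_interface(ifconfig_text).values()


# Trimmed copy of the 2nd machine's snapshot from this message.
SAMPLE = """\
eth0      Link encap:Ethernet  HWaddr 00:20:35:E7:42:89
          inet addr:10.3.67.211  Bcast:10.3.67.255  Mask:255.255.255.128
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1

eth0:0    Link encap:Ethernet  HWaddr 00:20:35:E7:42:89
          inet addr:10.3.67.220  Bcast:10.3.67.255  Mask:255.255.255.128
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
"""

if __name__ == "__main__":
    print(addresses_by_interface(SAMPLE))
    print(holds_address(SAMPLE, "10.3.67.220"))
```

Running holds_address() over each snapshot from both machines would show the
window in which both of them claim the same address.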
>  
> > We're running Red Hat 6.0, Kernel 2.2.5-15smp on Dell servers with dual
> > Pentium II processors. We installed from the rpm.
> 
> Glad somebody does ;-)
>  
> > Any idea what might be going on? In one of your postings you mention
> > looking at 4 logs. What can we be watching besides ha-log, ha-debug, and
> > messages?
> 
> Two logs * two machines = 4 logs :-)
Gotcha.
> > 
> > Thanks in advance for your assistance.
> 
> This is a new one.  If you've installed the "fake" package, you shouldn't
> have.  Heartbeat does it all...

Nope, only heartbeat.
> 
> Assuming that's not the case...
> 
> I have no idea where this might be happening, so I'll tell you a little
> about what is going on so you can verify what part our code might have in
> it...
> 
> Everything related to IP address takeover and giveback takes place in
> /etc/ha.d/resource.d/IPaddr.  If it didn't happen there, then heartbeat
> didn't do it.
> 
> In general, all resource scripts look a lot like /etc/rc.d/init.d
> startup/shutdown scripts, except some of them (notably IPaddr) require
> another argument.  When you start up apache, you do this:
> /etc/rc.d/init.d/httpd start.  To take over an IP address, you do this:
> /etc/ha.d/resource.d/IPaddr ip-address start.  To give it up, you do the
> same thing with "stop" instead of "start".
> 
> If you look at the function ip_start(), you'll see that a message "INFO:
> ifconfig..." and a message "Sending Gratuitous Arp for ..." should occur
> EVERY time we take over an IP address.  If these don't occur, then there
> is an extremely high probability that we didn't perform an IP address
> takeover.
> 
> Exactly what did happen to you is harder to say...  But it seems unlikely
> that we did the dirty deed.
> 
> Please let us know what you find out.
> 
> 	Thanks!!
> 
> 	-- Alan Robertson
> 	   alanr@bell-labs.com
> 
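
The check Alan describes above (every takeover should leave an "INFO:
ifconfig..." line and a "Sending Gratuitous Arp for ..." line in ha-log) can
be sketched as a small log scan in Python. This is only an illustration under
assumptions: the regular expressions are modelled on the ha-log lines pasted
in this message, and the "IPaddr ... stop" line is assumed to mirror the
"start" line, since no giveback appears in these logs. The function names are
hypothetical.

```python
import re

# "heartbeat: <timestamp> <message>" as seen in the ha-log excerpts.
LINE_RE = re.compile(r"^heartbeat:\s+(\S+)\s+(.*)$")
# Resource script invocation; the "stop" form is assumed by symmetry.
RUN_RE = re.compile(r"Running /etc/ha\.d/resource\.d/IPaddr\s+(\S+)\s+(start|stop)\b")
IFCONFIG_RE = re.compile(r"INFO: ifconfig\s+\S+\s+(\d+\.\d+\.\d+\.\d+)")
ARP_RE = re.compile(r"Sending Gratuitous Arp for\s+(\d+\.\d+\.\d+\.\d+)")


def ipaddr_events(log_text):
    """Return (timestamp, kind, ip) tuples; kind is start, stop, ifconfig or arp."""
    events = []
    for line in log_text.splitlines():
        head = LINE_RE.match(line.strip())
        if not head:
            continue
        stamp, message = head.groups()
        run = RUN_RE.search(message)
        if run:
            events.append((stamp, run.group(2), run.group(1)))
            continue
        ifc = IFCONFIG_RE.search(message)
        if ifc:
            events.append((stamp, "ifconfig", ifc.group(1)))
            continue
        arp = ARP_RE.search(message)
        if arp:
            events.append((stamp, "arp", arp.group(1)))
    return events


def still_held(events):
    """Addresses that were started and never stopped, in sorted order."""
    held = set()
    for _stamp, kind, ip in events:
        if kind == "start":
            held.add(ip)
        elif kind == "stop":
            held.discard(ip)
    return sorted(held)


# Lines taken from the 2nd machine's ha-log in this message, one entry per line.
SAMPLE_LOG = """\
heartbeat: 1999/10/26_16:35:30 warn: node intranet002.iw.mcld.net: is dead
heartbeat: 1999/10/26_16:35:30 INFO: Running /etc/ha.d/resource.d/IPaddr 10.3.67.220 start
heartbeat: 1999/10/26_16:35:30 INFO: ifconfig eth0:0 10.3.67.220 netmask 255.255.255.128        broadcast 10.3.67.255
heartbeat: 1999/10/26_16:35:30 Sending Gratuitous Arp for 10.3.67.220 on eth0:0 [eth0]
heartbeat: 1999/10/26_16:38:00 INFO: Running /etc/ha.d/rc.d/status status
"""

if __name__ == "__main__":
    events = ipaddr_events(SAMPLE_LOG)
    for event in events:
        print(event)
    print("never given back:", still_held(events))
```

On the sample, the takeover leaves all three traces and nothing releases the
address afterwards, which matches what the ifconfig snapshots show.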



---
Bill Bacher
UNIX Network Administrator
McLeodUSA Internetworks
319.790.5056 Phone
319.369.3089 Fax



/var/log/ha-log from 10.3.67.211 (2nd machine):
heartbeat: 1999/10/26_08:56:47 info: ***********************
heartbeat: 1999/10/26_08:56:47 info: Configuration validated. Starting heartbeat.
heartbeat: 1999/10/26_08:56:47 notice: Starting serial heartbeat on tty /dev/ttyS0
heartbeat: 1999/10/26_08:56:47 notice: UDP heartbeat started on port 1001 interface eth0
heartbeat: 1999/10/26_08:56:47 error: Cannot open /proc/ha/.control: No such file or directory
heartbeat: 1999/10/26_09:01:27 info: Heartbeat shutdown in progress.
heartbeat: 1999/10/26_09:01:27 info: Giving up all HA resources.
heartbeat: 1999/10/26_09:01:27 info: All HA resources relinquished.
heartbeat: 1999/10/26_09:01:27 info: Heartbeat shutdown complete.
heartbeat: 1999/10/26_09:03:43 info: ***********************
heartbeat: 1999/10/26_09:03:43 info: Configuration validated. Starting heartbeat.
heartbeat: 1999/10/26_09:03:44 notice: Starting serial heartbeat on tty /dev/ttyS0
heartbeat: 1999/10/26_09:03:44 notice: UDP heartbeat started on port 1001 interface eth0
heartbeat: 1999/10/26_09:03:44 error: Cannot open /proc/ha/.control: No such file or directory
heartbeat: 1999/10/26_16:35:30 warn: node intranet002.iw.mcld.net: is dead
heartbeat: 1999/10/26_16:35:30 INFO: Running /etc/ha.d/rc.d/status status
heartbeat: 1999/10/26_16:35:30 Taking over resource group 10.3.67.220
heartbeat: 1999/10/26_16:35:30 Acquiring resource group: intranet002.iw.mcld.net 10.3.67.220
heartbeat: 1999/10/26_16:35:30 INFO: Running /etc/ha.d/resource.d/IPaddr 10.3.67.220 start
heartbeat: 1999/10/26_16:35:30 INFO: ifconfig eth0:0 10.3.67.220 netmask 255.255.255.128        broadcast 10.3.67.255
heartbeat: 1999/10/26_16:35:30 Sending Gratuitous Arp for 10.3.67.220 on eth0:0 [eth0]
heartbeat: 1999/10/26_16:38:00 notice: node intranet002.iw.mcld.net seq restart 1 vs 13760
heartbeat: 1999/10/26_16:38:00 info: node intranet002.iw.mcld.net: status unknown
heartbeat: 1999/10/26_16:38:00 INFO: Running /etc/ha.d/rc.d/status status

/var/log/ha-debug from 10.3.67.211 (2nd machine):
heartbeat: 1999/10/26_09:15:17 debug: Got an NS_rexmit.
heartbeat: 1999/10/26_09:15:17 debug: Got an NS_rexmit.
heartbeat: 1999/10/26_16:35:30 Running /etc/ha.d/rc.d/status: status
heartbeat: 1999/10/26_16:35:30 Starting /etc/ha.d/resource.d/IPaddr 10.3.67.220 start
heartbeat: 1999/10/26_16:35:40 /etc/ha.d/resource.d/IPaddr 10.3.67.220 start done. RC=0
heartbeat: 1999/10/26_16:38:00 Running /etc/ha.d/rc.d/status: status

Result of script that runs ifconfig -a every 30 seconds on 10.3.67.211 (2nd machine):
Tue Oct 26 16:35:32 CDT 1999
eth0      Link encap:Ethernet  HWaddr 00:20:35:E7:42:89  
          inet addr:10.3.67.211  Bcast:10.3.67.255  Mask:255.255.255.128
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:58047 errors:0 dropped:0 overruns:0 frame:0
          TX packets:17086 errors:0 dropped:0 overruns:40 carrier:18
          collisions:4693 txqueuelen:100 
          Interrupt:19 Base address:0xe4e0 

eth0:0    Link encap:Ethernet  HWaddr 00:20:35:E7:42:89  
          inet addr:10.3.67.220  Bcast:10.3.67.255  Mask:255.255.255.128
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          Interrupt:19 Base address:0xe4e0 

lo        Link encap:Local Loopback  
          inet addr:127.0.0.1  Mask:255.0.0.0
          UP LOOPBACK RUNNING  MTU:3924  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 

This continued through the night and is still the case:

Wed Oct 27 08:19:52 CDT 1999
eth0      Link encap:Ethernet  HWaddr 00:20:35:E7:42:89  
          inet addr:10.3.67.211  Bcast:10.3.67.255  Mask:255.255.255.128
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:1896892 errors:0 dropped:0 overruns:0 frame:0
          TX packets:964717 errors:0 dropped:0 overruns:44 carrier:189
          collisions:1161224 txqueuelen:100 
          Interrupt:19 Base address:0xe4e0 

eth0:0    Link encap:Ethernet  HWaddr 00:20:35:E7:42:89  
          inet addr:10.3.67.220  Bcast:10.3.67.255  Mask:255.255.255.128
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          Interrupt:19 Base address:0xe4e0 

lo        Link encap:Local Loopback  
          inet addr:127.0.0.1  Mask:255.0.0.0
          UP LOOPBACK RUNNING  MTU:3924  Metric:1
          RX packets:2 errors:0 dropped:0 overruns:0 frame:0
          TX packets:2 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 



/var/log/ha-log on 10.3.67.220 (1st machine):
heartbeat: 1999/10/26_08:55:19 info: ***********************
heartbeat: 1999/10/26_08:55:19 info: Configuration validated. Starting heartbeat.
heartbeat: 1999/10/26_08:55:19 notice: Starting serial heartbeat on tty /dev/ttyS0
heartbeat: 1999/10/26_08:55:19 notice: UDP heartbeat started on port 1001 interface eth0
heartbeat: 1999/10/26_08:55:19 error: Cannot open /proc/ha/.control: No such file or directory
heartbeat: 1999/10/26_09:00:10 warn: node freeweb.iw.mcld.net: is dead
heartbeat: 1999/10/26_09:00:10 INFO: Running /etc/ha.d/rc.d/status status
heartbeat: 1999/10/26_09:00:10 Taking over resource group 10.3.67.211
heartbeat: 1999/10/26_09:00:10 Acquiring resource group: freeweb.iw.mcld.net 10.3.67.211
heartbeat: 1999/10/26_09:00:10 INFO: Running /etc/ha.d/resource.d/IPaddr 10.3.67.211 start
heartbeat: 1999/10/26_09:00:10 INFO: ifconfig eth0:0 10.3.67.211 netmask 255.255.255.128        broadcast 10.3.67.255
heartbeat: 1999/10/26_09:00:10 Sending Gratuitous Arp for 10.3.67.211 on eth0:0 [eth0]
heartbeat: 1999/10/26_09:02:17 notice: node freeweb.iw.mcld.net seq restart 1 vs 141
heartbeat: 1999/10/26_09:02:17 info: node freeweb.iw.mcld.net: status unknown
heartbeat: 1999/10/26_09:02:17 INFO: Running /etc/ha.d/rc.d/status status
heartbeat: 1999/10/26_09:12:21 warn: node freeweb.iw.mcld.net: is dead
heartbeat: 1999/10/26_09:12:21 INFO: Running /etc/ha.d/rc.d/status status
heartbeat: 1999/10/26_09:12:21 Taking over resource group 10.3.67.211
heartbeat: 1999/10/26_09:13:50 error: 49 lost packet(s) for [freeweb.iw.mcld.net] [298:348]
heartbeat: 1999/10/26_09:13:50 info: node freeweb.iw.mcld.net: status unknown
heartbeat: 1999/10/26_09:13:50 INFO: Running /etc/ha.d/rc.d/status status
heartbeat: 1999/10/26_16:33:56 info: Heartbeat shutdown in progress.
heartbeat: 1999/10/26_16:33:56 info: Giving up all HA resources.
heartbeat: 1999/10/26_16:33:56 info: All HA resources relinquished.
heartbeat: 1999/10/26_16:33:56 info: Heartbeat shutdown complete.
heartbeat: 1999/10/26_16:36:27 info: ***********************
heartbeat: 1999/10/26_16:36:27 info: Configuration validated. Starting heartbeat.
heartbeat: 1999/10/26_16:36:27 notice: Starting serial heartbeat on tty /dev/ttyS0
heartbeat: 1999/10/26_16:36:27 notice: UDP heartbeat started on port 1001 interface eth0
heartbeat: 1999/10/26_16:36:27 error: Cannot open /proc/ha/.control: No such file or directory

/var/log/ha-debug from 10.3.67.220 (1st machine):
heartbeat: 1999/10/26_09:00:10 Running /etc/ha.d/rc.d/status: status
heartbeat: 1999/10/26_09:00:10 Starting /etc/ha.d/resource.d/IPaddr 10.3.67.211 start
heartbeat: 1999/10/26_09:00:20 /etc/ha.d/resource.d/IPaddr 10.3.67.211 start done. RC=0
heartbeat: 1999/10/26_09:02:17 Running /etc/ha.d/rc.d/status: status
heartbeat: 1999/10/26_09:12:21 Running /etc/ha.d/rc.d/status: status
heartbeat: 1999/10/26_09:13:50 debug: Got an NS_rexmit.
heartbeat: 1999/10/26_09:13:50 debug: Got an NS_rexmit.
heartbeat: 1999/10/26_09:13:51 Running /etc/ha.d/rc.d/status: status

ifconfig -a on 10.3.67.220 (1st machine)
eth0      Link encap:Ethernet  HWaddr 00:90:27:3A:80:C4  
          inet addr:10.3.67.220  Bcast:10.3.67.255  Mask:255.255.255.128
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:50719 errors:0 dropped:0 overruns:0 frame:0
          TX packets:32010 errors:0 dropped:0 overruns:0 carrier:17
          collisions:7294 txqueuelen:100 
          Interrupt:17 Base address:0xdce0 

lo        Link encap:Local Loopback  
          inet addr:127.0.0.1  Mask:255.0.0.0
          UP LOOPBACK RUNNING  MTU:3924  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 


_______________________________________________________
Linux-HA-Dev: Linux-HA-Dev@lists.tummy.com
http://lists.tummy.com/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/
