'Re: [Linux-ha-dev] Measuring time interval required for failover'

List:       linux-ha-dev
Subject:    Re: [Linux-ha-dev] Measuring time interval required for failover
From:       Lars Marowsky-Bree <lmb () suse ! de>
Date:       2007-01-07 23:15:52
Message-ID: 20070107231552.GB13938 () marowsky-bree ! de

On 2007-01-05T10:10:23, Peter Wong <peter.wong@mobidia.com> wrote:

Hi Peter,

> I'm looking for ways of measuring the average time it 
> takes for the system to failover from the active node 
> to the standby node.
> 
> I have been asked to have the system fail over from 
> node A to node B and then after node B runs for a 
> while the system would fail over from node B to node A. 
> This flip-flop scenario would be carried out for say 
> 1000 times.

Well, you need to cause a real failure to measure something relevant:
for example, run halt -nf on the node you want to kill, then measure the
time until the service is available again on the new node.

This measures the total time for your specific environment: heartbeat
detection time, STONITH latency, service recovery time, and so on.
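
A minimal sketch of the "measure until the service is available again"
step, assuming the service can be checked with a plain TCP connect. The
names tcp_probe and measure_recovery, the timeout values, and the address
in the usage note are hypothetical and not part of Heartbeat:

```python
import socket
import time


def tcp_probe(host, port, connect_timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=connect_timeout):
            return True
    except OSError:
        return False


def measure_recovery(probe, timeout=300.0, interval=0.1,
                     clock=time.monotonic, sleep=time.sleep):
    """Poll probe() until it returns True and return the elapsed seconds.

    Raises TimeoutError if the service is not back within `timeout`.
    """
    start = clock()
    while True:
        if probe():
            return clock() - start
        if clock() - start >= timeout:
            raise TimeoutError("service did not recover in time")
        sleep(interval)
```

Start the measurement right after killing the active node, for example
measure_recovery(lambda: tcp_probe("192.0.2.10", 80)), where the address
stands in for your service IP. The result is only as precise as the poll
interval plus the connect timeout, which is fine for operations that take
several seconds.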

Of course, if you want to measure the time for a graceful switch-over,
initiated by the admin, this can also be done. Simply call, say,
crm_resource -M and measure the time until the service is available
again.
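
A rough sketch of timing such a switch-over. It assumes crm_resource
accepts -r for the resource and -H for the target node alongside -M
(check the options of your version), and that you supply your own check,
here the hypothetical is_active_on_target, that only succeeds once the
target node is serving. The function names are made up for illustration:

```python
import subprocess
import time


def migrate_command(resource, target_node):
    """Build the crm_resource call; the -r/-H options are assumed."""
    return ["crm_resource", "-M", "-r", resource, "-H", target_node]


def timed_switchover(resource, target_node, is_active_on_target,
                     timeout=300.0, interval=0.1,
                     run=subprocess.run,
                     clock=time.monotonic, sleep=time.sleep):
    """Request the migration, then poll until the target node serves.

    Returns the seconds elapsed since the request was issued.
    """
    start = clock()
    run(migrate_command(resource, target_node), check=True)
    while not is_active_on_target():
        if clock() - start >= timeout:
            raise TimeoutError("resource did not become active in time")
        sleep(interval)
    return clock() - start
```

The check has to distinguish the target node from the old one, since
during a graceful move the service may still answer from the old node for
a moment after the command returns.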

1000 iterations is quite a lot; you'll likely get useful data with n=10
already. I doubt the timings will show a large deviation. (I may be
wrong, though.)
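
To check whether a small n is enough, one option is to look at the spread
of the samples you have collected so far. A small sketch using Python's
statistics module; the function name summarize is hypothetical:

```python
import statistics


def summarize(samples):
    """Return basic statistics for a list of failover times in seconds."""
    if len(samples) < 2:
        raise ValueError("need at least two samples for a deviation")
    return {
        "n": len(samples),
        "mean": statistics.mean(samples),
        "stdev": statistics.stdev(samples),  # sample standard deviation
        "min": min(samples),
        "max": max(samples),
    }
```

If the standard deviation is small relative to the mean after ten or so
runs, further iterations are unlikely to change the average much.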

> I have the following questions regarding this scenario:
> 
> 1. Has anyone done this sort of measurements before?

Not that I'm aware of. It'd be good to have something like this,
though.

> 2. Can Heartbeat handle this flip-flopping of >1000 
>    times between the nodes?

We should hope so!

> 3. Are there any scripts/code within the Heartbeat 
>    package that would assist in this situation?

Well, we have the tools to perform a switch-over (using crm_resource to
migrate the service). CTS also does things like this.

> 4. What is the correct way of measuring this time 
>    interval, between one node becomes non-operational 
>    and the other node becomes active?

See above.

> 5. In the log files produced by Heartbeat (ha-debug, 
>    ha-log), the time stamps have resolution in seconds. 
>    Is it possible to get a finer resolution, say 
>    milliseconds?

syslog-ng may be able to give you higher-resolution timestamps. However,
the operations you'll be performing easily take several seconds to
complete, so I doubt finer resolution will actually give you more
relevant or better data.

Sincerely,
    Lars

-- 
SUSE Labs, Research and Development
SUSE LINUX Products GmbH - A Novell Business	 -- Charles Darwin
"Ignorance more frequently begets confidence than does knowledge."

_______________________________________________________
Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/