List: linux-ha
Subject: Re: [Linux-HA] problem in stonith
From: Alan Robertson <alanr () unix ! sh>
Date: 2004-03-31 22:08:42
Message-ID: 406B416A.7060607 () unix ! sh
Hayden, Charles (Charles) wrote:
> This problem was observed in heartbeat-1.0.4. I don’t know if it has
> been fixed in 2.0.
>
> If the stonith driver, for some reason, fails to execute successfully,
> it is retried. This logic is in StonithProcessDied in hb_resource.c.
>
> If the driver is misconfigured or has some other problem that persists,
> this retry goes on forever.
>
> On my system, this happens so quickly that the syslog is overwhelmed,
> and all you get is “dropped message” messages in it.
>
> In this state, it becomes impossible to stop heartbeat: “service
> heartbeat stop” either hangs or returns, but if it returns and says
> heartbeat is stopped, ps says otherwise. Now there is no way to clean
> things up without rebooting.
>
> Ideally it should retry some maximum number of times, and then quit.
> How it should report the state of the resources in this case I do not know.

There basically is nothing interesting that it can do at this point. It
cannot take over the resources. I don't know if I think it should stop,
but I agree it shouldn't flood the logs at the rate it's flooding them
now.

The moral of this story is to have a reliable STONITH mechanism.

First - STONITH should be pretty rare.
Second - a STONITH failure is a multiple failure situation.

I'm not willing to spend a lot of effort to "properly" recover from a
multiple failure - because the resulting complexity typically has negative
impacts on the reliability in normal cases. But delaying a second or two
shouldn't be a problem...
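
To make the "delay a second or two" idea concrete, here is a minimal sketch
of a retry policy that spaces out attempts and can optionally give up after
a maximum count. Every name in it (retry_policy, next_retry_delay_ms) is
hypothetical and not taken from the heartbeat sources; it only computes a
delay and says nothing about how StonithProcessDied would use it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical retry policy; none of these names exist in heartbeat. */
struct retry_policy {
    int max_attempts;   /* give up after this many failures; <= 0 means never */
    int base_delay_ms;  /* delay before the first retry */
    int max_delay_ms;   /* upper bound on the delay between retries */
};

/*
 * Given the number of failed attempts so far (>= 1), return the delay in
 * milliseconds before the next attempt, or -1 if no further attempt
 * should be made. The delay doubles per failure up to max_delay_ms.
 */
int next_retry_delay_ms(const struct retry_policy *p, int failures)
{
    assert(p != NULL);
    assert(p->base_delay_ms > 0 && p->max_delay_ms >= p->base_delay_ms);

    if (p->max_attempts > 0 && failures >= p->max_attempts)
        return -1;

    int delay = p->base_delay_ms;
    for (int i = 1; i < failures; i++) {
        /* Stop doubling once another doubling would exceed the cap. */
        if (delay > p->max_delay_ms / 2)
            return p->max_delay_ms;
        delay *= 2;
    }
    return delay;
}
```

Setting max_attempts to 0 keeps retrying indefinitely but at a bounded
rate, which addresses the log flooding without deciding whether the
retries should ever stop.
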
I'm also willing to invoke the STONITH "status" operation every few minutes
as time goes on so that we can diagnose a STONITH misconfiguration or
hardware failure before we need to use the STONITH code.
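
A rough sketch of the bookkeeping such a periodic check might need,
assuming the caller runs the STONITH "status" operation every few minutes
and passes in the result. The names (health_monitor, health_monitor_record)
are made up for illustration; the probe itself is not shown. Reporting only
on a change of state keeps a persistently broken device from filling the
logs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical health states for a STONITH device. */
enum stonith_health { HEALTH_UNKNOWN, HEALTH_OK, HEALTH_FAILED };

struct health_monitor {
    enum stonith_health last;   /* result of the previous probe */
    int consecutive_failures;   /* failed probes in a row */
};

/*
 * Record the result of one status probe. Returns true when the health
 * state changed, i.e. when the caller should write a log message.
 */
bool health_monitor_record(struct health_monitor *m, bool probe_ok)
{
    assert(m != NULL);

    enum stonith_health now = probe_ok ? HEALTH_OK : HEALTH_FAILED;
    m->consecutive_failures = probe_ok ? 0 : m->consecutive_failures + 1;

    bool changed = (now != m->last);
    m->last = now;
    return changed;
}
```
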
> Also, ideally it should wait, for instance for one second, before
> calling Initiate_Reset in StonithProcessDied.
>
> If I do this, then at least I can see what the problem is on syslog
Actually, the one thing that heartbeat can't (literally) do is sleep(3) for
a second. It needs to schedule the retry after a certain period of time
(but without a call to sleep(3)). I know you didn't say that literally,
but I wanted to whine about it being a little more complicated than that... ;-)
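
For illustration, one way to schedule a retry without calling sleep(3) is
to store a deadline and let the main loop's own wait (for example the
timeout argument of poll(2)) expire at that deadline. The sketch below
assumes a poll-style loop and a millisecond clock supplied by the caller;
the retry_timer names are hypothetical and are not how heartbeat itself is
written. In a GLib main loop, g_timeout_add() would fill the same role.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical one-shot timer owned by the event loop. */
struct retry_timer {
    bool armed;           /* true while a retry is pending */
    int64_t deadline_ms;  /* absolute time at which the retry is due */
};

/* Arm the timer to fire delay_ms after now_ms. Does not block. */
void retry_timer_arm(struct retry_timer *t, int64_t now_ms, int delay_ms)
{
    assert(t != NULL && delay_ms >= 0);
    t->armed = true;
    t->deadline_ms = now_ms + delay_ms;
}

/*
 * Timeout to pass to poll(2): -1 blocks until an fd is ready, 0 returns
 * immediately, otherwise the milliseconds left until the deadline.
 */
int retry_timer_poll_timeout(const struct retry_timer *t, int64_t now_ms)
{
    assert(t != NULL);
    if (!t->armed)
        return -1;
    if (t->deadline_ms <= now_ms)
        return 0;
    int64_t remaining = t->deadline_ms - now_ms;
    return remaining > INT_MAX ? INT_MAX : (int)remaining;
}

/* Returns true once per arming, when the deadline has passed. */
bool retry_timer_fire(struct retry_timer *t, int64_t now_ms)
{
    assert(t != NULL);
    if (!t->armed || now_ms < t->deadline_ms)
        return false;
    t->armed = false;
    return true;
}
```

The loop would call retry_timer_poll_timeout() before each wait and
retry_timer_fire() after it, so heartbeats keep being serviced while the
retry is pending.
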
--
Alan Robertson <alanr@unix.sh>
"Openness is the foundation and preservative of friendship... Let me claim
from you at all times your undisguised opinions." - William Wilberforce
_______________________________________________
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha