List:       linux-ha
Subject:    Re: [Linux-HA] problem in stonith
From:       Alan Robertson <alanr () unix ! sh>
Date:       2004-03-31 22:08:42
Message-ID: 406B416A.7060607 () unix ! sh

Hayden, Charles (Charles) wrote:
> This problem was observed in heartbeat-1.0.4.  I don’t know if it has 
> been fixed in 2.0.
> 
> If the stonith driver, for some reason, fails to execute successfully, 
> it is retried.  This logic is in StonithProcessDied in hb_resource.c.
> 
> If the driver is misconfigured or has some other problem that persists, 
> this retry goes on forever.
> 
> On my system, this happens so quickly that the syslog is overwhelmed, 
> and all you get is “dropped message” messages in it.
> 
> In this state, it becomes impossible to stop heartbeat:  “service 
> heartbeat stop” either hangs or returns, but if it returns and says 
> heartbeat is stopped, ps says otherwise.  Now there is no way to clean 
> things up without rebooting.
> 
> Ideally it should retry some maximum number of times, and then quit.  
> How it should report the state of the resources in this case I do not know.


There basically is nothing interesting that it can do at this point.  It 
cannot take over the resources.  I don't know if I think it should stop, 
but I agree it shouldn't flood the logs at the rate it's flooding them 
now.

The moral of this story is to have a reliable STONITH mechanism.

	First - STONITH should be pretty rare.
	Second - a STONITH failure is a multiple failure situation

I'm not willing to spend a lot of effort to "properly" recover from a 
multiple failure - because the resulting complexity typically has negative 
impacts on the reliability in normal cases.  But delaying a second or two 
shouldn't be a problem...
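
As a rough illustration of what "delaying a second or two" between retries 
could look like, here is a minimal sketch of a capped, doubling retry 
delay.  The function name and both constants are hypothetical; nothing 
like them exists in heartbeat, and the one-second base and one-minute cap 
are assumptions chosen only to keep the logs readable.

```c
#include <assert.h>

/* Hypothetical tunables; neither name exists in heartbeat. */
#define STONITH_RETRY_BASE_MS 1000U   /* first retry waits one second */
#define STONITH_RETRY_MAX_MS  60000U  /* never wait longer than a minute */

/*
 * Return the delay before retry number `attempt` (0 = first retry).
 * The delay doubles on each attempt and saturates at the cap, so a
 * persistently failing driver is retried at most once a minute.
 */
unsigned int
stonith_retry_delay_ms(unsigned int attempt)
{
	unsigned int delay = STONITH_RETRY_BASE_MS;

	assert(STONITH_RETRY_BASE_MS <= STONITH_RETRY_MAX_MS);
	/* Stop doubling once the cap is reached, so large counts cannot overflow. */
	while (attempt-- > 0 && delay < STONITH_RETRY_MAX_MS) {
		delay *= 2;
	}
	return delay < STONITH_RETRY_MAX_MS ? delay : STONITH_RETRY_MAX_MS;
}
```

Note that this never gives up; it only slows down, which matches the 
point above that there is nothing more useful to do than keep trying.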

I'm also willing to invoke the STONITH "status" operation every few minutes 
as time goes on so that we can diagnose a STONITH misconfiguration or 
hardware failure before we need to use the STONITH code.
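
A periodic "status" check of this kind might be tracked roughly as 
follows.  This is only a sketch: the struct, the enum and both function 
names are made up for illustration, the actual probe of the STONITH device 
is left to the caller, and reporting only on state changes is my 
assumption about how to avoid a second source of log flooding.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical health states for the STONITH device. */
enum stonith_health { HEALTH_UNKNOWN, HEALTH_OK, HEALTH_BAD };

/* Hypothetical bookkeeping; times are seconds on any monotonic clock. */
struct status_monitor {
	unsigned long       interval_s;    /* e.g. 300 for "every few minutes" */
	unsigned long       next_check_s;  /* earliest time of the next probe */
	enum stonith_health last;          /* result of the previous probe */
};

/* True when the interval has elapsed and a status probe should run now. */
bool
status_check_due(const struct status_monitor *m, unsigned long now_s)
{
	return now_s >= m->next_check_s;
}

/*
 * Record a probe result and schedule the next probe.  Returns true only
 * when the health state changed, so the caller logs once per transition
 * instead of once per probe.
 */
bool
status_record(struct status_monitor *m, unsigned long now_s, bool probe_ok)
{
	enum stonith_health now = probe_ok ? HEALTH_OK : HEALTH_BAD;
	bool changed = (now != m->last);

	assert(m->interval_s > 0);
	m->last = now;
	m->next_check_s = now_s + m->interval_s;
	return changed;
}
```

The caller would run the device's status operation whenever 
status_check_due() returns true, pass the outcome to status_record(), and 
write a log message only when that returns true.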

> Also, ideally it should wait, for instance for one second, before 
> calling Initiate_Reset in StonithProcessDied.
> 
> If I do this, then at least I can see what the problem is on syslog

Actually, the one thing that heartbeat can't (literally) do is sleep(3) for 
a second.  It needs to schedule the retry after a certain period of time 
(but without a call to sleep(3)).  I know you didn't say that literally, 
but I wanted to whine about it being a little more complicated than that... ;-)
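
To make "schedule the retry" concrete, here is a minimal sketch of a 
one-shot timer that the main loop polls on each pass, so nothing ever 
blocks.  All names are hypothetical and this is not heartbeat's code; an 
event-loop library with timers (GLib's g_timeout_add, for instance) would 
normally provide this, and the sketch avoids any dependency only to stay 
self-contained.  The count_retry callback is a stand-in for whatever 
would really re-run the reset.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical one-shot timer; times are milliseconds on a monotonic clock. */
struct retry_timer {
	bool           armed;
	unsigned long  due_ms;
	void         (*callback)(void *arg);
	void          *arg;
};

/* Arm the timer to fire `delay_ms` after `now_ms`.  Returns immediately. */
void
retry_timer_arm(struct retry_timer *t, unsigned long now_ms,
                unsigned long delay_ms, void (*callback)(void *), void *arg)
{
	assert(t != NULL && callback != NULL);
	t->armed = true;
	t->due_ms = now_ms + delay_ms;
	t->callback = callback;
	t->arg = arg;
}

/*
 * Called once per pass of the main loop.  Fires the callback at most
 * once, when the deadline has passed; returns true if it fired.
 */
bool
retry_timer_poll(struct retry_timer *t, unsigned long now_ms)
{
	if (!t->armed || now_ms < t->due_ms) {
		return false;
	}
	t->armed = false;       /* disarm first so the callback may re-arm */
	t->callback(t->arg);
	return true;
}

/* Stand-in callback: counts retries instead of resetting a node. */
void
count_retry(void *arg)
{
	++*(int *)arg;
}
```

The failure handler would call retry_timer_arm() and return at once; the 
loop keeps servicing everything else and the retry runs on a later pass 
through retry_timer_poll().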

-- 
     Alan Robertson <alanr@unix.sh>

"Openness is the foundation and preservative of friendship...  Let me claim 
from you at all times your undisguised opinions." - William Wilberforce

_______________________________________________
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha