
List:       linux-ha-dev
Subject:    Re: [Linux-ha-dev] Re: [Bug 957] It is a mistake for the CRM to rely
From:       Lars Marowsky-Bree <lmb@suse.de>
Date:       2005-11-30 10:09:38
Message-ID: 20051130100938.GD11720@marowsky-bree.de

On 2005-11-19T04:23:15, Joachim Banzhaf <joachimbanzhaf@compuserve.de> wrote:

> First, my point of view concerning RA returncodes. Maybe I just have to catch 
> up on you first. Feel free to just ignore my inline comments then:

Thanks for your feedback!

> monitor
> =======
> 
> there are two cases
> 
> 1) continuous monitoring
> 
> healthy (rc == 0) -> OK
> other (rc != 0) -> Error (could be stopped or undefined or unhealthy)
> potential optimization: if it is stopped cleanly (rc == 7), the usual stop 
> action following that can be avoided.

Right. This is the "A" I was referring to in my post. The optimization
basically comes for free, because we already require the monitor
operation to adhere to the spec for other reasons.

However, one can argue that because the expectation that the resource
be healthy and running (after all, we started it successfully!) is
violated, we should still treat this as a "failure" and throw in a
stop for good measure, to make sure it's indeed cleaned up completely.
As "stop" is required to be idempotent, this can't possibly have any
bad side effects.
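
To illustrate, here's a minimal sketch of a monitor along those lines.
The daemon name and pidfile path are hypothetical, not from any real
agent; the exit codes are the ones discussed above:

    # Minimal monitor sketch (hypothetical daemon and pidfile).
    # Exit codes as discussed: 0 = running/healthy, 7 = cleanly
    # stopped, anything else = failed/undefined.
    mydaemon_monitor() {
        pidfile=/var/run/mydaemon.pid           # hypothetical path
        [ -f "$pidfile" ] || return 7           # no pidfile: stopped
        pid=$(cat "$pidfile")
        kill -0 "$pid" 2>/dev/null && return 0  # alive: healthy
        return 1                                # stale pidfile: failed
    }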

> 2) initial probing
> 
> stopped (rc == 7) -> spare initial stop (this is an optional optimization)
> healthy (rc == 0) -> spare initial start (hb could adopt already running 
> service without service interruption. Can PE deal with that right now?)
> other (rc !=0 && rc != 7) -> need initial stop, just in case...

Well, actually, this isn't quite true. You're missing the fact that, if
a resource is found active _more than N times_ (where N is "1" for most
anything but clones) during probing, the recovery needed is not just an
initial stop on one node, but potentially marking the entire resource as
failed until manual intervention has fixed it.

(ie, a non-cluster filesystem found to be mounted more than once:
whoa, there be dragons, we won't touch it.)

And yes, if a resource is found to be started/healthy, we'll skip the
start. Again, this is not only an optimisation, because if it is started
and healthy on exactly one node, we can't just start it somewhere else,
but need to migrate (stop/start) it or leave it where it is. 

(This is also an important foundation for "re-attaching" after the
CRM/LRM/hb have been shut down/detached from the resources, upgraded,
and restarted.)

So, on probing, the stopped/healthy/failed distinction is quite
important.
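
Schematically, the probe handling described above amounts to something
like this (pseudo-shell, not actual CRM code; schedule_stop,
mark_resource_failed, and N are placeholders):

    # Pseudo-shell sketch of the probe handling; not actual CRM code.
    # $rc is the probe's exit code; $active is how many nodes reported
    # the resource active; $N is 1 for anything but clones.
    case "$rc" in
        7) ;;                        # stopped: skip the initial stop
        0) active=$((active + 1)) ;; # healthy: skip the start; migrate
                                     # it or leave it where it is
        *) schedule_stop ;;          # undefined: stop it, just in case
    esac
    if [ "$active" -gt "$N" ]; then
        mark_resource_failed         # e.g. a non-cluster fs mounted
    fi                               # twice: hands off until an admin
                                     # fixes it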

> problem: script could rely on other resources being started (e.g. drbd or 
> filesystem) which may not be the case at that time. 

This isn't a problem. In that case, the probe will simply return
"stopped" or "failed". ie, if the prerequisites aren't there, the
resource either can't be running at all (you can't have started an
application which resides on an unmounted filesystem), or it has
failed (something is running, but some filesystem isn't mounted, so
the app is stuck).
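
For example, a hypothetical monitor for an app living on a filesystem
can still answer sensibly while that filesystem is unmounted
(mountpoint and pidfile are made up for illustration):

    # Hypothetical: the probe stays meaningful even when a
    # prerequisite filesystem isn't mounted yet.
    app_monitor() {
        mnt=/srv/app                        # hypothetical mountpoint
        mountpoint -q "$mnt" || return 7    # fs absent: can't run
        [ -f "$mnt/app.pid" ] || return 7   # cleanly stopped
        kill -0 "$(cat "$mnt/app.pid")" 2>/dev/null && return 0
        return 1                            # leftovers: failed
    }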

> start
> =====
> 
> healthy (rc == 0) -> OK (even if it was running before)
> other (rc != 0) -> Error (could be stopped, unhealthy or undefined)
> potential optimization: if it is stopped cleanly (rc == 7), the usual stop 
> action following that can be avoided.

Well, a "start" resulting in a "stopped" exit code hasn't succeeded and
thus is still failed. But yes, it probably could do this optimization,
though I'm not big on it - if something failed, we ought to clean it up.
(Same as for the monitor above.)
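
As a sketch, a start honouring this could look as follows (reusing
the hypothetical mydaemon_monitor from above):

    # Hypothetical start: a no-op if already running; never reports 7.
    mydaemon_start() {
        mydaemon_monitor && return 0   # already running: OK
        /usr/sbin/mydaemon             # hypothetical binary
        mydaemon_monitor && return 0   # started and healthy: OK
        return 1                       # failed: let a stop clean up
    }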

> stop
> ====
> 
> stopped (rc == 0) -> OK, even if it was stopped before
> other (rc != 0) -> Error, resource fencing failed, need node fencing -> have 
> to reboot node

True. Just to clarify, the recovery policy here is configurable.

> problem: script could rely on other resources being started (e.g. drbd or 
> filesystem) which may not be the case at that time. 

Not a problem.

> From the above, I don't see why I would need to rely on return codes 
> other than zero and non-zero, except for optimizations which "only" 
> save me a stop operation. So I think the subject is right. The CRM 
> should not _rely_ on the return code 7 but should take advantage of it.

See above: for probing, this distinction is crucial for determining
whether some of our constraints have been violated.

> I have far more problems with a failed stop which admittedly may be triggered 
> by a monitor not returning 7:
> Most stop operations out in the field return rc !=0 even if resources are 
> stopped for sure (e.g binary or config not found).

Then they are broken and need to be fixed. "stop" MUST do its utmost to
stop the resource - if it can't call the binary, it must use kill, and
if SIGTERM doesn't work, SIGKILL. If ultimately it tells us that the
resource could not be stopped, then we'll escalate depending on the
policy.

There's only so much slack we can give the RAs; some requirements -
stop and start being idempotent and robust - can't be negotiated.
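
A stop honouring that contract might look like this sketch (same
hypothetical daemon as above; the timeouts are arbitrary):

    # Hypothetical stop: idempotent and escalating; only reports
    # failure if the process really can't be removed.
    mydaemon_stop() {
        pidfile=/var/run/mydaemon.pid       # hypothetical path
        [ -f "$pidfile" ] || return 0       # already stopped: OK
        pid=$(cat "$pidfile")
        kill -TERM "$pid" 2>/dev/null
        for i in 1 2 3 4 5; do              # grace period for SIGTERM
            kill -0 "$pid" 2>/dev/null || { rm -f "$pidfile"; return 0; }
            sleep 1
        done
        kill -KILL "$pid" 2>/dev/null       # SIGTERM didn't work
        sleep 1
        kill -0 "$pid" 2>/dev/null && return 1  # unstoppable: escalate
        rm -f "$pidfile"
        return 0
    }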

> This could easily happen, 
> if the service is on a drbd device or filesystem which is not yet available 
> during initial probing. In these cases the return codes would even be LSB 
> compliant. Rebooting, however, is unacceptable in these cases. So at least 
> while probing and for legacy scripts it would make sense to allow for the 
> following LSB return codes to mean success:
> 5	program is not installed
> 6	program is not configured
> and maybe (although that is not an LSB compliant returncode for stop):
> 7	program is not running
> I have not checked whether this is done right now.

Admittedly, these two (5 & 6) suck. ;-) "Not installed" doesn't mean
that no daemon is running - the daemon could have been started before
the binary was deleted, for example.

We may be better off cutting these two from the spec...
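
To illustrate why, a hypothetical agent had better consult the process
table than trust "not installed" (the daemon name is made up):

    # Hypothetical: a missing binary must not be read as "not running";
    # the daemon may have been started before the binary was deleted.
    if [ ! -x /usr/sbin/mydaemon ] && pgrep -x mydaemon >/dev/null 2>&1
    then
        : # daemon alive despite "not installed": report running, not 5
    fi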

> In general, I think it should work with legacy scripts - just in a less 
> efficient way.

"legacy" scripts which are broken continue to be broken. Nothing new
here. For example, heartbeat v1 already will "blow up" for scripts which
don't stop stopped resources. There's some heuristic in there -
if a stop fails N times, v1 tries a status and if that returns
"stopped", the reboot is avoided -, but ultimately, those are already
broken.
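
Roughly, that safety net amounts to the following (a sketch, not the
actual v1 code; run_ra, fence_node, and the retry limit are
placeholders):

    # Sketch of the v1-style safety net; not the real code.
    tries=0
    until run_ra stop; do                 # run_ra: placeholder wrapper
        tries=$((tries + 1))
        [ "$tries" -lt 3 ] && continue    # retry limit is arbitrary
        if run_ra status; then            # rc 0: still running,
            fence_node                    # unstoppable: reboot node
        fi
        break                             # status says stopped: the
    done                                  # reboot is avoided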

Same for resources being active prior to startup. Only "stopped" is
acceptable there.


Sincerely,
    Lars Marowsky-Brée <lmb@suse.de>

-- 
High Availability & Clustering
SUSE Labs, Research and Development
SUSE LINUX Products GmbH - A Novell Business
"Ignorance more frequently begets confidence than does knowledge"
	-- Charles Darwin

_______________________________________________________
Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/