List:       linux-ha-dev
Subject:    Re: [Linux-ha-dev] Monitoring resources
From:       Alan Robertson <alanr () unix ! sh>
Date:       2002-02-05 22:01:54

Ragnar Kjørstad wrote:
> 
> On Tue, Feb 05, 2002 at 08:30:44AM -0700, Alan Robertson wrote:
> > Here are the basic requirements for a resource monitoring service:
> 
> IMHO there is no need to split resource management and monitoring in
> heartbeat, because they will be very tightly coupled and the monitoring
> part is fairly small.
> 
> Or do you disagree?

Alan disagrees ;-)  Here's my view FWIW:

There are MANY different cluster/resource management paradigms.  There are
fewer resource monitoring paradigms.  One thing that Mon does well that I
neglected to mention is handling dependencies.  In the long term, one would
want the resource monitoring code to do other things like monitor fan
speeds, temperatures, S.M.A.R.T. information from disks, etc.

At this point it becomes a lot more interesting.

I don't think coupling it to the policy and decision making process is
desirable, or necessary.  What I think you want it to do is spit out
"uplink" messages that say things like:
		resource foo failed
and	resource foo now working (again)
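
A rough sketch of that idea, assuming nothing about heartbeat or Mon
internals: the monitor only tracks whether each resource is up and sends
a message upward when that changes.  It makes no policy decisions.  The
names ResourceMonitor, report and uplink are made up for illustration.

```python
from typing import Callable, Dict


class ResourceMonitor:
    """Hypothetical monitor that reports state changes and nothing else."""

    def __init__(self, uplink: Callable[[str], None]) -> None:
        # `uplink` is whatever carries messages to the cluster manager;
        # here it is just a callable taking one string (an assumption).
        self._uplink = uplink
        self._healthy: Dict[str, bool] = {}

    def report(self, resource: str, ok: bool) -> None:
        """Record one check result; emit a message only on a transition."""
        previous = self._healthy.get(resource, True)  # assume healthy at start
        self._healthy[resource] = ok
        if previous and not ok:
            self._uplink(f"resource {resource} failed")
        elif ok and not previous:
            self._uplink(f"resource {resource} now working (again)")


messages = []
monitor = ResourceMonitor(messages.append)
for result in (True, False, False, True):
    monitor.report("foo", result)
```

Whatever listens on the uplink is free to do standby, failover, or
nothing at all; the monitor never needs to know which.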

You also have things like ECC RAM errors that need to be filtered -
typically using a leaky bucket filter.  Ultimately, all of these things
ought to be accommodated.  See GoAhead's software for more ideas.  Or see
any solid telecomm switch.  It's old technology - but it works ;-)
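
For anyone who hasn't met a leaky bucket filter, here is a minimal
sketch.  Each error adds one unit to a bucket that drains at a steady
rate, and an alarm is raised only when the bucket overflows, so isolated
errors are ignored and bursts are caught.  The class name, the capacity
and the leak rate are assumptions chosen for illustration, not values
from GoAhead or any real switch.

```python
from dataclasses import dataclass


@dataclass
class LeakyBucket:
    capacity: float       # level above which we raise an alarm (assumed)
    leak_per_sec: float   # how quickly old errors are forgotten (assumed)
    level: float = 0.0
    last_time: float = 0.0

    def record_error(self, now: float) -> bool:
        """Add one error at time `now`; return True if the bucket overflows."""
        elapsed = max(0.0, now - self.last_time)
        # Drain the bucket for the time that has passed, never below empty.
        self.level = max(0.0, self.level - elapsed * self.leak_per_sec)
        self.last_time = now
        self.level += 1.0
        return self.level > self.capacity


bucket = LeakyBucket(capacity=3.0, leak_per_sec=0.1)

# One ECC error a minute: the bucket drains in between, so no alarm.
sparse = [bucket.record_error(t) for t in (0.0, 60.0, 120.0)]

# Four errors in four seconds, much later: the last one overflows.
burst = [bucket.record_error(t) for t in (1000.0, 1001.0, 1002.0, 1003.0)]
```

Only the overflow would be turned into an uplink "failed" message; the
individual errors would just be logged.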

I don't want the two joined.

[snip] 
> >
> >       If the monitor operation fails, log it, and ask the cluster manager
> >               to go into standby mode.
> 
> I don't know exactly what you mean by "standby mode", but there is no
> need to move other services away (except those that depend on the failed
> one). I think if the resource management part is rewritten anyway it
> would be a good idea to include the concept of resource groups, like in
> failsafe.

We have resource groups.  We don't have a good resource/cluster manager.

"Standby" is really a term from the current resource/cluster manager.
It's all it can do.  But it's a lot better than what we have now.  It's
not really enough for everyone, but my guess is that it's enough for half
of the current heartbeat users.

Ultimately, we need to do better, but that's not going to be solved in the
current cluster/resource manager.  Do standby for now, do something better
with a better resource/cluster manager.

Probably want to base the new resource manager on Ram Pai's group
communication paradigm.

	-- Alan Robertson
	   alanr@unix.sh
_______________________________________________________
Linux-HA-Dev: Linux-HA-Dev@lists.community.tummy.com
http://lists.community.tummy.com/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/