'RE: [Veritas-ha] VCS 1.3 - RestartLimit & OnlineRetryLimit'


List:       veritas-ha
Subject:    RE: [Veritas-ha] VCS 1.3 - RestartLimit & OnlineRetryLimit
From:       Gene Henriksen <gene.henriksen () veritas ! com>
Date:       2002-03-25 10:54:42

The only two attributes you should need to set are:

RestartLimit: how many times the agent will attempt a restart before
failing over.
ConfInterval: how long the resource has to run without faulting before
the internal restart counter is reset to zero (the default is usually 600
seconds).

The OnlineRetryLimit is used to determine whether to attempt to retry the
online operation if the resource fails to come online.
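
As a rough illustration of that retry behaviour, here is a minimal Python
sketch. It is not VCS code: bring_online and try_online are hypothetical
names, and it assumes OnlineRetryLimit counts retries made after the first
failed online attempt.

```python
from typing import Callable


def bring_online(try_online: Callable[[], bool], online_retry_limit: int) -> int:
    """Return the number of attempts used, or -1 if every attempt failed.

    try_online is a hypothetical stand-in for the agent's online operation;
    it returns True when the resource came online.
    """
    # One initial attempt, plus up to online_retry_limit retries (assumption).
    for attempt in range(1, online_retry_limit + 2):
        if try_online():
            return attempt
    return -1
```

With online_retry_limit = 0 the online operation is tried once and not
retried. Note that this only covers a resource that fails to come online; it
says nothing about a resource that faults after it has been running.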

From the 1.3 docs:
When the agent determines that the resource is faulted, it calls the clean
entry point, if implemented. This is done to verify that the resource is
completely offline. The next monitor after clean confirms the offline. The
agent then tries to online the resource again if RestartLimit is non-zero.
The agent attempts to restart the resource according to the number set in
RestartLimit before it gives up and informs the VCS engine that the
resource is faulted. However, if the resource remains online for the
interval designated in ConfInterval, earlier attempts to restart are not
counted against RestartLimit.
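
To make the interaction between the two attributes concrete, here is a small
Python simulation of the counter described in that passage. It is only a
sketch under stated assumptions, not how the agent is implemented:
simulate_faults is a hypothetical name, times are in seconds, and each
restart is assumed to bring the resource back online instantly.

```python
def simulate_faults(fault_times, restart_limit, conf_interval=600):
    """Return 'restart' or 'faulted' for each fault time, in order.

    fault_times: ascending times (seconds) at which the resource faults.
    Simplification: a restart succeeds immediately at the fault time.
    """
    outcomes = []
    restarts_used = 0
    online_since = 0
    for t in fault_times:
        # The resource stayed online for ConfInterval, so earlier restarts
        # are no longer counted against RestartLimit.
        if t - online_since >= conf_interval:
            restarts_used = 0
        if restarts_used < restart_limit:
            restarts_used += 1
            online_since = t
            outcomes.append("restart")
        else:
            # Limit exhausted: the agent reports the resource as faulted.
            outcomes.append("faulted")
            break
    return outcomes
```

With restart_limit = 1, two faults ten seconds apart give one restart and
then a fault (and so a failover for a critical resource), while two faults
three months apart each get a local restart, because the counter was reset
in between.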

-----Original Message-----
From: Peter Gurney Wickett [mailto:peter.gurney.wickett@europe.eds.com]
Sent: Monday, March 25, 2002 3:39 AM
To: veritas-ha@mailman.eng.auburn.edu
Subject: [Veritas-ha] VCS 1.3 - RestartLimit & OnlineRetryLimit

Hi everyone,

We have a two node vcs 1.3 cluster running 4 instances of Oracle on each
node - all in their own service groups.

The resource type Oracle is critical in all instances and the required
behaviour is that:

1) If an instance goes down on node A
2) vcs attempts to restart that instance on node A.
3) If this restart fails.
4) The service group of the instance in question fails over to node B.

This we have achieved by:

1) Setting each instance of the Oracle resource type to be critical.
2) Setting OnlineRetryLimit = 1 for resource type Oracle.
3) Setting RestartLimit = 1 for resource type Oracle.

My understanding is that setting the RestartLimit to a non zero value means
that vcs will attempt to restart the resource type before failing over
and that OnlineRetryLimit determines the number of times that the restart
will be attempted.

The problem we have is with the following scenario:

One instance goes down due to sql error (or whatever), it successfully
restarts.  The online retry limit for this instance has now reached one.
If oracle goes down immediately again we want it to failover - fine, it
will in this situation.  However, if three months pass and the same
instance goes down for whatever reason the entire service group will fail
over to the other node because the OnlineRetryLimit has been reached
for that resource type due to the incident three months before.  This is
not what we want.

My question is - how do we reset the 'online restart counter' (or whatever
it is called) for a given instance of a resource type?

All the best
Peter Wickett
Sistemes Unix
EDS Barcelona

_______________________________________________
Veritas-ha maillist  -  Veritas-ha@mailman.eng.auburn.edu
http://mailman.eng.auburn.edu/mailman/listinfo/veritas-ha