List: linux-ha-dev
Subject: Re: [Linux-ha-dev] Resubmission of the "new" db2 agent with
From: Dejan Muhamedagic <dejanmm () fastmail ! fm>
Date: 2011-03-01 14:31:09
Message-ID: 20110301143108.GA3489 () squib
Hi Holger,
On Thu, Feb 24, 2011 at 04:28:49PM +0100, Holger Teutsch wrote:
> Hi Dejan,
>
> On Thu, 2011-02-24 at 11:08 +0100, Dejan Muhamedagic wrote:
> > Hi Holger,
> >
> > On Wed, Feb 23, 2011 at 06:03:21PM +0100, Holger Teutsch wrote:
> > > Hi Dejan,
> > >
> > > On Wed, 2011-02-23 at 11:54 +0100, Dejan Muhamedagic wrote:
> > > > Hi Holger,
> > > >
> > > > On Tue, Feb 22, 2011 at 06:25:37PM +0100, Holger Teutsch wrote:
> > > > > Hi,
> > > > > I resubmit the db2 agent for inclusion into the project. Besides fixing
> ....
>
> > > > > @@ -417,8 +445,12 @@
> > > > > ocf_log err "Possible split brain ! Manual intervention required."
> > > > > ocf_log err "If this DB is outdated use \"db2 start hadr on db $db as standby\""
> > > > > ocf_log err "If this DB is the surviving primary use \"db2 start hadr on db $db as primary by force\""
> > > > > - # should we return OCF_ERR_INSTALLED instead ?
> > > > > - # might be a timing problem
> > > > > +
> > > > > + # might be a timing problem because "First active log" is delayed
> > > > > + # sleep long so we won't end up in a high speed retry loop
> > > > > + # lrmd will kill us eventually on timeout
> > > > > + # on the next start attempt we might succeed when FAL was advanced
> > > > > + sleep 36000
> > > >
> > > > Perhaps you should still remove this sleep. If there's nothing
> > > > that can be done without administrator intervention, then better
> > > > exit soon and let the cluster try to recover whichever way it can
> > > > (depending also on how it is configured).
> > > >
> > >
> > > Yes, but we can end up in a "high speed" restart loop. Instead of
> > > putting in some arbitrary sleep, I felt that relying on the
> > > administrator's timeout choice is better.
> >
> > Well, the RA should always tell the truth and in this case it
> > gives an impression that there was a timeout even though there
> > wasn't one. What is it actually that should or shouldn't happen
> > at this point? Does it want to say: "I cannot be started anymore
> > on this node"? Is that just a temporary condition? BTW, even if
> > it gets into a restart loop, that cannot make things any worse,
> > right? I can't really say, but an artificial timeout somehow
> > doesn't look right.
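Why a fast failure need not turn into an endless loop can be shown with a toy model of how the cluster bounds start retries on one node. This is not Pacemaker code; it only assumes the documented behaviour that, with the cluster property start-failure-is-fatal at its default of true, a failed start bans the resource from that node until the failure is cleaned up, and that with it set to false the node is given up once the fail count reaches the resource's migration-threshold. The function name start_attempts_on_node is hypothetical.

```shell
#!/bin/sh
# Toy model (NOT Pacemaker's implementation) of how many start attempts a
# single node gets when every start fails, e.g. a lonesome HADR primary
# whose start returns OCF_ERR_GENERIC.
#   $1: start-failure-is-fatal (true/false)
#   $2: migration-threshold (a positive integer in this toy)
start_attempts_on_node() {
    fatal=$1
    threshold=$2
    fail_count=0
    attempts=0
    while :; do
        attempts=$((attempts + 1))
        # In this scenario every start attempt fails.
        if [ "$fatal" = true ]; then
            # A fatal start failure makes the node ineligible until cleanup.
            break
        fi
        fail_count=$((fail_count + 1))
        if [ "$fail_count" -ge "$threshold" ]; then
            # Threshold reached: the cluster stops retrying on this node.
            break
        fi
    done
    echo "$attempts"
}

start_attempts_on_node true 3    # prints 1
start_attempts_on_node false 3   # prints 3
```

Real Pacemaker also weighs failure-timeout, stickiness and the other eligible nodes, so treat this only as an illustration that an honest, quick error is bounded by configuration rather than by an artificial sleep.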
>
> The scenario is:
> "I am a lonesome Primary and could not connect to my Standby during
> startup within HADR_TIMEOUT seconds"
>
> So multiple causes, multiple possible resolutions...
>
> -> no sleep, return the truth: generic error
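The resolution above ("no sleep, return the truth") amounts to something like the following minimal sketch. The helper name handle_lonesome_primary is hypothetical, and ocf_log is stubbed here; in a real agent it comes from the ocf-shellfuncs library, which also defines the standard OCF return codes used below.

```shell
#!/bin/sh
# Standard OCF return codes (normally provided by ocf-shellfuncs).
OCF_SUCCESS=0
OCF_ERR_GENERIC=1

# Stub for illustration only: the real ocf_log is defined in ocf-shellfuncs.
ocf_log() {
    level=$1
    shift
    echo "$level: $*" >&2
}

# Hypothetical helper: called when this DB came up as HADR primary but
# could not reach its standby within HADR_TIMEOUT seconds.
handle_lonesome_primary() {
    db=$1
    ocf_log err "Possible split brain ! Manual intervention required."
    ocf_log err "If this DB is outdated use \"db2 start hadr on db $db as standby\""
    ocf_log err "If this DB is the surviving primary use \"db2 start hadr on db $db as primary by force\""
    # Maybe the standby is not up yet, maybe the first active log (FAL) has
    # not advanced, maybe an administrator must act: no sleep, report a
    # generic error right away and let Pacemaker decide how to recover.
    return $OCF_ERR_GENERIC
}

handle_lonesome_primary SAMPLE
echo "rc=$?"
```

Run as-is, it prints the three log lines on stderr and rc=1 on stdout, and it returns immediately instead of waiting for the lrmd timeout.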

OK, good. I'll apply this and the previous changes to the new git
repository.
Cheers,
Dejan
> Regards
> Holger
>
>
> ------------------ reference -----------------------
> --- a/db2 Wed Feb 23 18:24:59 2011 +0100
> +++ b/db2 Thu Feb 24 16:15:55 2011 +0100
> @@ -446,11 +446,11 @@
> ocf_log err "If this DB is outdated use \"db2 start hadr on db $db as standby\""
> ocf_log err "If this DB is the surviving primary use \"db2 start hadr on db $db as primary by force\""
> + # might be the Standby is not yet there
> # might be a timing problem because "First active log" is delayed
> - # sleep long so we won't end up in a high speed retry loop
> - # lrmd will kill us eventually on timeout
> - # on the next start attempt we might succeed when FAL was advanced
> - sleep 36000
> + # on the next start attempt we might succeed when FAL was advanced
> + # might be manual intervention is required
> + # ... so let pacemaker give it another try and we will succeed then
> return $OCF_ERR_GENERIC
> ;;
>
>
>
> _______________________________________________________
> Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
> http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
> Home Page: http://linux-ha.org/