List:       openais
Subject:    Re: [Openais] DRBD+pacemaker slave not coming back online after
From:       Mark Steele <msteele () beringmedia ! com>
Date:       2010-01-29 21:54:34
Message-ID: aa3794b1001291354i16b61144kd8c2226af7e1e23a () mail ! gmail ! com

Hi Andrew,

I fixed it by removing the on-fail=standby and using
migration-threshold="1". It now behaves as we expect it to.
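
For reference, the change amounts to something like this (a sketch based on the
qs_rabbitmq primitive from my original config quoted below; the other group
members were edited the same way):

# on-fail=standby removed from the monitor op; migration-threshold moves the
# resource away after a single failure instead of putting the node in standby
primitive qs_rabbitmq ocf:bering:rabbitmq \
        op start timeout=120s \
        op stop timeout=300s \
        op monitor interval=60s \
        meta migration-threshold="1" failure-timeout="60"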

Unfortunately I've since re-imaged the test boxes, so no dice for hb_report.

If you do want to try to reproduce, I was running on Gentoo and installed
everything from source:

Corosync 1.1.2
openais 1.1.0
Cluster-Resource-Agents-4ac8bf7a64fe
Pacemaker-1-0-6695fd350a64
Reusable-Cluster-Components-2905a7843039
DRBD 8.3.6

Cheers,

Mark


On Fri, Jan 29, 2010 at 3:32 AM, Andrew Beekhof <andrew@beekhof.net> wrote:

> On Wed, Jan 27, 2010 at 12:45 AM, Mark Steele <msteele@beringmedia.com>
> wrote:
> > Hi folks,
> >
> > I've got a pacemaker cluster setup as follows:
> >
> > # crm configure
> > property no-quorum-policy=ignore
> > property stonith-enabled="true"
> >
> > primitive drbd_qs ocf:linbit:drbd params drbd_resource="r0" op monitor
> > interval="15s" op stop timeout00s op start timeout00s
> >
> >
> > ms ms_drbd_qs drbd_qs meta master-max="1" master-node-max="1"
> clone-max="2"
> > clone-node-max="1" notify="true"
> >
> > primitive qs_fs ocf:heartbeat:Filesystem params
> device="/dev/drbd/by-res/r0"
> > directory="/mnt/drbd1/" fstype="ext4"
> > options="barrier=0,noatime,nouser_xattr,data=writeback" op monitor
> > interval=30s OCF_CHECK_LEVEL=20 on-fail=standby meta failure-timeout="60"
> >
> >
> > primitive qs_ip ocf:heartbeat:IPaddr2 params ip="172.16.10.155"
> nic="eth0:0"
> > op monitor interval=60s on-fail=standby meta failure-timeout="60"
> > primitive qs_apache2 ocf:bering:apache2 op monitor interval=30s
> > on-fail=standby meta failure-timeout="60"
> >
> >
> > primitive qs_rabbitmq ocf:bering:rabbitmq op start timeout=120s op stop
> > timeout=300s op monitor interval=60s on-fail=standby meta
> > failure-timeout="60"
> > group queryserver qs_fs qs_ip qs_apache2 qs_rabbitmq
> >
> >
> >
> > primitive qs1-stonith stonith:external/ipmi params hostname=qs1
> > ipaddr=172.16.10.134 userid=root passwd=blah interface=lan op start
> > interval=0s timeout=20s requires=nothing op monitor interval=600s
> > timeout=20s requires=nothing
> >
> >
> > primitive qs2-stonith stonith:external/ipmi params hostname=qs2
> > ipaddr=172.16.10.133 userid=root passwd=blah interface=lan op start
> > interval=0s timeout=20s requires=nothing op monitor interval=600s
> > timeout=20s requires=nothing
> >
> >
> >
> > location l-st-qs1 qs1-stonith -inf: qs1
> > location l-st-qs2 qs2-stonith -inf: qs2
> > colocation queryserver_on_drbd inf: queryserver ms_drbd_qs:Master
> >
> > order queryserver_after_drbd inf: ms_drbd_qs:promote queryserver:start
> >
> >
> > order ip_after_fs inf: qs_fs qs_ip
> > order apache_after_ip inf: qs_ip qs_apache2
> > order rabbitmq_after_ip inf: qs_ip qs_rabbitmq
> >
> > verify
> > commit
> >
> > Under normal operations, this is what I expect the cluster to look like:
> >
> >
> >
> >
> > # crm status
> > ============
> > Last updated: Tue Jan 26 11:55:50 2010
> > Current DC: qs1 - partition with quorum
> >
> >
> > 2 Nodes configured, 2 expected votes
> > 4 Resources configured.
> > ============
> >
> > Online: [ qs1 qs2 ]
> >
> >
> >  Master/Slave Set: ms_drbd_qs
> >
> >      Masters: [ qs1 ]
> >      Slaves: [ qs2 ]
> >  qs1-stonith    (stonith:external/ipmi):        Started qs2
> >  qs2-stonith    (stonith:external/ipmi):        Started qs1
> >  Resource Group: queryserver
> >
> >
> >      qs_fs      (ocf::heartbeat:Filesystem):    Started qs1
> >      qs_ip      (ocf::heartbeat:IPaddr2):       Started qs1
> >      qs_apache2 (ocf::bering:apache2):  Started qs1
> >      qs_rabbitmq        (ocf::bering:rabbitmq): Started qs1
> >
> >
> >
> > If, however, a failure occurs, my configuration instructs pacemaker to put
> > the node on which the failure occurs into standby for 60 seconds:
> >
> >
> >
> > # killall -9 rabbit
> >
> > # crm status
> > ============
> > Last updated: Tue Jan 26 11:55:56 2010
> > Current DC: qs1 - partition with quorum
> >
> >
> > 2 Nodes configured, 2 expected votes
> > 4 Resources configured.
> > ============
> >
> > Node qs1: standby (on-fail)
> >
> >
> > Online: [ qs2 ]
> >
> >  Master/Slave Set: ms_drbd_qs
> >      Masters: [ qs1 ]
> >      Slaves: [ qs2 ]
> >  qs1-stonith    (stonith:external/ipmi):        Started qs2
> >  qs2-stonith    (stonith:external/ipmi):        Started qs1
> >
> >
> >  Resource Group: queryserver
> >      qs_fs      (ocf::heartbeat:Filesystem):    Started qs1
> >      qs_ip      (ocf::heartbeat:IPaddr2):       Started qs1
> >      qs_apache2 (ocf::bering:apache2):  Started qs1
> >
> >
> >      qs_rabbitmq        (ocf::bering:rabbitmq): Started qs1 FAILED
> >
> > Failed actions:
> >     qs_rabbitmq_monitor_60000 (node=qs1, call=32, rc=7, status=complete):
> > not running
>
> This looks like the first problem.
> There shouldn't be anything running on qs1 at this point.
>
> Can you attach a hb_report archive for the interval covered by this test?
> That will contain everything I need to diagnose the problem.
>
> > After the 60 second timeout, I would expect the node to come back online,
> > and DRBD replication to resume, alas this is what I get:
> >
> >
> >
> >
> > # crm status
> > ============
> > Last updated: Tue Jan 26 11:58:36 2010
> > Current DC: qs1 - partition with quorum
> > 2 Nodes configured, 2 expected votes
> > 4 Resources configured.
> > ============
> >
> > Online: [ qs1 qs2 ]
> >
> >
> >
> >  Master/Slave Set: ms_drbd_qs
> >      Masters: [ qs2 ]
> >      Stopped: [ drbd_qs:0 ]
> >  qs1-stonith    (stonith:external/ipmi):        Started qs2
> >  Resource Group: queryserver
> >      qs_fs      (ocf::heartbeat:Filesystem):    Started qs2
> >
> >
> >      qs_ip      (ocf::heartbeat:IPaddr2):       Started qs2
> >      qs_apache2 (ocf::bering:apache2):  Started qs2
> >      qs_rabbitmq        (ocf::bering:rabbitmq): Started qs2
> >
> > DRBD fail-over works properly under certain conditions (eg: if I stop
> > corosync, powercycle the box, or do a manual standby fail-over); however,
> > the case described above (one of the monitored services gets killed) leads
> > to the undesirable state of DRBD no longer replicating.
> >
> > Does anyone have some ideas on what would need to be changed in the
> > pacemaker/corosync configuration for the node to come back online and go
> > from stopped to slave state?
>
> You could be hitting a bug - I'll know more when you attach the hb_report.
>

_______________________________________________
Openais mailing list
Openais@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/openais
