'Re: [Linux-HA] resource keep restarting on standby node'

List:       linux-ha
Subject:    Re: [Linux-HA] resource keep restarting on standby node
From:       "Andreas Kurz" <andreas.kurz () gmail ! com>
Date:       2008-07-31 11:31:04
Message-ID: 904050d50807310431l157e003fq748b44e43ce44130 () mail ! gmail ! com

2008/7/31 jijun gao <gaojijun.bit@gmail.com>:
> Hi,
> I have two nodes. When I start the heartbeat service only on the standby
> node, the resources keep restarting, and I don't know what is happening.
> Below is the resource information:
> [root@node2 ~]# cibadmin -Q -o resources
>  <resources>
>   <group id="group_1">
>     <primitive class="ocf" id="IPaddr_192_168_10_211" provider="heartbeat" type="IPaddr">
>       <operations>
>         <op id="IPaddr_192_168_10_211_mon" interval="1s" name="monitor" timeout="1s"/>

That is a very short interval and timeout: the monitor has to finish
within one second, and it runs every second.

>       </operations>
>       <instance_attributes id="IPaddr_192_168_10_211_inst_attr">
>         <attributes>
>           <nvpair id="IPaddr_192_168_10_211_attr_0" name="ip" value="192.168.10.211"/>
>         </attributes>
>       </instance_attributes>
>     </primitive>
>     <primitive class="lsb" id="asterisk_2" provider="heartbeat" type="asterisk">
>       <operations>
>         <op id="asterisk_2_mon" interval="3s" name="monitor" timeout="2s"/>

The same applies here: this interval and timeout are also very short.

>       </operations>
>     </primitive>
>   </group>
>  </resources>
>
> And here is part of the system log:
> Jul 31 16:24:37 node2 last message repeated 9 times
> Jul 31 16:24:37 node2 setroubleshoot:      SELinux is preventing ifconfig
> (ifconfig_t) "read write" to socket:[136168] (initrc_t).      For complete
> SELinux messages. run sealert -l 0db84664-2bd3-4f8f-a10e-1e0641417484

Hmm, I'm not familiar with SELinux, but that looks suspicious to me.
I assume SELinux is disabled on node1?

> Jul 31 16:24:37 node2 lrmd: [29544]: WARN: asterisk_2:monitor process (PID
> 23374) timed out (try 1).  Killing with signal SIGTERM (15).

... and because the monitor operation timed out, the resource is
declared failed and restarted.
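
For illustration, the two monitor operations could be relaxed along
these lines. The ids and attribute names are copied from your CIB; the
interval and timeout values are only my assumption of a more forgiving
starting point, not something tested on your cluster:

```xml
<!-- Inside the <operations> element of IPaddr_192_168_10_211.
     Illustrative values: the timeout should sit comfortably above
     the time the monitor call really needs on a loaded node. -->
<op id="IPaddr_192_168_10_211_mon" interval="10s" name="monitor" timeout="20s"/>

<!-- Inside the <operations> element of asterisk_2 (illustrative values). -->
<op id="asterisk_2_mon" interval="30s" name="monitor" timeout="30s"/>
```

If the monitors still time out with values like these, the timeouts are
probably only a symptom, and the SELinux denial above would be the next
thing to look at.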

> Jul 31 16:24:37 node2 lrmd: [29544]: WARN: operation monitor[389] on
> ocf::IPaddr::IPaddr_192_168_10_211 for client 29547, its parameters:
> CRM_meta_interval=[1000] ip=[192.168.10.211]
> CRM_meta_id=[IPaddr_192_168_10_211_mon] CRM_meta_timeout=[1000]
> crm_feature_set=[2.0] CRM_meta_name=[monitor] : pid [23361] timed out
> Jul 31 16:24:37 node2 crmd: [29547]: ERROR: process_lrm_event: LRM operation
> IPaddr_192_168_10_211_monitor_1000 (389) Timed Out (timeout=1000ms)
> Jul 31 16:24:37 node2 tengine: [29549]: info: process_graph_event: Action
> IPaddr_192_168_10_211_monitor_1000 arrived after a completed transition
> Jul 31 16:24:37 node2 tengine: [29549]: info: update_abort_priority: Abort
> priority upgraded to 1000000
> Jul 31 16:24:37 node2 tengine: [29549]: WARN: update_failcount: Updating
> failcount for IPaddr_192_168_10_211 on cddecca4-8275-4913-83a7-8e7d3324cefc
> after failed monitor: rc=-2
> Jul 31 16:24:37 node2 crmd: [29547]: info: do_state_transition: State
> transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_IPC_MESSAGE
> origin=route_message ]
> Jul 31 16:24:37 node2 crmd: [29547]: info: do_state_transition: All 1
> cluster nodes are eligible to run resources.
> Jul 31 16:24:37 node2 pengine: [29550]: info: determine_online_status: Node
> node2 is online
> Jul 31 16:24:37 node2 pengine: [29550]: WARN: unpack_rsc_op: Processing
> failed op IPaddr_192_168_10_211_monitor_1000 on node2: Timed Out
> Jul 31 16:24:37 node2 pengine: [29550]: notice: group_print: Resource Group:
> group_1
> Jul 31 16:24:37 node2 pengine: [29550]: notice: native_print:
> IPaddr_192_168_10_211 (heartbeat::ocf:IPaddr):        Started node2 FAILED
> Jul 31 16:24:37 node2 pengine: [29550]: notice: native_print:
> asterisk_2    (lsb:asterisk): Started node2
> Jul 31 16:24:37 node2 pengine: [29550]: notice: NoRoleChange: Recover
> resource IPaddr_192_168_10_211    (node2)
>
> Jul 31 16:24:37 node2 pengine: [29550]: notice: StopRsc:   node2        Stop
> IPaddr_192_168_10_211
> Jul 31 16:24:37 node2 pengine: [29550]: notice: StartRsc:  node2
> Start IPaddr_192_168_10_211
> Jul 31 16:24:37 node2 pengine: [29550]: notice: RecurringOp: node2
> IPaddr_192_168_10_211_monitor_1000
> Jul 31 16:24:37 node2 pengine: [29550]: notice: NoRoleChange: Leave resource
> asterisk_2 (node2)
> Jul 31 16:24:37 node2 crmd: [29547]: info: do_state_transition: State
> transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS
> cause=C_IPC_MESSAGE origin=route_message ]
> Jul 31 16:24:37 node2 lrmd: [29544]: WARN: operation monitor[391] on
> lsb::asterisk::asterisk_2 for client 29547, its parameters:
> CRM_meta_interval=[3000] CRM_meta_id=[asterisk_2_mon]
> CRM_meta_timeout=[2000] crm_feature_set=[2.0] CRM_meta_name=[monitor] : pid
> [23374] timed out
> Jul 31 16:24:37 node2 pengine: [29550]: info: process_pe_message: Transition
> 83: PEngine Input stored in: /var/lib/heartbeat/pengine/pe-input-384.bz2
> Jul 31 16:24:37 node2 tengine: [29549]: info: unpack_graph: Unpacked
> transition 83: 11 actions in 11 synapses
> Jul 31 16:24:37 node2 crmd: [29547]: ERROR: process_lrm_event: LRM operation
> asterisk_2_monitor_3000 (391) Timed Out (timeout=2000ms)
> Jul 31 16:24:37 node2 tengine: [29549]: info: te_pseudo_action: Pseudo
> action 12 fired and confirmed
> Jul 31 16:24:37 node2 tengine: [29549]: info: send_rsc_command: Initiating
> action 8: asterisk_2_stop_0 on node2
> Jul 31 16:24:37 node2 tengine: [29549]: info: process_graph_event: Detected
> action asterisk_2_monitor_3000 from a different transition: 82 vs. 83
> Jul 31 16:24:37 node2 crmd: [29547]: info: do_lrm_rsc_op: Performing
> op=asterisk_2_stop_0 key=8:83:7f6166d0-2099-450e-81e0-3900a25ae8fd)
> Jul 31 16:24:38 node2 tengine: [29549]: info: update_abort_priority: Abort
> priority upgraded to 1000000
> Jul 31 16:24:38 node2 lrmd: [29544]: info: rsc:asterisk_2: stop
> Jul 31 16:24:38 node2 tengine: [29549]: info: update_abort_priority: Abort
> action 0 superceeded by 2
> Jul 31 16:24:38 node2 lrmd: [23380]: WARN: For LSB init script, no
> additional parameters are needed.
>
> Jul 31 16:24:38 node2 crmd: [29547]: info: process_lrm_event: LRM operation
> asterisk_2_monitor_3000 (call=391, rc=-2) Cancelled
> Jul 31 16:24:38 node2 tengine: [29549]: WARN: update_failcount: Updating
> failcount for asterisk_2 on cddecca4-8275-4913-83a7-8e7d3324cefc after
> failed monitor: rc=-2
> Jul 31 16:24:38 node2 lrmd: [29544]: info: RA output:
> (asterisk_2:stop:stdout) Shutting down asterisk:
> Jul 31 16:24:38 node2 lrmd: [29544]: info: RA output:
> (asterisk_2:stop:stdout) [
> Jul 31 16:24:38 node2 lrmd: [29544]: info: RA output:
> (asterisk_2:stop:stdout) 确定
> Jul 31 16:24:38 node2 lrmd: [29544]: info: RA output:
> (asterisk_2:stop:stdout) ]^M
> Jul 31 16:24:38 node2 lrmd: [29544]: info: RA output:
> (asterisk_2:stop:stdout)
> Jul 31 16:24:38 node2 crmd: [29547]: info: process_lrm_event: LRM operation
> asterisk_2_stop_0 (call=392, rc=0) complete
> Jul 31 16:24:38 node2 tengine: [29549]: info: match_graph_event: Action
> asterisk_2_stop_0 (8) confirmed on node2 (rc=0)
> Jul 31 16:24:38 node2 tengine: [29549]: info: run_graph:
> ====================================================
> Jul 31 16:24:38 node2 tengine: [29549]: notice: run_graph: Transition 83:
> (Complete=2, Pending=0, Fired=0, Skipped=9, Incomplete=0)
> Jul 31 16:24:38 node2 crmd: [29547]: info: do_state_transition: State
> transition S_TRANSITION_ENGINE -> S_POLICY_ENGINE [ input=I_PE_CALC
> cause=C_IPC_MESSAGE origin=route_message ]
> Jul 31 16:24:38 node2 crmd: [29547]: info: do_state_transition: All 1
> cluster nodes are eligible to run resources.
> Jul 31 16:24:38 node2 pengine: [29550]: info: determine_online_status: Node
> node2 is online
> Jul 31 16:24:38 node2 pengine: [29550]: WARN: unpack_rsc_op: Processing
> failed op IPaddr_192_168_10_211_monitor_1000 on node2: Timed Out
> Jul 31 16:24:38 node2 pengine: [29550]: notice: group_print: Resource Group:
> group_1
> Jul 31 16:24:38 node2 pengine: [29550]: notice: native_print:
> IPaddr_192_168_10_211 (heartbeat::ocf:IPaddr):        Started node2 FAILED
> Jul 31 16:24:38 node2 pengine: [29550]: notice: native_print:
> asterisk_2    (lsb:asterisk): Stopped
> Jul 31 16:24:38 node2 pengine: [29550]: notice: NoRoleChange: Recover
> resource IPaddr_192_168_10_211    (node2)
> Jul 31 16:24:38 node2 pengine: [29550]: notice: StopRsc:   node2        Stop
> IPaddr_192_168_10_211
>
> When I don't start heartbeat but start the asterisk service on its own,
> it works fine. When I start heartbeat on the primary node, it works fine
> too.
> Thanks for reading such a long message. Any ideas?
>
> _______________________________________________
> Linux-HA mailing list
> Linux-HA@lists.linux-ha.org
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
>
