'Re: [Linux-cluster] Failover root cause'


List:       redhat-linux-cluster
Subject:    Re: [Linux-cluster] Failover root cause
From:       Muhammad Panji <sumodirjo () gmail ! com>
Date:       2012-11-11 22:49:43
Message-ID: CANbzdH=dMhc0YsTrjztkf-Qb-v8fUSs1uTE9DOgBFmABeyaOQA () mail ! gmail ! com

Hi,
I plan to implement NTP so that the time on both servers is synchronized.
How can I find the cause of the failover? I have already graphed the sar
data, and there was no peak in usage at the time db1svr was fenced by
db2svr. Which file (and which specific message) should I look at to find
the root cause of this failover? Thank you.
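
Until NTP is in place, one way to compare the two /var/log/messages files
despite the clock difference might be to shift one node's timestamps by the
measured offset and merge both logs into a single timeline. Below is a rough
sketch only. The function names (parse_line, merge_logs) are made up for
illustration, the year has to be supplied because syslog lines do not carry
it, and both the size and the direction of the offset are assumptions that
would have to be measured on the real servers.

```python
from datetime import datetime, timedelta


def parse_line(line, year, offset=timedelta(0)):
    """Return (corrected_timestamp, text) for a syslog line, or None.

    Lines that do not start with a 'Mon DD HH:MM:SS' stamp (for example
    wrapped continuation lines) are reported as None.
    """
    try:
        stamp = datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
    except ValueError:
        return None
    return stamp + offset, line.rstrip("\n")


def merge_logs(lines_a, lines_b, year, offset_b):
    """Merge two logs into one timeline, sorted by corrected timestamp.

    offset_b is added to every timestamp from lines_b. It is an assumed
    value: the measured clock difference between the two nodes.
    """
    entries = []
    for lines, offset in ((lines_a, timedelta(0)), (lines_b, offset_b)):
        for line in lines:
            parsed = parse_line(line, year, offset)
            if parsed is not None:
                entries.append(parsed)
    # sorted() is stable, so entries with equal stamps keep their input order.
    return sorted(entries, key=lambda entry: entry[0])
```

For example, with both files opened as f2 (node2) and f1 (node1),
merge_logs(f2, f1, 2012, timedelta(minutes=5)) would treat node1's clock as
running 5 minutes behind node2's. If node1 is actually ahead, the offset
should be negative.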
Regards,

Panji

On Fri, Nov 9, 2012 at 10:40 AM, Yu <songyu555@gmail.com> wrote:
> Regardless of what root cause you find, the cluster requires the NTP service to
> ensure that all nodes have synchronized time. So you have to fix this 5-minute
> difference now.
> Regards
> Yu
> 
> On 09/11/2012, at 11:47, Muhammad Panji <sumodirjo@gmail.com> wrote:
> 
> > Dear All,
> > I have an Oracle cluster on RHEL 6.2 with 2 servers. Several days ago
> > the service failed over from node1 to node2. In /var/log/messages
> > on node2 I only see these messages:
> > 
> > ...
> > Oct 23 12:54:19 db2svr corosync[4142]:   [TOTEM ] A processor failed, forming new configuration.
> > Oct 23 12:54:21 db2svr corosync[4142]:   [QUORUM] Members[1]: 2
> > Oct 23 12:54:21 db2svr corosync[4142]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
> > Oct 23 12:54:21 db2svr kernel: dlm: closing connection to node 1
> > Oct 23 12:54:21 db2svr rgmanager[5327]: State change: clu1 DOWN
> > Oct 23 12:54:21 db2svr fenced[4193]: fencing node clu1
> > ...
> > 
> > Googling the message "[TOTEM ] A processor failed, forming new
> > configuration.", I learned that it means node2 could not see node1 and
> > then fenced node1. On node1 I get these messages:
> > 
> > Oct 23 12:50:45 db1svr rgmanager[75890]: [script] Executing /etc/init.d/httpd status
> > Oct 23 12:56:01 db1svr kernel: imklog 4.6.2, log source = /proc/kmsg started.
> > Oct 23 12:56:01 db1svr rsyslogd: [origin software="rsyslogd" swVersion="4.6.2" x-pid="3792" x-info="http://www.rsyslog.com"] (re)start
> > Oct 23 12:56:01 db1svr kernel: Initializing cgroup subsys cpuset
> > Oct 23 12:56:01 db1svr kernel: Initializing cgroup subsys cpu
> > Oct 23 12:56:01 db1svr kernel: Linux version 2.6.32-220.el6.x86_64 (mockbuild@x86-004.build.bos.redhat.com) (gcc version 4.4.5 20110214 (Red Hat 4.4.5-6) (GCC) ) #1 SMP Wed Nov 9 08:03:13 EST 2011
> > 
> > At 12:50 rgmanager was still checking the service, and then the node was
> > rebooted. What makes it worse is that the date and time on the two servers
> > are different, so I can't compare the logs directly. The current time
> > difference between the servers is around 5 minutes.
> > 
> > Where should I look for the cause of this failover? I plan to graph the
> > sar data today to see whether there was a bottleneck on CPU etc. that kept
> > node1 from sending its status to node2. But if there was no bottleneck on
> > CPU, RAM, etc., where should I look for the root cause of the failover?
> > Thank you.
> > Regards,
> > 
> > --
> > Muhammad Panji
> > http://www.panji.web.id
> > http://www.kurungsiku.com
> > 
> > --
> > Linux-cluster mailing list
> > Linux-cluster@redhat.com
> > https://www.redhat.com/mailman/listinfo/linux-cluster

-- 
Muhammad Panji
http://www.panji.web.id
http://www.kurungsiku.com

-- 
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster
