List: ocfs2-users
Subject: Re: [Ocfs2-users] Shutting down one node caused all the other nodes to shut down as well.
From: Kristiansen Morten <Morten.Kristiansen () hn-ikt ! no>
Date: 2013-04-12 8:18:13
Message-ID: BFE7BFFCEF8B8A4DAEEED5DD1348F388127788AC03 () EXCH02 ! hn ! helsenord ! no
All nodes were powered down cleanly. The database was stopped and Grid Infrastructure was shut down manually, but the ocfs2 cluster was not stopped manually. Nothing happened when the nodes were shut down the first time. After the first planned reboot, nodes 1, 2, 3 and 7 seemed to be OK, but the sysadmins had to look into nodes 4, 5 and 6 due to disk problems. Grid Infrastructure was started on nodes 1, 2 and 3, but it wouldn't start on node 7. The DBA checked that the node had disks, but not that the disks were in the proper order, meaning that disk02 really was disk02, etc. The DBA thought a reboot would probably fix it, so he disabled Grid Infrastructure, did nothing to ocfs2, and rebooted the server. That seemed to reboot all the other nodes as well.
After the second reboot, which was uncontrolled, Grid Infrastructure was started one by one on nodes 1, 2 and 3. At 03:00 am the three nodes were running the database. At 04:06 am the cluster went down again, unplanned. Nobody knows why, but the sysadmin guys said they saw a kernel panic, and in /var/log/messages the "Kernel BUG at ...shran/BUILD/ocfs2-1.4.7..." message appeared again, as it had at 02:25 am.
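For what it's worth, the log lines around that BUG describe a consistency check in dlm_finalize_reco_handler: the node receiving a "recovery finalize" message expects it to come from the node it recorded as the new recovery master, and the kernel BUG()s when the sender doesn't match. The sketch below is a rough, hypothetical Python model of that check as the log messages describe it — the names and structure are invented for illustration, not taken from the actual C code in dlmrecovery.c:

```python
# Hypothetical model of the recovery-master check described by the
# dlm_finalize_reco_handler errors in /var/log/messages. Names are
# invented for illustration; the real kernel code is C and far more
# involved.

INVALID_NODE = 255  # "node 255" in the 04:06 log suggests an unset master


class RecoveryState:
    def __init__(self):
        self.dead_node = INVALID_NODE
        self.new_master = INVALID_NODE

    def begin_reco(self, dead_node, requesting_node):
        # Mirrors "dead_node previously set to 7, node 3 changing it to 7":
        # the requesting node claims the recovery-master role.
        self.dead_node = dead_node
        self.new_master = requesting_node

    def finalize_reco(self, sender):
        # Mirrors "ERROR: node 6 sent recovery finalize msg, but node 3
        # is supposed to be the new master" -- the point where the real
        # code hits BUG() at dlmrecovery.c:2840.
        if sender != self.new_master:
            raise RuntimeError(
                f"node {sender} sent recovery finalize msg, but node "
                f"{self.new_master} is supposed to be the new master, "
                f"dead={self.dead_node}")


state = RecoveryState()
state.begin_reco(dead_node=7, requesting_node=3)
try:
    state.finalize_reco(sender=6)  # mismatch, as in the 02:25 log
except RuntimeError as err:
    print(err)
```

In both crashes the finalize message arrived from a node other than the recorded master, which is consistent with Joel's suggestion that the nodes had lost track of who was doing the recovery.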
Then, when the nodes came up again, the database was started on nodes 4, 5 and 6. It wasn't possible to start CRS on nodes 1, 2, 3 and 7. The sysadmins did something with those nodes, and after another reboot of just those nodes CRS was able to start again. So the instances on nodes 1, 2 and 3 were started, but we didn't start anything on node 7 because we were afraid of shutting down the cluster again.
Got a mail from Sunil saying I had to "ping Oracle". So I guess I'll do that.
Morten K.
Tlf: +47 76 16 61 81 | Mob: +47 906 52 903
Quality - Safety - Respect
-----Original Message-----
From: Joel Becker [mailto:jlbec@ftp.linux.org.uk] On Behalf Of Joel Becker
Sent: 11 April 2013 21:04
To: Kristiansen Morten
Cc: ocfs2-users@oss.oracle.com
Subject: Re: [Ocfs2-users] Shutting down one node caused all the other nodes to shut down as well.
Did you power down nodes uncleanly? The message says that one node lost track of who was doing a particular recovery. If nodes are shut down cleanly, they should be communicating that information.
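In other words, every ocfs2 filesystem should be unmounted and o2cb taken offline before a node reboots. As a minimal illustration (a sketch only — the helper below is mine, not part of the ocfs2 tools), a pre-reboot sanity check could scan /proc/mounts for lingering ocfs2 mounts:

```python
def ocfs2_mounts(mounts_text):
    """Return ocfs2 mount points found in /proc/mounts-style text."""
    mounts = []
    for line in mounts_text.splitlines():
        fields = line.split()
        # /proc/mounts fields: device mountpoint fstype options dump pass
        if len(fields) >= 3 and fields[2] == "ocfs2":
            mounts.append(fields[1])
    return mounts


# Example with a fabricated /proc/mounts excerpt (device and mount
# point names invented):
sample = (
    "/dev/mapper/ocfs2-disk02 /u02/oradata ocfs2 rw,_netdev 0 0\n"
    "proc /proc proc rw 0 0\n"
)
print(ocfs2_mounts(sample))  # ['/u02/oradata']

# On a live node you would pass open('/proc/mounts').read() instead,
# and hold off the reboot while the returned list is non-empty.
```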
Joel
On Thu, Apr 11, 2013 at 12:10:22PM +0200, Kristiansen Morten wrote:
> I've had no response to my problem; is there anybody who can help me with this?
>
> Morten K.
>
> Tlf: +47 76 16 61 81 | Mob: +47 906 52 903
> Quality - Safety - Respect
>
>
>
> From: ocfs2-users-bounces@oss.oracle.com
> [mailto:ocfs2-users-bounces@oss.oracle.com] On Behalf Of Kristiansen
> Morten
> Sent: 21 March 2013 14:47
> To: ocfs2-users@oss.oracle.com
> Subject: [Ocfs2-users] Shutting down one node caused all the other nodes to shut down as well.
> Hi,
>
> We are running an 8-node cluster on RHEL 2.6.18-128 64-bit. Yesterday the server/SAN guys moved the ocfs2 disks to another SAN by mirroring and synchronizing the disks. When they rebooted the servers, one of the nodes, tos-dipsprod-07, wasn't able to start Oracle Grid Infrastructure; the voting disk was not found. Then we tried to reboot that node, causing all nodes to reboot, at around 02:25. When examining /var/log/messages I discovered a BUG message on one of the nodes that rebooted unexpectedly, tos-dipsprod-02. I've tried to google it, but I couldn't find any solution. Is this a well-known bug? Does anybody have a solution to this problem?
> Below is an extract of o2net and ocfs2 messages from the /var/log/messages file.
>
> /var/log/messages on tos-dipsprod-07:
> Mar 21 02:08:49 tos-dipsprod-07 kernel: o2net: connection to node tos-dipsprod-06 (num 3) at 192.168.7.105:7777 has been idle for 10.0 seconds, shutting it down.
> Mar 21 02:25:25 tos-dipsprod-07 kernel: o2net: connection to node tos-dipsprod-01 (num 0) at 192.168.7.100:7777 has been idle for 10.0 seconds, shutting it down.
> Mar 21 02:25:35 tos-dipsprod-07 kernel: o2net: connection to node tos-dipsprod-02 (num 1) at 192.168.7.101:7777 has been idle for 10.0 seconds, shutting it down.
> Mar 21 02:25:40 tos-dipsprod-07 kernel: o2net: connection to node tos-dipsprod-03 (num 2) at 192.168.7.102:7777 has been idle for 10.0 seconds, shutting it down.
> Mar 21 02:25:45 tos-dipsprod-07 kernel: o2net: connection to node tos-dipsprod-06 (num 3) at 192.168.7.105:7777 has been idle for 10.0 seconds, shutting it down.
> Mar 21 02:25:54 tos-dipsprod-07 kernel: o2net: connection to node tos-dipsprod-04 (num 5) at 192.168.7.103:7777 has been idle for 10.0 seconds, shutting it down.
> Mar 21 04:03:17 tos-dipsprod-07 kernel: o2net: connection to node tos-dipsprod-06 (num 3) at 192.168.7.105:7777 has been idle for 10.0 seconds, shutting it down.
> Mar 21 04:06:32 tos-dipsprod-07 kernel: o2net: connection to node tos-dipsprod-01 (num 0) at 192.168.7.100:7777 has been idle for 10.0 seconds, shutting it down.
> Mar 21 04:06:37 tos-dipsprod-07 kernel: o2net: connection to node tos-dipsprod-02 (num 1) at 192.168.7.101:7777 has been idle for 10.0 seconds, shutting it down.
> Mar 21 04:06:47 tos-dipsprod-07 kernel: o2net: connection to node tos-dipsprod-03 (num 2) at 192.168.7.102:7777 has been idle for 10.0 seconds, shutting it down.
> Mar 21 06:04:25 tos-dipsprod-07 kernel: o2net: connection to node tos-dipsprod-02 (num 1) at 192.168.7.101:7777 has been idle for 10.0 seconds, shutting it down.
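The o2net messages above all follow one pattern: the peer node's name, its node number, its IP and port, and the 10-second idle timeout after which o2net drops the connection. A small parser makes that structure explicit — purely illustrative, with the format assumed from this excerpt rather than from the o2net source:

```python
import re

# Matches o2net idle-timeout lines like the ones quoted above.
O2NET_IDLE = re.compile(
    r"o2net: connection to node (?P<node>\S+) \(num (?P<num>\d+)\) "
    r"at (?P<ip>[\d.]+):(?P<port>\d+) has been idle for "
    r"(?P<idle>[\d.]+) seconds")


def parse_idle(line):
    """Return the fields of an o2net idle-timeout message, or None."""
    m = O2NET_IDLE.search(line)
    return m.groupdict() if m else None


line = ("Mar 21 02:25:25 tos-dipsprod-07 kernel: o2net: connection to "
        "node tos-dipsprod-01 (num 0) at 192.168.7.100:7777 has been "
        "idle for 10.0 seconds, shutting it down.")
info = parse_idle(line)
print(info["node"], info["num"], info["idle"])  # tos-dipsprod-01 0 10.0
```

Grouping the parsed lines by timestamp shows node 7 losing contact with most of its peers in the minute around 02:25, which matches the cluster-wide reboot.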
> And here from tos-dipsprod-02:
> 10474-Mar 21 02:25:15 tos-dipsprod-02 kernel: (o2net,7452,5):dlm_begin_reco_handler:2730 992D008CD522447C8333FC34BD46F8CD: dead_node previously set to 7, node 3 changing it to 7
> 10646-Mar 21 02:25:25 tos-dipsprod-02 kernel: (o2net,7452,5):dlm_finalize_reco_handler:2839 ERROR: node 6 sent recovery finalize msg, but node 3 is supposed to be the new master, dead=7
> 10826:Mar 21 02:25:25 tos-dipsprod-02 kernel: Kernel BUG at ...shran/BUILD/ocfs2-1.4.7/fs/ocfs2/dlm/dlmrecovery.c:2840
> 10939-Mar 21 02:43:01 tos-dipsprod-02 syslogd 1.4.1: restart.
> 10995-Mar 21 02:43:02 tos-dipsprod-02 modprobe: FATAL: Module ocfs2_stackglue not found.
> --
> 17537-Mar 21 04:06:19 tos-dipsprod-02 kernel: (o2net,7472,1):dlm_begin_reco_handler:2730 992D008CD522447C8333FC34BD46F8CD: dead_node previously set to 6, node 6 changing it to 7
> 17709-Mar 21 04:06:29 tos-dipsprod-02 kernel: (o2net,7472,1):dlm_finalize_reco_handler:2839 ERROR: node 6 sent recovery finalize msg, but node 255 is supposed to be the new master, dead=7
> 17891:Mar 21 04:06:29 tos-dipsprod-02 kernel: Kernel BUG at ...shran/BUILD/ocfs2-1.4.7/fs/ocfs2/dlm/dlmrecovery.c:2840
> 18004-Mar 21 04:38:04 tos-dipsprod-02 syslogd 1.4.1: restart.
> 18060-Mar 21 04:41:33 tos-dipsprod-02 modprobe: FATAL: Module ocfs2_stackglue not found.
>
> Morten Kristiansen | Counsellor
> Helse Nord IKT | Department of Service Production
>
> Tlf: +47 76 16 61 81 | Mob: +47 906 52 903
> Office address: Amtmann Worsøes gate 63, 8012 Bodø, Norway
> Quality - Safety - Respect
>
>
>
>
> _______________________________________________
> Ocfs2-users mailing list
> Ocfs2-users@oss.oracle.com
> https://oss.oracle.com/mailman/listinfo/ocfs2-users
--
"Against stupidity the Gods themselves contend in vain."
- Friedrich von Schiller
http://www.jlbec.org/
jlbec@evilplan.org