List: ocfs2-users
Subject: Re: [Ocfs2-users] Shutting down one node caused all the other nodes to shut down as well.
From: Kristiansen Morten <Morten.Kristiansen () hn-ikt ! no>
Date: 2013-04-12 8:18:13
Message-ID: BFE7BFFCEF8B8A4DAEEED5DD1348F388127788AC03 () EXCH02 ! hn ! helsenord ! no
All nodes were powered down cleanly. The database was stopped and Grid Infrastructure was shut down manually, but the ocfs2 cluster was not stopped manually. Nothing happened when the nodes were shut down the first time. After the first planned reboot, nodes 1, 2, 3 and 7 seemed to be OK, but the sysadmins had to look into nodes 4, 5 and 6 due to disk problems. Grid Infrastructure was started on nodes 1, 2 and 3, but it wouldn't start on node 7. The DBA checked that the node had disks, but not that the disks were in the proper order, meaning that disk02 really was disk02, etc. The DBA thought a reboot would probably fix it, so he disabled Grid Infrastructure, did nothing to ocfs2, and rebooted the server. That seemed to reboot all the other nodes as well.
After the second reboot, which was uncontrolled, Grid Infrastructure was started one by one on nodes 1, 2 and 3. At 03:00 am the three nodes were running the database. At 04:06 am the cluster went down again, unplanned. Nobody knows why, but the sysadmin guys said they saw a kernel panic, and in /var/log/messages the "Kernel BUG at ...shran/BUILD/ocfs2-1.4.7..." message appeared again, as it had at 02:25 am.
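For what it's worth, the log lines around that BUG describe a consistency check in dlm_finalize_reco_handler: the node receiving a "recovery finalize" message expects it to come from the node it recorded as the new recovery master, and the kernel BUG()s when the sender doesn't match. The sketch below is a rough, hypothetical Python model of that check as the log messages describe it — the names and structure are invented for illustration, not taken from the actual C code in dlmrecovery.c:

```python
# Hypothetical model of the recovery-master check described by the
# dlm_finalize_reco_handler errors in /var/log/messages. Names are
# invented for illustration; the real kernel code is C and far more
# involved.

INVALID_NODE = 255  # "node 255" in the 04:06 log suggests an unset master


class RecoveryState:
    def __init__(self):
        self.dead_node = INVALID_NODE
        self.new_master = INVALID_NODE

    def begin_reco(self, dead_node, requesting_node):
        # Mirrors "dead_node previously set to 7, node 3 changing it to 7":
        # the requesting node claims the recovery-master role.
        self.dead_node = dead_node
        self.new_master = requesting_node

    def finalize_reco(self, sender):
        # Mirrors "ERROR: node 6 sent recovery finalize msg, but node 3
        # is supposed to be the new master" -- the point where the real
        # code hits BUG() at dlmrecovery.c:2840.
        if sender != self.new_master:
            raise RuntimeError(
                f"node {sender} sent recovery finalize msg, but node "
                f"{self.new_master} is supposed to be the new master, "
                f"dead={self.dead_node}")


state = RecoveryState()
state.begin_reco(dead_node=7, requesting_node=3)
try:
    state.finalize_reco(sender=6)  # mismatch, as in the 02:25 log
except RuntimeError as err:
    print(err)
```

In both crashes the finalize message arrived from a node other than the recorded master, which is consistent with Joel's suggestion that the nodes had lost track of who was doing the recovery.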
Then, when the nodes came up again, the database was started on nodes 4, 5 and 6. It wasn't possible to start CRS on nodes 1, 2, 3 and 7. The sysadmins did something with those nodes, and after another reboot of just those nodes CRS was able to start again. So the instances on nodes 1, 2 and 3 were started, but we didn't start anything on node 7 because we were afraid of shutting down the cluster again.
Got a mail from Sunil saying I had to "ping Oracle". So I guess I'll do that.
Morten K.
Tlf: +47 76 16 61 81 | Mob: +47 906 52 903
Quality - Safety - Respect
-----Original Message-----
From: Joel Becker [mailto:jlbec@ftp.linux.org.uk] On Behalf Of Joel Becker
Sent: 11 April 2013 21:04
To: Kristiansen Morten
Cc: ocfs2-users@oss.oracle.com
Subject: Re: [Ocfs2-users] Shutting down one node caused all the other nodes to shut down as well.
Did you power down nodes uncleanly? The message says that one node lost track of who was doing a particular recovery. If nodes are shut down cleanly, they should be communicating that information.
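In other words, every ocfs2 filesystem should be unmounted and o2cb taken offline before a node reboots. As a minimal illustration (a sketch only — the helper below is mine, not part of the ocfs2 tools), a pre-reboot sanity check could scan /proc/mounts for lingering ocfs2 mounts:

```python
def ocfs2_mounts(mounts_text):
    """Return ocfs2 mount points found in /proc/mounts-style text."""
    mounts = []
    for line in mounts_text.splitlines():
        fields = line.split()
        # /proc/mounts fields: device mountpoint fstype options dump pass
        if len(fields) >= 3 and fields[2] == "ocfs2":
            mounts.append(fields[1])
    return mounts


# Example with a fabricated /proc/mounts excerpt (device and mount
# point names invented):
sample = (
    "/dev/mapper/ocfs2-disk02 /u02/oradata ocfs2 rw,_netdev 0 0\n"
    "proc /proc proc rw 0 0\n"
)
print(ocfs2_mounts(sample))  # ['/u02/oradata']

# On a live node you would pass open('/proc/mounts').read() instead,
# and hold off the reboot while the returned list is non-empty.
```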
Joel
On Thu, Apr 11, 2013 at 12:10:22PM +0200, Kristiansen Morten wrote:
> I've had no response to my problem; is there anybody who can help me with this?
>
> Morten K.
>
> Tlf: +47 76 16 61 81 | Mob: +47 906 52 903
> Quality - Safety - Respect
>
>
>
> From: ocfs2-users-bounces@oss.oracle.com
> [mailto:ocfs2-users-bounces@oss.oracle.com] On Behalf Of Kristiansen
> Morten
> Sent: 21 March 2013 14:47
> To: ocfs2-users@oss.oracle.com
> Subject: [Ocfs2-users] Shutting down one node caused all the other nodes to shut down as well.
> Hi,
>
> We are running an 8-node cluster on RHEL 2.6.18-128 64-bit. Yesterday the server/SAN guys moved the ocfs2 disks to another SAN by mirroring and synchronizing the disks. When they rebooted the servers, one of the nodes, tos-dipsprod-07, wasn't able to start Oracle Grid Infrastructure; the voting disk was not found. Then we tried to reboot that node, causing all nodes to reboot, at around 02:25. When examining /var/log/messages I discovered a BUG message on one of the nodes that rebooted unexpectedly, tos-dipsprod-02. I've tried to google it, but I couldn't find any solution. Is this a well-known bug? Does anybody have a solution to this problem?
> Below is an extract of o2net and ocfs2 messages from the /var/log/messages file.
>
> /var/log/messages on tos-dipsprod-07:
> Mar 21 02:08:49 tos-dipsprod-07 kernel: o2net: connection to node tos-dipsprod-06 (num 3) at 192.168.7.105:7777 has been idle for 10.0 seconds, shutting it down.
> Mar 21 02:25:25 tos-dipsprod-07 kernel: o2net: connection to node tos-dipsprod-01 (num 0) at 192.168.7.100:7777 has been idle for 10.0 seconds, shutting it down.
> Mar 21 02:25:35 tos-dipsprod-07 kernel: o2net: connection to node tos-dipsprod-02 (num 1) at 192.168.7.101:7777 has been idle for 10.0 seconds, shutting it down.
> Mar 21 02:25:40 tos-dipsprod-07 kernel: o2net: connection to node tos-dipsprod-03 (num 2) at 192.168.7.102:7777 has been idle for 10.0 seconds, shutting it down.
> Mar 21 02:25:45 tos-dipsprod-07 kernel: o2net: connection to node tos-dipsprod-06 (num 3) at 192.168.7.105:7777 has been idle for 10.0 seconds, shutting it down.
> Mar 21 02:25:54 tos-dipsprod-07 kernel: o2net: connection to node tos-dipsprod-04 (num 5) at 192.168.7.103:7777 has been idle for 10.0 seconds, shutting it down.
> Mar 21 04:03:17 tos-dipsprod-07 kernel: o2net: connection to node tos-dipsprod-06 (num 3) at 192.168.7.105:7777 has been idle for 10.0 seconds, shutting it down.
> Mar 21 04:06:32 tos-dipsprod-07 kernel: o2net: connection to node tos-dipsprod-01 (num 0) at 192.168.7.100:7777 has been idle for 10.0 seconds, shutting it down.
> Mar 21 04:06:37 tos-dipsprod-07 kernel: o2net: connection to node tos-dipsprod-02 (num 1) at 192.168.7.101:7777 has been idle for 10.0 seconds, shutting it down.
> Mar 21 04:06:47 tos-dipsprod-07 kernel: o2net: connection to node tos-dipsprod-03 (num 2) at 192.168.7.102:7777 has been idle for 10.0 seconds, shutting it down.
> Mar 21 06:04:25 tos-dipsprod-07 kernel: o2net: connection to node tos-dipsprod-02 (num 1) at 192.168.7.101:7777 has been idle for 10.0 seconds, shutting it down.
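The o2net messages above all follow one pattern: the peer node's name, its node number, its IP and port, and the 10-second idle timeout after which o2net drops the connection. A small parser makes that structure explicit — purely illustrative, with the format assumed from this excerpt rather than from the o2net source:

```python
import re

# Matches o2net idle-timeout lines like the ones quoted above.
O2NET_IDLE = re.compile(
    r"o2net: connection to node (?P<node>\S+) \(num (?P<num>\d+)\) "
    r"at (?P<ip>[\d.]+):(?P<port>\d+) has been idle for "
    r"(?P<idle>[\d.]+) seconds")


def parse_idle(line):
    """Return the fields of an o2net idle-timeout message, or None."""
    m = O2NET_IDLE.search(line)
    return m.groupdict() if m else None


line = ("Mar 21 02:25:25 tos-dipsprod-07 kernel: o2net: connection to "
        "node tos-dipsprod-01 (num 0) at 192.168.7.100:7777 has been "
        "idle for 10.0 seconds, shutting it down.")
info = parse_idle(line)
print(info["node"], info["num"], info["idle"])  # tos-dipsprod-01 0 10.0
```

Grouping the parsed lines by timestamp shows node 7 losing contact with most of its peers in the minute around 02:25, which matches the cluster-wide reboot.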
> And here from tos-dipsprod-02:
> 10474-Mar 21 02:25:15 tos-dipsprod-02 kernel: (o2net,7452,5):dlm_begin_reco_handler:2730 992D008CD522447C8333FC34BD46F8CD: dead_node previously set to 7, node 3 changing it to 7
> 10646-Mar 21 02:25:25 tos-dipsprod-02 kernel: (o2net,7452,5):dlm_finalize_reco_handler:2839 ERROR: node 6 sent recovery finalize msg, but node 3 is supposed to be the new master, dead=7
> 10826:Mar 21 02:25:25 tos-dipsprod-02 kernel: Kernel BUG at ...shran/BUILD/ocfs2-1.4.7/fs/ocfs2/dlm/dlmrecovery.c:2840
> 10939-Mar 21 02:43:01 tos-dipsprod-02 syslogd 1.4.1: restart.
> 10995-Mar 21 02:43:02 tos-dipsprod-02 modprobe: FATAL: Module ocfs2_stackglue not found.
> --
> 17537-Mar 21 04:06:19 tos-dipsprod-02 kernel: (o2net,7472,1):dlm_begin_reco_handler:2730 992D008CD522447C8333FC34BD46F8CD: dead_node previously set to 6, node 6 changing it to 7
> 17709-Mar 21 04:06:29 tos-dipsprod-02 kernel: (o2net,7472,1):dlm_finalize_reco_handler:2839 ERROR: node 6 sent recovery finalize msg, but node 255 is supposed to be the new master, dead=7
> 17891:Mar 21 04:06:29 tos-dipsprod-02 kernel: Kernel BUG at ...shran/BUILD/ocfs2-1.4.7/fs/ocfs2/dlm/dlmrecovery.c:2840
> 18004-Mar 21 04:38:04 tos-dipsprod-02 syslogd 1.4.1: restart.
> 18060-Mar 21 04:41:33 tos-dipsprod-02 modprobe: FATAL: Module ocfs2_stackglue not found.
>
> Morten Kristiansen | Counsellor
> Helse Nord IKT | Department of Service Production
>
> Tlf: +47 76 16 61 81 | Mob: +47 906 52 903
> Office address: Amtmann Worsøes gate 63, 8012 Bodø, Norway
> Quality - Safety - Respect
>
>
>
>
> _______________________________________________
> Ocfs2-users mailing list
> Ocfs2-users@oss.oracle.com
> https://oss.oracle.com/mailman/listinfo/ocfs2-users
--
"Against stupidity the Gods themselves contend in vain."
- Friedrich von Schiller
http://www.jlbec.org/
jlbec@evilplan.org