
List:       ssic-linux-devel
Subject:    [SSI-devel] [ ssic-linux-Bugs-926808 ] node kernel panic
From:       "SourceForge.net" <noreply () sourceforge ! net>
Date:       2004-08-27 23:35:58
Message-ID: E1C0qGE-0006XC-00 () sc8-sf-web4 ! sourceforge ! net

Bugs item #926808, was opened at 2004-03-31 17:09
Message generated for change (Comment added) made by lramirez
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=405834&aid=926808&group_id=32541

Category: Process Management
Group: None
>Status: Closed
Resolution: None
Priority: 5
Submitted By: jsu2 (jsu2)
Assigned to: Laura Ramirez (lramirez)
Summary: node kernel panic

Initial Comment:
The three nodes on my four-node Debian cluster have
locked up again.  The init node seems to be up, but not
usable.  Included is the bt output from kdb of the
nodes.  Each node has similar output.  The only
difference that I could tell was the process id.

----------------------------------------------------------------------

>Comment By: Laura Ramirez (lramirez)
Date: 2004-08-27 23:35

Message:
Logged In: YES 
user_id=300036

Aaron Krowne has confirmed that the cluster has not crashed
recently.
I am closing this bug.  Johnny Healey, who is now in charge
of the cluster, can reopen it if the panics recur.

----------------------------------------------------------------------

Comment By: jsu2 (jsu2)
Date: 2004-04-26 20:08

Message:
Logged In: YES 
user_id=1010539

The cluster didn't crash per se this time, but it looks like
the memory leak caused it to kill off a bunch of processes
leaving the system useless.  I turned off netdump this time
which may be what kept it from actually crashing like
before; I can't verify this, though.  The screenshots are
from the initnode.  The top of the first picture is the
output of "call show_free_areas".

http://br.endernet.org/~akrowne/metacluster_dump-4/

----------------------------------------------------------------------

Comment By: Nobody/Anonymous (nobody)
Date: 2004-04-22 16:34

Message:
Logged In: NO 

More node crashes... output from one of the nodes:

http://br.endernet.org/~akrowne/metacluster_dump-3/web/img_0003.jpg
http://br.endernet.org/~akrowne/metacluster_dump-3/web/img_0004.jpg
http://br.endernet.org/~akrowne/metacluster_dump-3/web/img_0005.jpg
http://br.endernet.org/~akrowne/metacluster_dump-3/web/img_0006.jpg

Screenshots of initnode:

http://br.endernet.org/~akrowne/metacluster_dump-3/web/img_0007.jpg
http://br.endernet.org/~akrowne/metacluster_dump-3/web/img_0008.jpg
http://br.endernet.org/~akrowne/metacluster_dump-3/web/img_0009.jpg

----------------------------------------------------------------------

Comment By: jsu2 (jsu2)
Date: 2004-04-16 15:42

Message:
Logged In: YES 
user_id=1010539

Nodes went down again.  Could not get any response from the
node consoles, but I was able to get a bt output from the
initnode.

http://br.endernet.org/~akrowne/metacluster_dump-2/original/img_0003.jpg

----------------------------------------------------------------------

Comment By: jsu2 (jsu2)
Date: 2004-04-14 18:01

Message:
Logged In: YES 
user_id=1010539

Okay, I got another kernel panic about two days after I
rebooted with the latest kernel patches in CVS.  Here's what
netdump was able to catch:

icssvr_daemon: skipping hdl 0xf6358800 from down node 2 (transid 0xf7f06000)
Node 4 has gone down!!!
vproc_slave: unable to spawn VPROC slave daemon
Kernel panic: Timed out waiting for nodedown to complete!
Instruction(i) breakpoint #0 at 0xc0124900 (adjusted)
0xc0124900 panic_hook:         int3   

Entering kdb (current=0xf5f74000, pid 65904) on processor 0 due to Breakpoint @ 0xc0124900
[0]kdb> 

This is the initnode output.  Some of it scrolled beyond the
top of the screen:
http://br.endernet.org/~akrowne/metacluster_dump/original/img_0036.jpg

This is also the initnode output.  I tried running bt again,
but it only output a few lines before locking up:
http://br.endernet.org/~akrowne/metacluster_dump/original/img_0038.jpg


The kdb output from each of the three nodes:
http://br.endernet.org/~akrowne/metacluster_dump/original/img_0039.jpg
http://br.endernet.org/~akrowne/metacluster_dump/original/img_0040.jpg
http://br.endernet.org/~akrowne/metacluster_dump/original/img_0041.jpg


----------------------------------------------------------------------

Comment By: jsu2 (jsu2)
Date: 2004-04-09 15:32

Message:
Logged In: YES 
user_id=1010539

Initnode lockup on 2004-04-09.

----------------------------------------------------------------------

Comment By: jsu2 (jsu2)
Date: 2004-04-05 15:15

Message:
Logged In: YES 
user_id=1010539

Okay, I left the initnode up this weekend (without the other
nodes), and now I get out-of-memory errors.  The system
seems to be up, but all the processes seem to have been
killed.  So, the box can be pinged, but can't be accessed
remotely.

I have 2.5GB of RAM and about 1GB of swap.  When I log into
the console, the output of "free" shows that only half of the
total available RAM is in use.  So, I'm pretty sure the system
isn't out of memory.

----------------------------------------------------------------------



_______________________________________________
ssic-linux-devel mailing list
ssic-linux-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/ssic-linux-devel