
List:       ssic-linux-users
Subject:    [SSI-users] Cluster Stability, need help..
From:       "J. Miribel" <j.miribel () nitroserv ! com>
Date:       2006-12-15 12:14:51
Message-ID: 458291BB.8000208 () nitroserv ! com

Hi everyone,

I've been running OpenSSI for a few weeks now and I am experiencing a
serious stability issue: the cluster stays up for a few days (3 to 5
days at most), and then a random ssh connection attempt makes it crash.

On my ssh client, I get a "stdin: is not a tty" error message (after
about 2 minutes of waiting with nothing happening).
In the logs I get these lines:
Dec 15 11:57:41 maitre1 sshd[127801]: error: openpty: Object is remote
Dec 15 11:57:41 maitre1 sshd[127801]: error: session_pty_req: session 0 alloc failed

Of course I get no terminal when I get those errors.

Then the MOSIX load climbs to 10000 or so, and network connectivity
(meaning the ability to ping the node) to the non-init nodes is lost.
A few seconds later, network connectivity to the init node is lost as
well, but only briefly, let's say for 10 seconds.

When I can reach the init node again using ssh, all non-init nodes are
off, and I see these errors in syslog:
Dec 15 11:55:06 maitre1 kernel: spurious 8259A interrupt: IRQ7.
Dec 15 11:55:33 maitre1 kernel: Node 7 has gone down!!!
Dec 15 11:55:33 maitre1 kernel: Node 19 has gone down!!!
Dec 15 11:55:33 maitre1 kernel: Node 20 has gone down!!!
Dec 15 11:55:33 maitre1 kernel: Node 22 has gone down!!!
Dec 15 11:55:33 maitre1 kernel: Node 24 has gone down!!!
Dec 15 11:55:33 maitre1 kernel: nm_received_imalive:unexpected packet from node 22
Dec 15 11:55:33 maitre1 kernel: nm_received_imalive:unexpected packet from node 19
Dec 15 11:55:33 maitre1 kernel: nm_received_imalive:unexpected packet from node 20
Dec 15 11:55:33 maitre1 kernel: nm_received_imalive:unexpected packet from node 7
etc etc...
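
A minimal sketch, assuming the syslog line format shown above, of how
the "has gone down" lines could be grouped by timestamp to check whether
all nodes drop at the same instant. The function name and the sample
list are hypothetical and only for illustration; they are not part of
OpenSSI.

```python
import re
from collections import defaultdict

# Matches lines such as:
#   Dec 15 11:55:33 maitre1 kernel: Node 7 has gone down!!!
NODE_DOWN = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+\S+\s+kernel:\s+"
    r"Node\s+(?P<node>\d+)\s+has gone down"
)


def nodes_down_by_time(lines):
    """Group node numbers by the syslog timestamp at which they went down."""
    events = defaultdict(list)
    for line in lines:
        match = NODE_DOWN.match(line)
        if match:
            events[match.group("stamp")].append(int(match.group("node")))
    return dict(events)


# Sample lines copied from the syslog excerpt above.
sample = [
    "Dec 15 11:55:06 maitre1 kernel: spurious 8259A interrupt: IRQ7.",
    "Dec 15 11:55:33 maitre1 kernel: Node 7 has gone down!!!",
    "Dec 15 11:55:33 maitre1 kernel: Node 19 has gone down!!!",
    "Dec 15 11:55:33 maitre1 kernel: Node 20 has gone down!!!",
    "Dec 15 11:55:33 maitre1 kernel: Node 22 has gone down!!!",
    "Dec 15 11:55:33 maitre1 kernel: Node 24 has gone down!!!",
]

for stamp, nodes in nodes_down_by_time(sample).items():
    print(stamp, "->", len(nodes), "nodes down:", nodes)
```

To run it against a real log, pass an open file object instead of the
sample list (the log path depends on the syslog configuration).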

I have no root failover configured. Is what I'm experiencing a root
failure?

My pxelinux.cfg/default file:
append initrd=initrd ro noapictimer acpi=off noapic nolapic

My grub config:
/boot/vmlinuz-2.6.10-ssi-686-smp root=/dev/sda1 ro noapic nolapic

I'm running OpenSSI 1.9.1 on Debian.

The cluster is made up of 2 types of machines: dual Opteron and
dual-core P4. There are 30 nodes in the cluster.

Has anyone ever experienced these random crashes? Is there a known fix?
Any help would be really appreciated.

Best regards,
Julian
