List: lustre-discuss
Subject: [Lustre-discuss] OSS/MDS server crashes after recovery.
From: Roy.Dragseth () cc ! uit ! no (Roy Dragseth)
Date: 2006-12-29 2:55:20
Message-ID: 200612291055.14328.Roy.Dragseth () cc ! uit ! no
We had a disk failure on one of our RAID arrays and had to run fsck on all
OST/MDS devices. The fsck process made a lot of repairs, but eventually it
succeeded in cleaning all devices. When I try to start Lustre again, everything
seems fine until I try to connect the clients; then one of the two combined
MDS/OSS servers crashes as soon as the recovery grace period is over. If I
start with --abort_recovery, the server hangs almost immediately.
It dumps a lot of debug log files into the /tmp directory, but I cannot make any
sense of them. The syslog gets filled with entries like these:
Dec 29 10:52:36 lustre-11-1 kernel: LustreError: 6943:0:(client.c:554:ptlrpc_check_reply()) previously skipped 1 similar messages
Dec 29 10:52:36 lustre-11-1 kernel: Lustre: OSC_lustre-11-0.local_ost8_home-mds: Connection restored to service ost8 using nid 0@lo.
Dec 29 10:52:36 lustre-11-1 kernel: Lustre: previously skipped 1 similar messages
Dec 29 10:52:36 lustre-11-1 kernel: LustreError: 7113:0:(lov_obd.c:837:lov_clear_orphans()) error in orphan recovery on OST idx 0/4: rc = -16
Dec 29 10:52:36 lustre-11-1 kernel: LustreError: 7113:0:(lov_obd.c:837:lov_clear_orphans()) previously skipped 1 similar messages
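The "rc = -16" in the lov_clear_orphans() message is a negated Linux errno value. Below is a minimal sketch, assuming the syslog line format shown above, that pulls the source location and return code out of such a line and translates the code with Python's standard errno module. The regex and the helper name decode_lustre_rc are hypothetical illustrations, not part of Lustre's tooling.

```python
import errno
import os
import re

# Hypothetical pattern (an assumption, not a Lustre API) for lines like
# "LustreError: 7113:0:(lov_obd.c:837:lov_clear_orphans()) ... rc = -16"
LUSTRE_LINE = re.compile(
    r"\((?P<file>[\w.]+):(?P<line>\d+):(?P<func>\w+)\(\)\).*?rc = (?P<rc>-?\d+)"
)


def decode_lustre_rc(line):
    """Return (file, line, function, errno name, message), or None if no rc."""
    m = LUSTRE_LINE.search(line)
    if m is None:
        return None
    rc = int(m.group("rc"))
    code = -rc  # failures are reported as negated errno values
    name = errno.errorcode.get(code, "UNKNOWN")
    return (m.group("file"), int(m.group("line")), m.group("func"),
            name, os.strerror(code))


if __name__ == "__main__":
    sample = ("Dec 29 10:52:36 lustre-11-1 kernel: LustreError: 7113:0:"
              "(lov_obd.c:837:lov_clear_orphans()) error in orphan recovery "
              "on OST idx 0/4: rc = -16")
    print(decode_lustre_rc(sample))
```

On Linux, -16 decodes to EBUSY ("Device or resource busy"): the orphan cleanup found OST idx 0 busy. The log line alone does not say why it was busy.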
This crash keeps me from moving on to running lfsck to fix the remaining
problems.
System info:
RH EL4 w/2.6.9-34.EL_lustre1.4.6.4smp
lustre 1.4.6.4
Lustre setup:
Two combined MDS/OSS servers with dual FC connections to two SATA RAID arrays,
serving a home area and a scratch area.
Any help is greatly appreciated. My last resort is to reformat and restore
everything from backup.
Regards,
r.