

List:       lustre-discuss
Subject:    [Lustre-discuss] stuck OSS node
From:       adrian () blinkenlights ! ch (Adrian Ulrich)
Date:       2011-08-05 9:01:47
Message-ID: 20110805110147.5a282884 () echelon ! ethz ! ch

Hi Craig,

> Has anyone seen anything like this?

Yes: we had a similar problem a couple of times:


First, try to umount all OSTs on the affected OSS.

Some OSTs will (most likely) fail to umount: umount gets stuck because of a hung
ll_ost_io_?? thread. Note the 'broken' OSTs and, once the 'good' OSTs have finished
umounting, kill the OSS (echo b > /proc/sysrq-trigger).
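
A rough sketch of how the umount step could be scripted, in case it helps. This is not something from our setup: MOUNTS_FILE, WAIT_SECS and UMOUNT_CMD are names made up for the example, and it assumes that the OSTs show up in /proc/mounts with filesystem type 'lustre' on a /dev/ block device. Each umount is started in the background and polled, because a umount that is blocked in the kernel cannot be interrupted.

```shell
# Illustrative settings only; none of these are Lustre options.
MOUNTS_FILE="${MOUNTS_FILE:-/proc/mounts}"
WAIT_SECS="${WAIT_SECS:-300}"
UMOUNT_CMD="${UMOUNT_CMD:-umount}"

# Print the mount point of every local target of type "lustre"
# that is backed by a /dev/ block device (assumption: these are the OSTs).
list_ost_mounts() {
    awk '$3 == "lustre" && $1 ~ /^\/dev\// { print $2 }' "$MOUNTS_FILE"
}

# Run umount in the background and wait at most WAIT_SECS for it.
# Returns 0 if umount finished in time and succeeded, non-zero otherwise.
try_umount() {
    tu_status=$(mktemp) || return 1
    # The subshell records umount's exit code once it finishes.
    ( "$UMOUNT_CMD" "$1"; echo $? > "$tu_status" ) &
    tu_waited=0
    while [ ! -s "$tu_status" ]; do
        if [ "$tu_waited" -ge "$WAIT_SECS" ]; then
            return 1    # still hanging: treat this OST as 'broken'
        fi
        sleep 1
        tu_waited=$((tu_waited + 1))
    done
    tu_rc=$(cat "$tu_status")
    rm -f "$tu_status"
    return "$tu_rc"
}

# Try every OST in turn and print "good <mountpoint>" or "broken <mountpoint>".
classify_osts() {
    for co_mnt in $(list_ost_mounts); do
        if try_umount "$co_mnt"; then
            echo "good $co_mnt"
        else
            echo "broken $co_mnt"
        fi
    done
}
```

Running classify_osts as root on the OSS would give you the list of 'broken' OSTs to write down before the reboot. A hung umount is left behind in the background on purpose, since the reboot clears it anyway.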

Afterwards, run a simple 'e2fsck -f -p' on the bad OSTs. It should complain about
corrupted directories and other nice things. If it doesn't, upgrade to the latest
fsck from Whamcloud. (A few months ago we had a corruption that the 1.8.4-sun
e2fsprogs could neither detect nor fix.)
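
For the fsck step, something along these lines might be a starting point. It is only a sketch: FSCK_CMD, describe_fsck_rc and check_ost are names invented for the example, and the device you pass in is whatever block device backs the bad OST. The OST must not be mounted while this runs. The exit-code handling follows the e2fsck man page (0 = no errors, 1 or 2 = errors corrected, 4 = errors left uncorrected, 8 and above = e2fsck itself failed).

```shell
# Illustrative setting only, so the command can be swapped out.
FSCK_CMD="${FSCK_CMD:-e2fsck}"

# Translate an e2fsck exit code (a bitmask) into a one-word summary.
describe_fsck_rc() {
    df_rc="$1"
    if [ "$df_rc" -eq 0 ]; then
        echo "clean"          # nothing found: a newer fsck may be worth a try
    elif [ "$df_rc" -ge 8 ]; then
        echo "failed"         # operational, usage or library error
    elif [ $((df_rc & 4)) -ne 0 ]; then
        echo "uncorrected"    # -p gave up: needs a manual e2fsck run
    else
        echo "corrected"      # 1 or 2: errors were found and fixed
    fi
}

# Run 'e2fsck -f -p' on one unmounted OST device and print a summary line.
check_ost() {
    "$FSCK_CMD" -f -p "$1"
    co_rc=$?
    echo "$1: $(describe_fsck_rc "$co_rc")"
}
```

A result of 'clean' on an OST that refused to umount is the case where trying the newer fsck makes sense.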



> This is a recent phenomenon - we are not
> sure, but we think it may be related to a particular workload.  Our o2ib
> clients don't seem to have any trouble.

I don't think that this issue is related to the network: it's probably just 'bad
luck' that only the tcp clients hit the corrupted directories.



Regards,
 Adrian

