List: lustre-discuss
Subject: [Lustre-discuss] stuck OSS node
From: adrian () blinkenlights ! ch (Adrian Ulrich)
Date: 2011-08-05 9:01:47
Message-ID: 20110805110147.5a282884 () echelon ! ethz ! ch
Hi Craig,
> Has anyone seen anything like this?
Yes, we had a similar problem a couple of times.
First, try to umount all OSTs on the affected OSS.
Some OSTs will (most likely) fail to umount (umount gets stuck due to
the ll_ost_io_?? thread). Note the 'broken' OSTs and kill the OSS
(echo b > /proc/sysrq-trigger) after the 'good' OSTs finished umounting.
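
A rough sketch of the umount step, in case it helps. It is not the exact procedure we ran: the function name, the UMOUNT_CMD and WAIT_SECS variables and the mount points are made up for illustration. It starts each umount in the background and polls it, so one hung umount does not block the rest:

```shell
#!/bin/sh
# Sketch only: try to unmount each OST and report the ones whose umount
# does not finish in time. All names here are illustrative assumptions.
UMOUNT_CMD="${UMOUNT_CMD:-umount}"   # overridable, e.g. for a rehearsal
WAIT_SECS="${WAIT_SECS:-60}"         # arbitrary patience per OST

# Prints "good <mnt>", "failed <mnt>" or "stuck <mnt>" per mount point.
try_umount_osts() {
    for mnt in "$@"; do
        # Send umount's own output to stderr so stdout carries only the report.
        $UMOUNT_CMD "$mnt" >&2 &
        pid=$!
        waited=0
        # Poll instead of blocking: a umount hung in the kernel may never return.
        while kill -0 "$pid" 2>/dev/null && [ "$waited" -lt "$WAIT_SECS" ]; do
            sleep 1
            waited=$((waited + 1))
        done
        if kill -0 "$pid" 2>/dev/null; then
            echo "stuck $mnt"      # note this OST for the fsck later
        elif wait "$pid"; then
            echo "good $mnt"
        else
            echo "failed $mnt"     # umount returned an error (e.g. busy)
        fi
    done
}
```

Call it with your own OST mount points, e.g. `try_umount_osts /mnt/ost0 /mnt/ost1` (hypothetical paths). Only once every 'good' OST is reported do the sysrq reboot: 'b' reboots immediately without syncing or unmounting anything.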
Afterwards do a simple 'e2fsck -f -p' on the bad OSTs - it should
complain about corrupted directories and other nice things. If it
doesn't, upgrade to the latest fsck from Whamcloud. (We had a corruption
a few months ago that was unfixable/not detected with the 1.8.4-sun
e2fsprogs.)
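
The fsck pass could be looped over the bad OSTs roughly like this. Again only a sketch: the function name, the FSCK_CMD variable and the device paths are assumptions, not something from our setup. It relies on the documented e2fsck exit status (0 = no errors, 1 or 2 = errors corrected, 4 and above = errors left uncorrected or the run itself failed):

```shell
#!/bin/sh
# Sketch only: run 'e2fsck -f -p' on each OST device whose umount hung.
# FSCK_CMD is overridable so the loop can be rehearsed without touching disks.
FSCK_CMD="${FSCK_CMD:-e2fsck}"

fsck_bad_osts() {
    for dev in "$@"; do
        # -f forces a check even if the fs looks clean, -p repairs without prompting.
        $FSCK_CMD -f -p "$dev"
        rc=$?
        if [ "$rc" -eq 0 ]; then
            echo "$dev: clean"
        elif [ "$rc" -lt 4 ]; then
            echo "$dev: errors corrected (exit $rc)"
        else
            echo "$dev: needs manual attention (exit $rc)"
        fi
    done
}
```

Usage would be something like `fsck_bad_osts /dev/mapper/ost0003` (hypothetical device). A "clean" result on an OST that hung during umount is the suspicious case: that is when I would retry with a newer e2fsprogs.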
> This is a recent phenomena - we are not
> sure, but we think it may be related to a particular workload. Our o2ib
> clients don't seem to have any trouble.
I don't think that this issue is related to the network: It's probably
just 'bad luck' that only the tcp clients hit the corrupted directories.
Regards,
Adrian