List: lustre-discuss
Subject: [Lustre-discuss] Serious problem with OSTs
From: andreas.dilger@oracle.com (Andreas Dilger)
Date: 2010-12-30 6:47:21
Message-ID: E120DCD0-1912-4D95-A3E4-E617F9A5B932@oracle.com
[Download RAW message or body]
On 2010-12-29, at 20:22, "Mervini, Joseph A" <jamervi@sandia.gov> wrote:
>
> And examining the LUN with tunefs.lustre produces the following:
>
> [root@rio37 ~]# tunefs.lustre /dev/sdf
> checking for existing Lustre data: found last_rcvd
> tunefs.lustre: Unable to read 1.6 config /tmp/dirUvdBcz/mountdata.
That means the mountdata file is likely either missing or corrupted somehow.
> Read previous values:
> Target:
> Index: 54
> UUID: ostr)o37sdf_UID
> Lustre FS: lustre
> Mount type: ldiskfs
> Flags: 0x202
> (OST upgrade1.4 )
> Persistent mount opts:
> Parameters:
>
> I suspected that there were file system inconsistencies, so I ran fsck on one of the
> targets and got a large number of errors, primarily "Multiply-claimed blocks", while
> running e2fsck -fp. When it completed, the OS told me I needed to run fsck manually,
> which I did with the "-fy" options. This dumped a ton of inodes to lost+found. In
> addition, when it started it converted the file system from ext3 to ext2 during the
> fsck and then recreated the journal when it completed.
There was some sort of device-level corruption in this case. The e2fsck fixed it as
much as possible, and you should run ll_recover_lost_found_objs on the mounted
filesystem.
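As a rough sketch of that step (the mount point is an assumption; /dev/sdf is the device from this thread, and each affected OST would be treated the same way):

```shell
# Mount the OST backing device directly as ldiskfs (not as Lustre).
mount -t ldiskfs /dev/sdf /mnt/ost_tmp

# ll_recover_lost_found_objs moves objects that e2fsck dumped into
# lost+found back into place, using the object information stored in
# each file's extended attributes. -v prints what it restores.
ll_recover_lost_found_objs -v -d /mnt/ost_tmp/lost+found

umount /mnt/ost_tmp
```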
> However, I was still unable to mount the LUN, and tunefs.lustre still had the FATAL
> condition shown above.
> I AM able to mount all of the LUNs as ldiskfs devices, so I suspect that the Lustre
> config for those OSTs just got clobbered somehow. Also, looking at the inodes that
> were dumped to lost+found, most of them have timestamps that are more than a year
> old and by policy should have been purged, so I'm wondering if it is just an
> artifact of the file system not being checked for a very long time.
That depends on atime, which is normally only updated on disk on the MDS.
> Other things to note: the OSS is Fibre Channel attached to a DDN 9500, and the
> OSTs that are having problems are associated with one controller of the couplet.
> That is suspicious, but because neither controller is showing any faults I suspect
> that whatever has occurred did not happen recently.
It does seem to be the smoking gun.
> In addition, the /CONFIG/mountdata on all the targets originally had a timestamp
> of Aug 3 14:05 (and still does for the targets that can't be mounted).
> So I have two questions:
>
> How can I restore the config data on the OSTs that are having problems?
I think there was a thread on rebuilding the mountdata file recently.
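One approach along those lines (a sketch only, not a verified procedure; check the list archives before trying it, and note that the good-OST device name /dev/sde and the mount points are assumptions, while index 54 and /dev/sdf come from the tunefs.lustre output above):

```shell
# Mount a known-good OST from the same filesystem and the damaged one
# as ldiskfs, then copy the mountdata file across.
mount -t ldiskfs /dev/sde /mnt/ost_good   # healthy OST (assumed device)
mount -t ldiskfs /dev/sdf /mnt/ost_bad    # damaged OST from this thread

cp /mnt/ost_good/CONFIG/mountdata /mnt/ost_bad/CONFIG/mountdata

umount /mnt/ost_good /mnt/ost_bad

# Rewrite the target-specific values so the copied file matches this
# OST; --writeconf causes the config logs to be regenerated on the
# next mount.
tunefs.lustre --ost --index=54 --writeconf /dev/sdf
```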
> What does "Multiply-claimed blocks" mean and does it indicate corruption?
It means two or more inodes claim the same disk block, so yes, it indicates
disk-level corruption.
> I am afraid that running e2fsck may have compounded my problems and am holding off
> on doing any file system checks on the other 2 targets.
Well, it is needed at some point...
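When you do check the remaining targets, a read-only pass first will show the extent of the damage without changing anything (/dev/sdg below is a placeholder for whichever target is next):

```shell
# -n answers "no" to every repair prompt, so the device is not
# modified; this is safe to run while deciding how to proceed.
e2fsck -fn /dev/sdg

# Only repair (-fy answers "yes" to all prompts) once you have a
# backup, or at least the dry-run report above.
e2fsck -fy /dev/sdg
```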