List:       tru64-unix-managers
Subject:    Lsm problem after a crash
From:       Didier Godefroy <ldg () ulysium ! net>
Date:       2009-05-13 12:03:31
Message-ID: C63083B3.28A22%ldg () ulysium ! net

Hello all,

I'm experiencing a serious problem with my LSM setup after a crash.
The system runs Tru64 5.1B and has 7 disks: one is a hot spare, and the
other 6 are a set of 3 disks, each with a mirror.

I thought perhaps a drive failure caused the crash. It happened while I was
deleting a large folder containing a file that ls would not list but that
caused an error when running du.

Deleting that folder (over 600 MB) caused a kernel panic on reaching that
bad file. After the kernel panic, the boot would not succeed because fsck
failed on that volume, which sent the boot into a loop.

I had one of the drives from that mirrored volume pulled out, and that
allowed the boot to succeed. I think this is most likely because removing
the drive broke the mirror, which took the volume offline and kept fsck
from running on it.
After the boot succeeded, all the volumes started re-syncing and the
broken mirror caused a relocation to the hot spare.

All this would probably have worked, but before everything had finished,
another crash happened. Since that reboot, the mirrored volume whose mirror
was being relocated to the hot spare has stayed offline in a disabled state.

I waited for all the other re-syncing and relocations to finish before
attempting to bring the rebuilt volume back online, but couldn't.

The drives that held that volume actually contain 3 volumes, all mirrored.
The two small volumes are back to a normal, mirrored, online state with
their plexes relocated to the hot spare.
Only the large volume stayed offline:

v  srvvol       fsgen        DISABLED 59864864 -        ACTIVE   -       -
pl srv-pl-02    srvvol       DISABLED 59864863 -        ACTIVE   -       -
sd srv-sd-02    srv-pl-02    ENABLED  59864863 0        -        -       -
pl srv-pl-01    srvvol       DISABLED 59864864 -        STALE    -       -
sd spare-02     srv-pl-01    ENABLED  59864864 0        -        -       -

Everything else is kosher.

I'm wondering if the lengths may have something to do with this: the
subdisk with the spare name (spare-02) is one unit longer than the original
one (srv-sd-02), 59864864 versus 59864863.
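
To make the length comparison concrete, here is a small Python sketch that
reads the rows from the listing above and reports any plex that is shorter
than the volume it belongs to. The function names (parse, short_plexes) are
made up for illustration, and the column positions are assumed from the
listing as shown. Whether a short plex is what LSM means by a plex not being
"complete" is only a guess on my part, not something the output confirms.

```python
# Rows copied verbatim from the listing above.
# Assumed columns: type, name, association, kstate, length, offset, state, ...
LISTING = """\
v  srvvol       fsgen        DISABLED 59864864 -        ACTIVE   -       -
pl srv-pl-02    srvvol       DISABLED 59864863 -        ACTIVE   -       -
sd srv-sd-02    srv-pl-02    ENABLED  59864863 0        -        -       -
pl srv-pl-01    srvvol       DISABLED 59864864 -        STALE    -       -
sd spare-02     srv-pl-01    ENABLED  59864864 0        -        -       -
"""


def parse(text):
    """Turn each row of the listing into a small dict of the fields we need."""
    records = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 7:
            continue  # skip blank or truncated lines
        records.append({
            "type": fields[0],
            "name": fields[1],
            "assoc": fields[2],
            "kstate": fields[3],
            "length": int(fields[4]),
            "state": fields[6],
        })
    return records


def short_plexes(records):
    """Return (plex name, plex state, shortfall) for plexes shorter than their volume."""
    volume_length = {r["name"]: r["length"] for r in records if r["type"] == "v"}
    result = []
    for r in records:
        if r["type"] == "pl" and r["assoc"] in volume_length:
            shortfall = volume_length[r["assoc"]] - r["length"]
            if shortfall > 0:
                result.append((r["name"], r["state"], shortfall))
    return result


if __name__ == "__main__":
    for name, state, shortfall in short_plexes(parse(LISTING)):
        print(f"{name} ({state}) is {shortfall} unit(s) shorter than its volume")
```

On the rows above, the only plex flagged is srv-pl-02, the ACTIVE one, which
is one unit shorter than srvvol. The STALE plex on the spare matches the
volume length exactly.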

Trying to bring back online that volume gave this error:

volume start srvvol
lsm:volume: ERROR: Volume srvvol has no complete, non-volatile ACTIVE plexes

I tried disabling the spare plex, since it's marked as stale, and then
finally disassociated it:

volplex dis srv-pl-01

But after that plex was removed from that volume:

v  srvvol       fsgen        DISABLED 59864864 -        ACTIVE   -       -
pl srv-pl-02    srvvol       DISABLED 59864863 -        ACTIVE   -       -
sd srv-sd-02    srv-pl-02    ENABLED  59864863 0        -        -       -

I get the same error while trying to reactivate it.

The plex srv-pl-02 should still have the data from the broken mirror;
srv-pl-01 is the one that was lost when the drive was pulled.

So why can't the volume and plex, which both show as DISABLED but ACTIVE,
be brought back online?
How can I fix this?
I'm stuck.


Thanks for any hint,


-- 
Didier Godefroy
mailto:dg@ulysium.net

