List:       linux-raid
Subject:    Re: Reliability of RAID 5 repair function (mismatch_cnt 9560824)
From:       Michael Metze <michael.metze@mailbox.org>
Date:       2018-02-23 9:22:03
Message-ID: 27b217a9-995a-2ca3-8440-61080b822b7a@mailbox.org

> On 21/02/18 21:09, Michael Metze wrote:
>> Hello there,
>>
>> I have been running a RAID 5 consisting of 4 Seagate 4TB NAS drives
>> (ST4000VN000) for 4 years now. The RAID device is "scrubbed" every month
>> using the "check" function. There has never been a problem. The filesystem
>> is a journaled ext4.
>>
>> Last week I added another external backup drive, and after a reboot, I
>> was missing disk 4 (sdd) of the RAID. It was physically turned on, there
>> was no error in the logs, but md0 was degraded. The SMART data are fine.
>> I added it back manually, and since I use a bitmap, it was accepted
>> immediately. I ran a "check" or scrub afterwards, which went fine.
>>
> This backup has nothing to do with the raid, I presume? Is it on USB?
> Because that causes problems for raid. Whatever, if it's not part of the
> raid then copying TO it should not cause any problems.

Correct. I was reorganizing my backup structure, since my photography
directory with raw files was growing too big. It was simply a copy TO USB
and to e-SATA, via a separate e-SATA port.

>> Anyway, after some heavy copy activity on the raid, I moved about 1/3 of
>> the data to the new backup drive, since I do not need it on the RAID.
>> After another reboot, the mount failed and reported that the filesystem
>> was not clean. I started an fsck, but it was reporting massive inode
>> errors ... so I stopped it to run another "check" on the RAID, which
>> gave me a mismatch_cnt of 9560824, which seems to be quite high.
> 
> If you've never had any errors before, that really is a lot!
Interestingly, when I lost drive 4 the first time, there was no error
during the scrub. Unfortunately, I remember losing it a second time. That
time the drive was rebuilt - I think this was the moment when the errors
were introduced.
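
For reference, the monthly scrub is triggered roughly like this (md0 is the
array here; a sketch, not my exact script):

  # start a read-only consistency check of the whole array
  echo check > /sys/block/md0/md/sync_action
  # when it has finished, the number of mismatched sectors is reported here
  cat /sys/block/md0/md/mismatch_cnt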

I have a backup of a big directory (150G) which is still accessible
and readable on the RAID. On your advice - and using a non-destructive
overlay file approach - I did 5 comparisons of backup vs. RAID content
using different RAID assemblies (diff command):

UUUU massive diffs/errors
_UUU very few diffs/errors
U_UU massive diffs/errors
UU_U massive diffs/errors
UUU_ no diffs/errors

This looks like strong evidence of a significant problem with drive 4
during the rebuild, i.e. drive 4 holds wrong data.
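
For reference, the overlay setup I used was roughly along these lines
(shown for one member only; device names, paths and the diff target are
simplified examples, not my exact setup):

  # sparse copy-on-write overlay per member, so nothing touches the real disk
  truncate -s $(blockdev --getsize64 /dev/sda) overlay-sda
  loop=$(losetup -f --show overlay-sda)
  size=$(blockdev --getsize /dev/sda)          # size in 512-byte sectors
  echo "0 $size snapshot /dev/sda $loop P 8" | dmsetup create sda
  # ... repeat for the members to be included, then assemble degraded
  mdadm --assemble --force --run /dev/md0 \
        /dev/mapper/sda /dev/mapper/sdb /dev/mapper/sdc
  mount -o ro /dev/md0 /mnt/raid
  diff -rq /mnt/backup/photos /mnt/raid/photos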

>> Right now I can mount the filesystem read-only, but two important
>> directories, which I didn't touch for almost 2 years, are gone. I cannot
>> explain what went wrong.
>>
>> I read and understood
>> https://raid.wiki.kernel.org/index.php/Scrubbing_the_drives
>> "With a raid-5 array the only thing that can be done when there is an
>> error is to correct the parity. This is also the most likely error - the
>> scenario where the data has been flushed and the parity not updated is
>> the expected cause of problems like this."
>>
>> Is there any way to detect which drive has a problem? Of course I
>> suspect drive 4. How reliable is the repair function of mdadm? I want to
>> make sure the RAID integrity is OK before I try to recover data from
>> the filesystem, which is probably quite a big next step. Otherwise I may
>> consider trying a repair with only drives 1-3 assembled in the RAID.
> 
> Okay. Run a SMART test on all the drives, especially drive 4.

Done. Still no errors according to smartctl -a. Long & short tests performed.
Please see
https://pastebin.com/Cj3TGYLR
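
Roughly what was run on each disk (sdd shown; a sketch of the commands, not
a transcript):

  smartctl -t short /dev/sdd     # quick self-test
  smartctl -t long /dev/sdd      # extended surface scan, takes several hours
  # after the tests have completed:
  smartctl -a /dev/sdd           # full attribute and self-test log report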

> 
> If you suspect a failed drive, then *DO* *NOT* run a repair, because
> this is not the normal "corrupt parity" problem - parity is scattered
> across all drives which means a lot of *data* is corrupted, which means
> a repair will trash it forever.

Understood. I guess since my missing photography folders were not
written to for almost a year, they should remain intact on drives 1-3.

>> Many many thanks for any hints in understanding the situation.
>> Michael
>>
> Okay, take drive 4 out, do a force-assemble of the other three, and try
> a check-only fsck. If that says everything is okay, then you know drive
> 4 is a dud.
> 
> I'll leave you with that for the moment - come back with the results of
> the SMART and the three-drive fsck.

Unfortunately, I ran an incomplete fsck, which I aborted due to massive
errors. This action may have introduced the damage to the filesystem
structure.

fsck results:
https://www.dropbox.com/sh/wxfa13ace68edr3/AABBQeapjGlKa70ihMPFkkgGa?dl=0
See the file fsck.UUU_ - a terrible result, 69 MB of output.

The filesystem is journaled; I will try different superblocks later ...
but the corrupted filesystem structure causes despair ...
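
The plan is roughly the following, again read-only and on top of the
overlays (the block numbers are just the common ext4 defaults and still
need checking):

  # list the backup superblock locations of the filesystem
  dumpe2fs /dev/md0 | grep -i superblock
  # read-only check using one of the backups (e.g. block 32768, 4K block size)
  e2fsck -fn -b 32768 -B 4096 /dev/md0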

When I mount the RAID read-only, I get listings like
d?????????   ? ?   ?          ?            ? Source
d?????????   ? ?   ?          ?            ? NikonTransfer

After the (non-destructive) repair, the directories have vanished.


> In the meantime, think seriously about going raid-6. You've backed up
> 1/3 of your 12TB - does that mean you could resize your array as an 8TB
> raid-6? Or could you add a fifth drive for a 12TB raid-6?
> 
> Cheers,
> Wol

I WILL definitely do this. ZFS RAIDZ2 (double parity) might be a good
option, since the parity, the filesystem structure and repair are all
handled in one place.
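
A rough sketch of the mdadm route, assuming the array is healthy again and
a fifth drive shows up as /dev/sde (placeholder name):

  # add the new disk as a spare, then reshape from raid5 to raid6
  mdadm --add /dev/md0 /dev/sde
  mdadm --grow /dev/md0 --level=6 --raid-devices=5 \
        --backup-file=/root/md0-grow.backup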

Thanks a lot.

