'Re: RAID6 gets stuck during reshape with 100% CPU'


List:       linux-raid
Subject:    Re: RAID6 gets stuck during reshape with 100% CPU
From:       Anssi Hannula <anssi.hannula () iki ! fi>
Date:       2019-10-30 18:25:31
Message-ID: c8b37bc022aca270102fe7114be7051e () iki ! fi

Song Liu wrote on 2019-10-29 23:55:
> On Tue, Oct 29, 2019 at 1:45 PM Anssi Hannula <anssi.hannula@iki.fi> 
> wrote:
> > 
> > Song Liu wrote on 2019-10-29 22:28:
> > > On Tue, Oct 29, 2019 at 12:05 PM Anssi Hannula <anssi.hannula@iki.fi>
> > > wrote:
> > > > 
> > > > Song Liu wrote on 2019-10-29 08:04:
> > > > > I guess we get into the "is_bad" case, but it should not be the case?
> > > > 
> > > > Right, is_bad is set, which causes R5_Insync and R5_ReadError to be set
> > > > on lines 4497-4498, and R5_Insync to be cleared on line 4554 (if
> > > > R5_ReadError then clear R5_Insync).
> > > > 
> > > > As mentioned in my first message and seen in
> > > > http://onse.fi/files/reshape-infloop-issue/examine-all.txt , the MD bad
> > > > block lists contain blocks (suspiciously identical across devices).
> > > > So maybe the code can't properly handle the case where 10 devices have
> > > > the same block in their bad block list. Not quite sure what "handle"
> > > > should mean in this case but certainly something else than a
> > > > handle_stripe() loop :)
> > > > There is a "bad" block on 10 devices on sector 198504960, which I guess
> > > > matches sh->sector 198248960 due to data offset of 256000 sectors (per
> > > > --examine).
> > > 
> > > OK, it makes sense now. I didn't add the data offset when checking the
> > > bad block data.
> > > 
> > > > 
> > > > I've wondered if "dd if=/dev/md0 of=/dev/md0" for the affected blocks
> > > > would clear the bad blocks and avoid this issue, but I haven't tried
> > > > that yet so that the infinite loop issue can be investigated/fixed
> > > > first. I already checked that /dev/md0 is fully readable (which also
> > > > confuses me a bit since md(8) says "Attempting to read from a known
> > > > bad block will cause a read error"... maybe I'm missing something).
> > > > 
> > > 
> > > Maybe try these steps?
> > > 
> > > https://raid.wiki.kernel.org/index.php/The_Badblocks_controversy#How_do_I_fix_a_Bad_Blocks_problem.3F
> > > 
> > 
> > Yeah, I guess those steps would probably resolve my situation. BTW,
> > "--update=force-no-bbl" is not mentioned in mdadm(8); is that on
> > purpose? I was trying to find such an option earlier.
> > 
> > If you don't need anything more from the array, I'll go ahead and try
> > clearing the seemingly bogus bad block lists.
> 
> Please go ahead. We already got quite a few logs.

It seems that was indeed the issue: clearing the bad block log allowed the
reshape to continue normally.
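
For anyone hitting this via the archive, the sector arithmetic discussed
earlier in the thread can be sketched roughly as below. This is only an
illustration under the assumption made in the thread, namely that the sector
numbers in the bad block list include the data offset while sh->sector does
not. The function names, the device names and the length of 8 sectors are
made up for the example and do not come from mdadm or the kernel.

```python
def bbl_to_stripe_sector(bbl_sector: int, data_offset: int) -> int:
    """Map a bad-block-list sector to the corresponding sh->sector value.

    Assumption (from the thread): bad block list entries include the
    data offset reported by --examine, sh->sector does not.
    """
    if bbl_sector < data_offset:
        raise ValueError("bad block entry lies before the data area")
    return bbl_sector - data_offset


def entries_on_all_devices(bbl_by_device: dict) -> set:
    """Return the (start, length) entries present in every device's list.

    bbl_by_device maps a device name to an iterable of (start, length)
    tuples, both in sectors. Returns an empty set for an empty mapping.
    """
    lists = [set(entries) for entries in bbl_by_device.values()]
    if not lists:
        return set()
    return set.intersection(*lists)


if __name__ == "__main__":
    # Values from the thread: bad block at 198504960, data offset 256000.
    print(bbl_to_stripe_sector(198504960, 256000))

    # Hypothetical device names and lengths, for illustration only.
    example = {
        "sda": [(198504960, 8)],
        "sdb": [(198504960, 8)],
        "sdc": [(198504960, 8), (1024, 8)],
    }
    print(entries_on_all_devices(example))
```

An entry that shows up identically in the lists of more devices than the
array has redundancy for is the kind of thing that looked suspicious in my
case.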

Thanks for your help.

-- 
Anssi Hannula

