'Re: kernel panic after upgrading to Linux 5.5'


List:       linux-btrfs
Subject:    Re: kernel panic after upgrading to Linux 5.5
From:       Qu Wenruo <quwenruo.btrfs () gmx ! com>
Date:       2020-03-16 12:32:07
Message-ID: 1687a3f6-d263-c3f6-fae2-522a05830f10 () gmx ! com

On 2020/3/16 8:14 PM, Tomasz Chmielewski wrote:
> On 2020-03-16 19:26, Qu Wenruo wrote:
>> On 2020/3/16 1:19 PM, Tomasz Chmielewski wrote:
>>> On 2020-03-16 14:06, Qu Wenruo wrote:
>>>> On 2020/3/16 11:13 AM, Tomasz Chmielewski wrote:
>>>>> After upgrading to Linux 5.5 (tried 5.5.6, 5.5.9, also 5.6.0-rc5), the
>>>>> system panics shortly after mounting and starting to use a btrfs
>>>>> filesystem. Here is a dmesg - please advise how to deal with it.
>>>>> It has since crashed several times, because of panic=10 parameter
>>>>> (system boots, runs for a while, crashes, boots again, and so on).
>>>>>
>>>>> Mount options:
>>>>>
>>>>> noatime,ssd,space_cache=v2,user_subvol_rm_allowed
>>>>>
>>>>>
>>>>>
>>>>> [     65.777428] BTRFS info (device sda2): enabling ssd optimizations
>>>>> [     65.777435] BTRFS info (device sda2): using free space tree
>>>>> [     65.777436] BTRFS info (device sda2): has skinny extents
>>>>> [     98.225099] BTRFS error (device sda2): parent transid verify failed
>>>>> on 19718118866944 wanted 664218442 found 674530371
>>>>> [     98.225594] BTRFS error (device sda2): parent transid verify failed
>>>>> on 19718118866944 wanted 664218442 found 674530371
>>>>
>>>> This is the root cause, not quota.
>>>>
>>>> The metadata is already corrupted, and quota is the first to complain
>>>> about it.
>>>
>>> Still, should it crash the server, putting it into a cycle of
>>> crash-boot-crash-boot, possibly breaking the filesystem even more?
>>
>> The transid mismatch in the first place is the cause, and I'm not sure
>> how it happened.
>>
>> Do you have any history of the kernels used on that server?
>>
>> One potential source of corruption is kernels v5.2.0~v5.2.14, which
>> could leave some tree blocks not written to disk.
> 
> Yes, it used to run a lot of kernels, starting with 4.18 or perhaps even
> earlier.
> 
> 
>>> Also, how do I fix that corruption?
>>>
>>> This server had a drive added, a full balance (to RAID-10 for data and
>>> metadata) and scrub a few weeks ago, with no errors. Running scrub now
>>> to see if it shows up anything.
>>
>> Then at least at that time, it was not corrupted.
>>
>> Has there been any sudden power loss in recent days?
>> Another potential cause is out-of-spec FLUSH/FUA behavior, meaning
>> the hard disk controller is not correctly reporting FLUSH/FUA completion.
>>
>> In that case, if you use the same disk/controller and manually cause
>> power loss, it would fail after just several cycles.
> 
> Powerloss - possibly there was.

Don't get me wrong, all modern filesystems should survive unexpected
power loss in theory.

If it has run v5.2.0~v5.2.14 and a power loss happened, it is quite
likely that v5.2.0~v5.2.14 is the cause.

If v5.2.0~v5.2.14 is not involved, and there is no extra layer between
btrfs and the block device, then I would suspect the disk (and maybe do
power loss tests to confirm it's the disk and not btrfs).

Anyway, to be clear again, if everything works as expected, then
power loss shouldn't cause any problems on btrfs.
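
As a rough sketch only (not a btrfs tool), the check of whether a kernel
release falls inside the v5.2.0~v5.2.14 range could look like the Python
below. The function names are made up for illustration, the "5.2.9"
sample is hypothetical, and suffixes such as "-rc5" are simply ignored,
which is an assumption about how you want to treat pre-release kernels.

```python
import re

# Range named in this thread as a potential corruption source.
AFFECTED_MIN = (5, 2, 0)
AFFECTED_MAX = (5, 2, 14)


def parse_kernel_version(release):
    """Return (major, minor, patch) from a string such as '5.2.9-generic'.

    Anything after the numeric part (e.g. '-rc5', '-generic') is ignored.
    A missing patch level (e.g. '4.18') is treated as 0.
    """
    m = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if m is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    major, minor, patch = m.groups()
    return int(major), int(minor), int(patch or 0)


def in_affected_range(release):
    """True if the release lies within v5.2.0~v5.2.14 inclusive."""
    return AFFECTED_MIN <= parse_kernel_version(release) <= AFFECTED_MAX


if __name__ == "__main__":
    # Hypothetical list of releases collected from old boot logs.
    for release in ["4.18", "5.2.9", "5.5.6", "5.6.0-rc5"]:
        print(release, in_affected_range(release))
```

Feeding it every release the server has ever booted (from old logs or
package history, if you still have them) would show whether the
v5.2.0~v5.2.14 explanation is even possible for this filesystem.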

Thanks,
Qu

> 
> 
> Tomasz

