[prev in list] [next in list] [prev in thread] [next in thread]
List: linux-btrfs
Subject: Re: [PATCH] btrfs: free device if we fail to open it
From: Anand Jain <anand.jain () oracle ! com>
Date: 2021-12-03 8:31:49
Message-ID: b25ba451-18f3-073f-0691-c99b10fd8c9a () oracle ! com
[Download RAW message or body]
On 03/12/2021 00:02, Josef Bacik wrote:
> On Thu, Dec 02, 2021 at 03:09:38PM +0800, Anand Jain wrote:
>> On 02/12/2021 05:18, Josef Bacik wrote:
>>> We've been having transient failures of btrfs/197 because sometimes we
>>> don't show a missing device.
>>
>>> This turned out to be because I use LVM for my devices, and sometimes we
>>> race with udev and get the "/dev/dm-#" device registered as a device in
>>> the fs_devices list instead of "/dev/mapper/vg0-lv#" device.
>>> Thus when
>>> the test adds a device to the existing mount it doesn't find the old
>>> device in the original fs_devices list and remove it properly.
>>
>
> I think most of your confusion is because you don't know what btrfs/197 does, so
> I'll explain that and then answer your questions below.
>
> DEV=/dev/vg0/lv0
> RAID_DEVS=/dev/vg0/lv1 /dev/vg0/lv2 /dev/vg0/vl3 /dev/vg0/lv4
>
> # First we create a single fs and mount it
> mkfs.btrfs -f $DEV
> mount $DEV /mnt/test
>
> # Now we create the RAID fs
> mkfs.btrfs -f -d raid10 -m raid10 $RAID_DEVS
>
> # Now we add one of the raid devs to the single mount above
> btrfs device add /dev/vg0/lv2 /mnt/test
>
> # /dev/vg0/lv2 is no longer part of the fs it was made on, it's part of the fs
> # that's mounted at /mnt/test
>
> # Mount degraded with the raid setup
> mount -o degraded /dev/vg0/lv1 /mnt/scratch
>
> # Now we shouldn't have found /dev/vg0/lv2, because it was wiped and is no
> # longer part of the fs_devices for this thing, except it is because it wasn't
> # removed, so when we do the following it doesn't show as missing
> btrfs filesystem show /mnt/scratch
>
I thought I understood the test case. Now it is better. Thanks for
taking the time to explain.
>> The above para is confusing. It can go. IMHO. The device path shouldn't
>> matter as we depend on the bdev to compare in the device add thread.
>>
>> 2637 bdev = blkdev_get_by_path(device_path, FMODE_WRITE |
>> FMODE_EXCL,
>> 2638 fs_info->bdev_holder);
>> ::
>> 2657 list_for_each_entry_rcu(device, &fs_devices->devices, dev_list)
>> {
>> 2658 if (device->bdev == bdev) {
>> 2659 ret = -EEXIST;
>> 2660 rcu_read_unlock();
>> 2661 goto error;
>> 2662 }
>> 2663 }
>>
>
> This is on the init thread, this is just checking the fs_devices of /mnt/test,
> not the fs_devices of the RAID setup that we created, so this doesn't error out
> (nor should it) because we're adding it to our mounted fs.
>
>>
>>> This is fine in general, because when we open the devices we check the
>>> UUID, and properly skip using the device that we added to the other file
>>> system. However we do not remove it from our fs_devices,
>>
>> Hmm, context/thread is missing. Like, is it during device add or during
>> mkfs/dev-scan?
>>
>> AFAIK btrfs_free_stale_devices() checks and handles the removing of stale
>> devices in the fs_devices in both the contexts/threads device-add, mkfs
>> (device-scan).
>>
>> For example:
>>
>> $ alias cnt_free_stale_devices="bpftrace -e 'kprobe:btrfs_free_stale_devices
>> { @ = count(); }'"
>>
>> New FSID on 2 devices, we call free_stale_devices():
>>
>> $ cnt_free_stale_devices -c 'mkfs.btrfs -fq -draid1 -mraid1 /dev/vg/scratch0
>> /dev/vg/scratch1'
>> Attaching 1 probe...
>>
>> @: 2
>>
>> We do it only when there is a new fsid/device added to the fs_devices.
>>
>> For example:
>>
>> Clean up the fs_devices:
>> $ cnt_free_stale_devices -c 'btrfs dev scan --forget'
>> Attaching 1 probe...
>>
>> @: 1
>>
>> Now mounting devices are new to the fs_devices list, so call
>> free_stale_devices().
>>
>> $ cnt_free_stale_devices -c 'mount -o device=/dev/vg/scratch0
>> /dev/vg/scratch1 /btrfs'
>> Attaching 1 probe...
>>
>> @: 2
>>
>> $ umount /btrfs
>>
>> Below we didn't call free_stale_devices() because these two devices are
>> already in the appropriate fs_devices.
>>
>> $ cnt_free_stale_devices -c 'mount -o device=/dev/vg/scratch0
>> /dev/vg/scratch1 /btrfs'
>> Attaching 1 probe...
>>
>> @: 0
>>
>> $
>>
>> To me, it looks to be working correctly.
>>
>
> Yes it does work correctly, most of the time. If you run it in a loop 500 times
> it'll fail, because _sometimes_ udev goes in and does teh btrfs device scan and
> changes the name of the device in the fs_devices for the RAID group. So the
> btrfs_free_stale_devices() thing doesn't find the exising device, because it's
> just looking at the device->name, which is different from the device we're
> adding.
>
> We have the fs_devices for the RAID fs, and instead of /dev/vg0/lv2, we have
> /dev/dm-4 or whatever. So we do the addition of /dev/vg0/lv2, go to find it in
> any other fs_devices, and don't find it because strcmp("/dev/vg0/lv2",
> "/dev/dm0-4") != 0, and thus leave the device on the fs_devices for the RAID
> file system.
>
I got it. It shouldn't be difficult to reproduce and, I could
reproduce. Without this patch.
Below is a device with two different paths. dm and its mapper.
----------
$ ls -bli /dev/mapper/vg-scratch1 /dev/dm-1
561 brw-rw---- 1 root disk 252, 1 Dec 3 12:13 /dev/dm-1
565 lrwxrwxrwx 1 root root 7 Dec 3 12:13 /dev/mapper/vg-scratch1
-> ../dm-1
----------
Clean the fs_devices.
----------
$ btrfs dev scan --forget
----------
Use the mapper to do mkfs.btrfs.
----------
$ mkfs.btrfs -fq /dev/mapper/vg-scratch0
$ mount /dev/mapper/vg-scratch0 /btrfs
----------
Crete raid1 again using mapper path.
----------
$ mkfs.btrfs -U $uuid -fq -draid1 -mraid1 /dev/mapper/vg-scratch1
/dev/mapper/vg-scratch2
----------
Use dm path to add the device which belongs to another btrfs filesystem.
----------
$ btrfs dev add -f /dev/dm-1 /btrfs
----------
Now mount the above raid1 in degraded mode.
----------
$ mount -o degraded /dev/mapper/vg-scratch2 /btrfs1
----------
Before patch:
The /dev/mapper/vg-scratch1 is not shown as missing in the /btrfs1.
----------
$ btrfs fi show -m /btrfs1
Label: none uuid: 12345678-1234-1234-1234-123456789abc
Total devices 2 FS bytes used 704.00KiB
devid 1 size 10.00GiB used 1.26GiB path /dev/mapper/vg-scratch1
devid 2 size 10.00GiB used 2.54GiB path /dev/mapper/vg-scratch2
$ btrfs fi show -m /btrfs
Label: none uuid: e89a49c3-66a2-473c-aadf-3bf23d37a3a3
Total devices 2 FS bytes used 192.00KiB
devid 1 size 10.00GiB used 536.00MiB path /dev/mapper/vg-scratch0
devid 2 size 10.00GiB used 0.00B path /dev/mapper/vg-scratch1
----------
The kernel does set the devid-1 as missing.
(Checked using boilerplate code, which dumps fs_devices).
----------
$ cat /proc/fs/btrfs/devlist
::
[fsid: 12345678-1234-1234-1234-123456789abc]
::
device: /dev/dm-1 <---
devid: 1
::
dev_state: |IN_FS_METADATA|MISSING|dev_stats_valid
----------
After patch:
------------
$ btrfs fi show /btrfs1
Label: none uuid: 12345678-1234-1234-1234-123456789abc
Total devices 2 FS bytes used 128.00KiB
devid 2 size 10.00GiB used 1.26GiB path /dev/mapper/vg-scratch2
*** Some devices missing
$ btrfs fi show /btrfs
Label: none uuid: 761115c2-7fc2-4dc8-afab-baf2509ea6fe
Total devices 2 FS bytes used 192.00KiB
devid 1 size 10.00GiB used 536.00MiB path /dev/mapper/vg-scratch0
devid 2 size 10.00GiB used 0.00B path /dev/mapper/vg-scratch1
------------
------------
$ cat /proc/fs/btrfs/devlist
::
[fsid: 12345678-1234-1234-1234-123456789abc]
::
device: (null) <---
devid: 1
::
dev_state: |IN_FS_METADATA|MISSING|dev_stats_valid
-------------
>>> so when we get
>>> the device info we still see this old device and think that everything
>>> is ok.
>>
>>
>>> We have a check for -ENODATA coming back from reading the super off of
>>> the device to catch the wipefs case, but we don't catch literally any
>>> other error where the device is no longer valid and thus do not remove
>>> the device.
>>>
>>
>>> Fix this to not special case an empty device when reading the super
>>> block, and simply remove any device we are unable to open.
>>>
>>> With this fix we properly print out missing devices in the test case,
>>> and after 500 iterations I'm no longer able to reproduce the problem,
>>> whereas I could usually reproduce within 200 iterations.
>>>
>>
>> commit 7f551d969037 ("btrfs: free alien device after device add")
>> fixed the case we weren't freeing the stale device in the device-add
>> context.
>>
>> However, here I don't understand the thread/context we are missing to free
>> the stale device here.
>>
>>> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
>>> ---
>>> fs/btrfs/disk-io.c | 2 +-
>>> fs/btrfs/volumes.c | 2 +-
>>> 2 files changed, 2 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
>>> index 5c598e124c25..fa34b8807f8d 100644
>>> --- a/fs/btrfs/disk-io.c
>>> +++ b/fs/btrfs/disk-io.c
>>> @@ -3924,7 +3924,7 @@ struct btrfs_super_block *btrfs_read_dev_one_super(struct block_device *bdev,
>>> super = page_address(page);
>>> if (btrfs_super_magic(super) != BTRFS_MAGIC) {
>>> btrfs_release_disk_super(super);
>>> - return ERR_PTR(-ENODATA);
>>> + return ERR_PTR(-EINVAL);
>>> }
>>
>> I think you are removing ENODATA because it is going away in the parent
>> function. And, we don't reach this condition in the test case btrfs/197.
>> Instead, we fail here at line 645 below and, we return -EINVAL;
>
> I'm changing it to be consistent, instead of special casing this one specific
> failure, just treat all failures like we need to remove the device, and thus we
> can just make this be EINVAL as well.
Hmm, as this change is not related to the bug fix here. Can it be a
separate patch? The actual bug fix has to go to stable kernels as well.
>
>>
>> 645 if (memcmp(device->uuid, disk_super->dev_item.uuid,
>> BTRFS_UUID_SIZE))
>> 646 goto error_free_page;
>> 647
>>
>> 687 error_free_page:
>> 688 btrfs_release_disk_super(disk_super);
>> 689 blkdev_put(bdev, flags);
>> 690
>> 691 return -EINVAL;
>>
>> function stack carries the return code to the parent open_fs_devices().
>>
>> open_fs_devices()
>> btrfs_open_one_device()
>> btrfs_get_bdev_and_sb()
>> btrfs_read_dev_super()
>> btrfs_read_dev_one_super()
>>
>>
>>
>>> if (btrfs_super_bytenr(super) != bytenr_orig) {
>>> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
>>> index f38c230111be..890153dd2a2b 100644
>>> --- a/fs/btrfs/volumes.c
>>> +++ b/fs/btrfs/volumes.c
>>> @@ -1223,7 +1223,7 @@ static int open_fs_devices(struct btrfs_fs_devices *fs_devices,
>>> if (ret == 0 &&
>>> (!latest_dev || device->generation > latest_dev->generation)) {
>>> latest_dev = device;
>>> - } else if (ret == -ENODATA) {
>>> + } else if (ret) {
>>> fs_devices->num_devices--;
>>> list_del(&device->dev_list);
>>> btrfs_free_device(device);
>>>
>>
>> There are many other cases, for example including -EACCES (from
>> blkdev_get_by_path()) (I haven't checked other error codes). For which,
>> calling btrfs_free_device() is inappropriate.
>
> Yeah I was a little fuzzy on this. I think *any* failure should mean that we
> remove the device from the fs_devices tho right? So that we show we're missing
> a device, since we can't actually access it? I'm actually asking, because I
> think we can go either way, but to me I think any failure sure result in the
> removal of the device so we can re-scan the correct one. Thanks,
>
It is difficult to generalize, I guess. For example, consider the
transient errors during the boot-up and the errors due to slow to-ready
devices or the system-related errors such as ENOMEM/EACCES, all these
does not call for device-free. If we free the device for transient
errors, any further attempt to mount will fail unless it is device-scan
again.
Here the bug is about btrfs_free_stale_devices() which failed to
identify the same device when tricked by mixing the dm and mapper paths.
Can I check with you if there is another way to fix this by checking the
device major and min number or the serial number from the device inquiry
page?
Thanks, Anand
> Josef
>
[prev in list] [next in list] [prev in thread] [next in thread]
Configure |
About |
News |
Add a list |
Sponsored by KoreLogic