
List:       linux-btrfs
Subject:    Re: [PATCH] btrfs: free device if we fail to open it
From:       Anand Jain <anand.jain@oracle.com>
Date:       2021-12-03 8:31:49
Message-ID: b25ba451-18f3-073f-0691-c99b10fd8c9a@oracle.com



On 03/12/2021 00:02, Josef Bacik wrote:
> On Thu, Dec 02, 2021 at 03:09:38PM +0800, Anand Jain wrote:
>> On 02/12/2021 05:18, Josef Bacik wrote:
>>> We've been having transient failures of btrfs/197 because sometimes we
>>> don't show a missing device.
>>
>>> This turned out to be because I use LVM for my devices, and sometimes we
>>> race with udev and get the "/dev/dm-#" device registered as a device in
>>> the fs_devices list instead of "/dev/mapper/vg0-lv#" device.  Thus when
>>> the test adds a device to the existing mount it doesn't find the old
>>> device in the original fs_devices list and remove it properly.
>>
> 
> I think most of your confusion is because you don't know what btrfs/197 does, so
> I'll explain that and then answer your questions below.
> 
> DEV=/dev/vg0/lv0
> RAID_DEVS=/dev/vg0/lv1 /dev/vg0/lv2 /dev/vg0/lv3 /dev/vg0/lv4
> 
> # First we create a single fs and mount it
> mkfs.btrfs -f $DEV
> mount $DEV /mnt/test
> 
> # Now we create the RAID fs
> mkfs.btrfs -f -d raid10 -m raid10 $RAID_DEVS
> 
> # Now we add one of the raid devs to the single mount above
> btrfs device add /dev/vg0/lv2 /mnt/test
> 
> # /dev/vg0/lv2 is no longer part of the fs it was made on, it's part of the fs
> # that's mounted at /mnt/test
> 
> # Mount degraded with the raid setup
> mount -o degraded /dev/vg0/lv1 /mnt/scratch
> 
> # Now we shouldn't have found /dev/vg0/lv2, because it was wiped and is no
> # longer part of the fs_devices for this thing, except it is because it wasn't
> # removed, so when we do the following it doesn't show as missing
> btrfs filesystem show /mnt/scratch
> 

  I thought I understood the test case; now it is clearer. Thanks for
taking the time to explain.


>>   The above paragraph is confusing; it can go, IMHO. The device path shouldn't
>> matter, as we compare against the bdev in the device-add thread.
>>
>> 2637         bdev = blkdev_get_by_path(device_path, FMODE_WRITE | FMODE_EXCL,
>> 2638                                   fs_info->bdev_holder);
>> ::
>> 2657         list_for_each_entry_rcu(device, &fs_devices->devices, dev_list) {
>> 2658                 if (device->bdev == bdev) {
>> 2659                         ret = -EEXIST;
>> 2660                         rcu_read_unlock();
>> 2661                         goto error;
>> 2662                 }
>> 2663         }
>>
> 
> This is on the init thread, this is just checking the fs_devices of /mnt/test,
> not the fs_devices of the RAID setup that we created, so this doesn't error out
> (nor should it) because we're adding it to our mounted fs.
> 
>>
>>> This is fine in general, because when we open the devices we check the
>>> UUID, and properly skip using the device that we added to the other file
>>> system.  However we do not remove it from our fs_devices,
>>
>> Hmm, context/thread is missing. Like, is it during device add or during
>> mkfs/dev-scan?
>>
>> AFAIK btrfs_free_stale_devices() checks and handles removing stale
>> devices from the fs_devices in both contexts/threads: device-add and mkfs
>> (device-scan).
>>
>>   For example:
>>
>> $ alias cnt_free_stale_devices="bpftrace -e 'kprobe:btrfs_free_stale_devices
>> { @ = count(); }'"
>>
>> New FSID on 2 devices, we call free_stale_devices():
>>
>> $ cnt_free_stale_devices -c 'mkfs.btrfs -fq -draid1 -mraid1 /dev/vg/scratch0
>> /dev/vg/scratch1'
>> Attaching 1 probe...
>>
>> @: 2
>>
>>   We do it only when there is a new fsid/device added to the fs_devices.
>>
>> For example:
>>
>> Clean up the fs_devices:
>> $ cnt_free_stale_devices -c 'btrfs dev scan --forget'
>> Attaching 1 probe...
>>
>> @: 1
>>
>> Now the mounting devices are new to the fs_devices list, so we call
>> free_stale_devices().
>>
>> $ cnt_free_stale_devices -c 'mount -o device=/dev/vg/scratch0
>> /dev/vg/scratch1 /btrfs'
>> Attaching 1 probe...
>>
>> @: 2
>>
>> $ umount /btrfs
>>
>> Below we didn't call free_stale_devices() because these two devices are
>> already in the appropriate fs_devices.
>>
>> $ cnt_free_stale_devices -c 'mount -o device=/dev/vg/scratch0
>> /dev/vg/scratch1 /btrfs'
>> Attaching 1 probe...
>>
>> @: 0
>>
>> $
>>
>> To me, it looks to be working correctly.
>>
> 
> Yes it does work correctly, most of the time.  If you run it in a loop 500 times
> it'll fail, because _sometimes_ udev goes in and does the btrfs device scan and
> changes the name of the device in the fs_devices for the RAID group.  So the
> btrfs_free_stale_devices() thing doesn't find the existing device, because it's
> just looking at the device->name, which is different from the device we're
> adding.
> 
> We have the fs_devices for the RAID fs, and instead of /dev/vg0/lv2, we have
> /dev/dm-4 or whatever.  So we do the addition of /dev/vg0/lv2, go to find it in
> any other fs_devices, and don't find it because strcmp("/dev/vg0/lv2",
> "/dev/dm-4") != 0, and thus leave the device on the fs_devices for the RAID
> file system.
> 

  I got it. It isn't difficult to reproduce, and I could reproduce it
without this patch.
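
  For the record, here is a minimal sketch (simplified, not the actual
kernel code) of the path-string matching that gets tricked: only the
recorded name is compared, so the dm alias never matches the mapper path.

----------
/*
 * Simplified illustration of the matching done by
 * btrfs_free_stale_devices(); the real helper in fs/btrfs/volumes.c
 * differs, this only shows why an aliased path defeats it.
 */
static bool stale_path_matches(const char *registered_name, const char *path)
{
	/*
	 * Pure string comparison: "/dev/dm-1" vs "/dev/mapper/vg-scratch1"
	 * never match even though both resolve to the same block device,
	 * so the stale entry stays on the RAID fs_devices.
	 */
	return registered_name && strcmp(registered_name, path) == 0;
}
----------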


  Below is a device with two different paths: the dm node and its mapper symlink.

----------
  $ ls -bli /dev/mapper/vg-scratch1  /dev/dm-1
  561 brw-rw---- 1 root disk 252, 1 Dec  3 12:13 /dev/dm-1
  565 lrwxrwxrwx 1 root root      7 Dec  3 12:13 /dev/mapper/vg-scratch1 -> ../dm-1
----------

  Clean the fs_devices.

----------
  $ btrfs dev scan --forget
----------

  Use the mapper to do mkfs.btrfs.

----------
  $ mkfs.btrfs -fq /dev/mapper/vg-scratch0
  $ mount /dev/mapper/vg-scratch0 /btrfs
----------

  Create the raid1 filesystem, again using the mapper paths.

----------
  $ mkfs.btrfs -U $uuid -fq -draid1 -mraid1 /dev/mapper/vg-scratch1 
/dev/mapper/vg-scratch2
----------

  Use the dm path to add the device, which still belongs to the other btrfs filesystem.

----------
  $ btrfs dev add -f /dev/dm-1 /btrfs
----------

  Now mount the above raid1 in degraded mode.

----------
  $ mount -o degraded /dev/mapper/vg-scratch2 /btrfs1
----------


Before patch:

  /dev/mapper/vg-scratch1 is not shown as missing in /btrfs1.

----------
  $ btrfs fi show -m /btrfs1
  Label: none  uuid: 12345678-1234-1234-1234-123456789abc
	Total devices 2 FS bytes used 704.00KiB
	devid    1 size 10.00GiB used 1.26GiB path /dev/mapper/vg-scratch1
	devid    2 size 10.00GiB used 2.54GiB path /dev/mapper/vg-scratch2

$ btrfs fi show -m /btrfs
Label: none  uuid: e89a49c3-66a2-473c-aadf-3bf23d37a3a3
	Total devices 2 FS bytes used 192.00KiB
	devid    1 size 10.00GiB used 536.00MiB path /dev/mapper/vg-scratch0
	devid    2 size 10.00GiB used 0.00B path /dev/mapper/vg-scratch1
----------

  The kernel does mark devid 1 as missing (checked using boilerplate
code that dumps fs_devices).

----------
  $ cat /proc/fs/btrfs/devlist
::
[fsid: 12345678-1234-1234-1234-123456789abc]
::
		device:		/dev/dm-1  <---
		devid:		1
::
		dev_state:	|IN_FS_METADATA|MISSING|dev_stats_valid
----------


After patch:

------------
$ btrfs fi show /btrfs1
Label: none  uuid: 12345678-1234-1234-1234-123456789abc
	Total devices 2 FS bytes used 128.00KiB
	devid    2 size 10.00GiB used 1.26GiB path /dev/mapper/vg-scratch2
	*** Some devices missing

$ btrfs fi show /btrfs
Label: none  uuid: 761115c2-7fc2-4dc8-afab-baf2509ea6fe
	Total devices 2 FS bytes used 192.00KiB
	devid    1 size 10.00GiB used 536.00MiB path /dev/mapper/vg-scratch0
	devid    2 size 10.00GiB used 0.00B path /dev/mapper/vg-scratch1
------------

------------
$ cat /proc/fs/btrfs/devlist
::
[fsid: 12345678-1234-1234-1234-123456789abc]
::
		device:		(null)  <---
		devid:		1
::
		dev_state:	|IN_FS_METADATA|MISSING|dev_stats_valid
-------------



>>> so when we get
>>> the device info we still see this old device and think that everything
>>> is ok.
>>
>>
>>> We have a check for -ENODATA coming back from reading the super off of
>>> the device to catch the wipefs case, but we don't catch literally any
>>> other error where the device is no longer valid and thus do not remove
>>> the device.
>>>
>>
>>> Fix this to not special case an empty device when reading the super
>>> block, and simply remove any device we are unable to open.
>>>
>>> With this fix we properly print out missing devices in the test case,
>>> and after 500 iterations I'm no longer able to reproduce the problem,
>>> whereas I could usually reproduce within 200 iterations.
>>>
>>
>> commit 7f551d969037 ("btrfs: free alien device after device add")
>> fixed the case we weren't freeing the stale device in the device-add
>> context.
>>
>> However, I don't understand which thread/context we are missing that fails
>> to free the stale device here.
>>
>>> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
>>> ---
>>>    fs/btrfs/disk-io.c | 2 +-
>>>    fs/btrfs/volumes.c | 2 +-
>>>    2 files changed, 2 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
>>> index 5c598e124c25..fa34b8807f8d 100644
>>> --- a/fs/btrfs/disk-io.c
>>> +++ b/fs/btrfs/disk-io.c
>>> @@ -3924,7 +3924,7 @@ struct btrfs_super_block *btrfs_read_dev_one_super(struct block_device *bdev,
>>>    	super = page_address(page);
>>>    	if (btrfs_super_magic(super) != BTRFS_MAGIC) {
>>>    		btrfs_release_disk_super(super);
>>> -		return ERR_PTR(-ENODATA);
>>> +		return ERR_PTR(-EINVAL);
>>>    	}
>>
>>   I think you are removing ENODATA because it is going away in the parent
>> function. And we don't reach this condition in the test case btrfs/197;
>> instead, we fail at line 645 below and return -EINVAL.
> 
> I'm changing it to be consistent, instead of special casing this one specific
> failure, just treat all failures like we need to remove the device, and thus we
> can just make this be EINVAL as well.

  Hmm, as this change is not related to the bug fix here, can it be a
separate patch? The actual bug fix has to go to the stable kernels as well.

> 
>>
>>   645         if (memcmp(device->uuid, disk_super->dev_item.uuid,
>> BTRFS_UUID_SIZE))
>>   646                 goto error_free_page;
>>   647
>>
>>   687 error_free_page:
>>   688         btrfs_release_disk_super(disk_super);
>>   689         blkdev_put(bdev, flags);
>>   690
>>   691         return -EINVAL;
>>
>> function stack carries the return code to the parent open_fs_devices().
>>
>> open_fs_devices()
>>   btrfs_open_one_device()
>>    btrfs_get_bdev_and_sb()
>>     btrfs_read_dev_super()
>>      btrfs_read_dev_one_super()
>>
>>
>>
>>>    	if (btrfs_super_bytenr(super) != bytenr_orig) {
>>> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
>>> index f38c230111be..890153dd2a2b 100644
>>> --- a/fs/btrfs/volumes.c
>>> +++ b/fs/btrfs/volumes.c
>>> @@ -1223,7 +1223,7 @@ static int open_fs_devices(struct btrfs_fs_devices *fs_devices,
>>>    		if (ret == 0 &&
>>>    		    (!latest_dev || device->generation > latest_dev->generation)) {
>>>    			latest_dev = device;
>>> -		} else if (ret == -ENODATA) {
>>> +		} else if (ret) {
>>>    			fs_devices->num_devices--;
>>>    			list_del(&device->dev_list);
>>>    			btrfs_free_device(device);
>>>
>>
>> There are many other cases, for example -EACCES (from
>> blkdev_get_by_path()); I haven't checked the other error codes. For those,
>> calling btrfs_free_device() is inappropriate.
> 
> Yeah I was a little fuzzy on this.  I think *any* failure should mean that we
> remove the device from the fs_devices though, right?  So that we show we're
> missing a device, since we can't actually access it?  I'm actually asking,
> because I think we can go either way, but to me any failure should result in
> the removal of the device so we can re-scan the correct one.  Thanks,
> 

It is difficult to generalize, I guess. For example, consider transient
errors during boot-up, errors from devices that are slow to become ready,
or system-related errors such as ENOMEM/EACCES; none of these calls for
freeing the device. If we free the device on a transient error, any
further attempt to mount will fail until the device is scanned again.
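
Just to illustrate the point (a hypothetical sketch, not a proposed
patch): something like the below would be needed to separate errors that
really mean "this device is no longer ours" from transient ones, and even
that classification is debatable.

----------
/*
 * Hypothetical helper, only to illustrate the argument above; the
 * error-code classification here is debatable and not part of any patch.
 */
static bool open_failure_is_permanent(int err)
{
	switch (err) {
	case -ENODATA:	/* superblock wiped */
	case -EINVAL:	/* bad magic, wrong bytenr or UUID mismatch */
		return true;	/* safe to drop the device from fs_devices */
	case -ENOMEM:	/* transient allocation failure */
	case -EACCES:	/* permission or other system-level problem */
	default:
		return false;	/* keep it, a later scan/mount may succeed */
	}
}
----------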

Here the bug is that btrfs_free_stale_devices() fails to identify the
same device when tricked by mixing the dm and mapper paths.
Can I check with you whether there is another way to fix this, for example
by comparing the device major/minor numbers or the serial number from the
device inquiry page?
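
For example (a userspace sketch, hypothetical, just to show the idea),
both paths report the same major:minor in st_rdev:

----------
/* Build: gcc -o samedev samedev.c
 * Run:   ./samedev /dev/mapper/vg-scratch1 /dev/dm-1 */
#include <stdio.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>

int main(int argc, char **argv)
{
	struct stat a, b;

	if (argc != 3) {
		fprintf(stderr, "usage: %s <path1> <path2>\n", argv[0]);
		return 1;
	}
	if (stat(argv[1], &a) || stat(argv[2], &b)) {
		perror("stat");
		return 1;
	}
	/* stat() follows the /dev/mapper symlink, so both aliases of one
	 * device report the same block major:minor in st_rdev. */
	printf("%s -> %u:%u\n", argv[1], major(a.st_rdev), minor(a.st_rdev));
	printf("%s -> %u:%u\n", argv[2], major(b.st_rdev), minor(b.st_rdev));
	printf(a.st_rdev == b.st_rdev ? "same device\n" : "different devices\n");
	return 0;
}
----------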

Thanks, Anand

> Josef
> 