'Re: Questions about BTRFS balance and scrub on non-RAID setup'


List:       linux-btrfs
Subject:    Re: Questions about BTRFS balance and scrub on non-RAID setup
From:       Lionel Bouton <lionel-subscription () bouton ! name>
Date:       2021-08-30 14:18:32
Message-ID: 04941c75-3ea5-32de-5978-efe5c5681ee2 () bouton ! name

Hi,

On 30/08/2021 at 15:20, Andrej Friesen wrote:
> [...]
> Use case and context for my questions:
>
> A file system as a service for our customers.
> This will be offered to the customer as a network share via NFS. That
> also means we do not have any control over the usage patterns.
> No idea about how often, how much they write small or big files to
> that file system.
>
> Technically we only create one block device with several terabytes and
> format this with btrfs. The actual block device which we format is
> backed by a ceph cluster.
> So the actual block device is already been on a distributed storage,
> therefore we will not do any raid configuration.
>
> The kernel will be a recent 5.10.
>
> Scrub:
>
> Do I need to regularly scrub?
> If so, what would be a recommendation for my use case?
>
> My conclusion after reading about the scrub. This checks for damaged
> data and will recover the data if this filesystem has another copy of
> that data.
> Since we will run without raid in btrfs this is not needed in my opinion.
> Am I right with my conclusion here?

Partially. Ceph replication/scrub/repair will cover individual disk/OSD
server faults but not faults at the origin of the data being stored.

We provide the same service for a customer. Several years ago the VM
hosting the NFS server for this customer ran on hardware that developed
a fault. The result was silent corruption of the data written by the NFS
server *before* it was handed to Ceph for storage (probably memory or CPU
related; we threw the server out of the cluster and never looked back...).
- Ceph scrubbing was of no use there because from its point of view the
replicated blocks were all fine.
- We launch btrfs scrub monthly by default and this is how we detected
the corruption.

We make regular rbd snapshots, so we were able to:
- switch the NFS server to an existing read-only replica (that could not
be corrupted by the same fault as it was replicated using simple
file-level content synchronization),
- restart the original NFS server using the last known good snapshot,
- rsync fresh data from the replica to the original server to catch up,
- switch back.

IIRC I've seen posts here about more checks done in the write path to
catch corruption but even if the likelihood of such corruption is lower
with recent kernels, hardware faults happen and software solutions can't
fully cover for them. Being able to catch corruption after the fact
relatively early makes recovery simpler and faster so I would only
disable scrubs on disposable data. Imagine discovering corruption when
you reboot your NFS server and the filesystem refuses to mount...
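
As an illustration of the periodic scrub mentioned above, here is a
minimal Python sketch that a monthly cron job or timer could call. The
function names and the mount point are hypothetical, not taken from our
setup; the only btrfs-progs feature it relies on is `btrfs scrub start
-B`, which runs the scrub in the foreground so the exit status can be
checked. It treats any non-zero exit status as "needs attention".

```python
import subprocess


def scrub_command(mountpoint):
    """Build a foreground scrub command for one mounted btrfs filesystem."""
    # -B keeps the scrub in the foreground so the exit status reflects the run.
    return ["btrfs", "scrub", "start", "-B", mountpoint]


def run_scrub(mountpoint, runner=subprocess.run):
    """Run a scrub and return True only if the command exits with status 0.

    `runner` is injectable so the logic can be exercised without root
    privileges or a real btrfs filesystem.
    """
    result = runner(scrub_command(mountpoint))
    return result.returncode == 0


# Example (hypothetical mount point), to be run as root:
#   if not run_scrub("/srv/nfs"):
#       ...alert whoever is on call...
```

What to do on failure (alerting, switching to a replica) is left out on
purpose, since it depends entirely on the local setup.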

>
> Balance:
>
> Do I need to regularly balance my filesystem?
> If so, what would be a recommendation for my use case?

A full balance is probably overkill in any situation and can saturate your
I/O bandwidth. With recent kernels there seems to be less need for
balancing. We still use an automatic balancing script that tries to
limit the amount of free space allocated to nearly empty allocation
groups (by using "usage=50+" filters) and cancels the balance if it
takes too long (to avoid limiting I/O performance for too long; the next
call continues the work), but I'm not sure it's still worth it. In our
case we have been bitten by out-of-space situations with old kernels,
caused by over-allocation of free space after temporary large space
usage, so we consider it an additional safeguard.
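
The "filtered balance with a time limit" idea can be sketched roughly as
follows. This is not our actual script: the function name, the
10-minute limit and the choice of `-dusage=50` (data block groups at
most 50% used) are assumptions made for the example. It only uses
`btrfs balance start` with a usage filter and `btrfs balance cancel`.

```python
import subprocess


def balance_with_deadline(mountpoint, usage=50, max_seconds=600,
                          popen=subprocess.Popen, run=subprocess.run):
    """Start a usage-filtered balance and cancel it after max_seconds.

    Returns "done", "failed" or "cancelled". `popen` and `run` are
    injectable so the control flow can be tested without btrfs.
    """
    # -dusage=N only relocates data block groups that are at most N% used.
    proc = popen(["btrfs", "balance", "start", f"-dusage={usage}", mountpoint])
    try:
        returncode = proc.wait(timeout=max_seconds)
    except subprocess.TimeoutExpired:
        # Ask the kernel to stop the running balance, then reap the child.
        run(["btrfs", "balance", "cancel", mountpoint])
        proc.wait()
        return "cancelled"
    return "done" if returncode == 0 else "failed"


# Example (hypothetical mount point), to be run as root:
#   balance_with_deadline("/srv/nfs", usage=50, max_seconds=600)
```

A later call simply starts a new filtered balance; block groups that
were already compacted no longer match the filter, so the work is
effectively resumed.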

You probably want to use autodefrag or a custom defragmentation solution
too. We weren't satisfied with autodefrag in some situations (where
fragmentation clearly crept in and I/O performance suffered until a
manual defrag) and developed our own scheduler that triggers
defragmentation based on file writes and slow full-filesystem scans,
using filefrag to estimate the fragmentation cost file by file.
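
To give an idea of the filefrag part, here is a small Python sketch. It
is not our scheduler: the helper names and the threshold of 100 extents
are made up for the example, and the raw extent count is used as a
crude stand-in for a real fragmentation cost. It assumes the usual
summary line printed by filefrag, "<path>: N extents found".

```python
import re
import subprocess

# Matches the summary line, e.g. "/srv/nfs/a.img: 1532 extents found".
_EXTENTS = re.compile(r": (\d+) extents? found\s*$")


def parse_extent_count(filefrag_line):
    """Extract the extent count from a filefrag summary line."""
    match = _EXTENTS.search(filefrag_line)
    if match is None:
        raise ValueError(f"unexpected filefrag output: {filefrag_line!r}")
    return int(match.group(1))


def extent_count(path, run=subprocess.run):
    """Run filefrag on one file and return its number of extents."""
    out = run(["filefrag", path], capture_output=True, text=True, check=True)
    return parse_extent_count(out.stdout.strip().splitlines()[-1])


def needs_defrag(extents, threshold=100):
    """Hypothetical policy: defragment once a file exceeds `threshold` extents."""
    return extents > threshold


# Example (hypothetical path):
#   if needs_defrag(extent_count("/srv/nfs/a.img")):
#       subprocess.run(["btrfs", "filesystem", "defragment", "/srv/nfs/a.img"])
```

A real policy would at least weigh the extent count against the file
size, since a large file with a few hundred extents may be perfectly
fine.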

Best regards,

Lionel