
List:       ceph-devel
Subject:    Re: ceph on non-btrfs file systems
From:       Christian Brunner <chb@muc.de>
Date:       2011-10-24 16:22:42
Message-ID: CAO47_-9jp===DT=scpe=U8BnPnUCAVz7xUWVCC9AMVmx67CdaA@mail.gmail.com

Thanks for explaining this. I don't have any objections against btrfs
as an OSD filesystem. Even the fact that there is no btrfs-fsck doesn't
scare me, since I can use the ceph replication to recover a lost
btrfs filesystem. The only problem I have is that btrfs is not stable
on our side, and I wonder what you are doing to make it work. (Maybe
it's related to the load pattern of using ceph as a backend store for
qemu.)

Here is a list of the btrfs problems I'm having:

- When I run ceph with the default configuration (btrfs snaps enabled),
I can see a rapid increase in disk I/O after a few hours of uptime.
btrfs-cleaner is spending more and more time in
btrfs_clean_old_snapshots().
- When I run ceph with btrfs snaps disabled, the situation is slightly
better. I can run an OSD for about 3 days without problems, but then
the load increases again. This time, I can see that ceph-osd
(blkdev_issue_flush) and btrfs-endio-wri are doing more work than
usual.

Another thing is that I'm seeing a WARNING: at fs/btrfs/inode.c:2114
from time to time. Maybe it's related to the performance issues, but I
haven't been able to verify this.

It's really sad to see that ceph performance and stability are
suffering so much from the underlying filesystems, and that this
hasn't changed over the last few months.

Kind regards,
Christian

2011/10/24 Sage Weil <sage@newdream.net>:
> Although running on ext4, xfs, or whatever other non-btrfs you want mostly
> works, there are a few important remaining issues:
>
> 1- ext4 limits total xattrs to 4KB.  This can cause problems in some
> cases, as Ceph uses xattrs extensively.  Most of the time we don't hit
> this.  We do hit the limit with radosgw pretty easily, though, and may
> also hit it in exceptional cases where the OSD cluster is very unhealthy.
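>
> A quick way to see the difference between the filesystems is to try to
> store one oversized attribute (an illustrative sketch only, not Ceph
> code; the attribute name, size, and test path are arbitrary):
>
>   /* xattr_probe.c -- try to store a single ~6KB xattr on a file.
>    * ext4 keeps all of an inode's xattrs within one 4KB block, so this
>    * typically fails there (ENOSPC), while XFS accepts it. */
>   #include <errno.h>
>   #include <stdio.h>
>   #include <string.h>
>   #include <sys/xattr.h>
>
>   int main(int argc, char **argv)
>   {
>       const char *path = argc > 1 ? argv[1] : "testfile";
>       static char buf[6000];
>
>       memset(buf, 'x', sizeof(buf));
>       if (setxattr(path, "user.bigattr", buf, sizeof(buf), 0) < 0) {
>           fprintf(stderr, "setxattr: %s\n", strerror(errno));
>           return 1;
>       }
>       printf("stored %zu-byte xattr on %s\n", sizeof(buf), path);
>       return 0;
>   }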
>
> There is a large xattr patch for ext4 from the Lustre folks that has been
> floating around for (I think) years.  Maybe as interest grows in running
> Ceph on ext4 this can move upstream.
>
> Previously we were being forgiving about large setxattr failures on ext3,
> but we found that was leading to corruption in certain cases (because we
> couldn't set our internal metadata), so the next release will assert/crash
> in that case (fail-stop instead of fail-maybe-eventually-corrupt).
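>
> In rough C terms the change amounts to something like this (a sketch
> only; the function name and attribute name are made up, not the actual
> ObjectStore code):
>
>   #include <stdio.h>
>   #include <stdlib.h>
>   #include <sys/xattr.h>
>
>   /* Before: a failed setxattr was logged and tolerated, leaving the
>    * object without its internal metadata.  Now: fail-stop. */
>   void store_internal_meta(const char *path, const void *meta, size_t len)
>   {
>       if (setxattr(path, "user.ceph.meta", meta, len, 0) < 0) {
>           perror("setxattr");
>           abort();   /* crash now rather than corrupt data later */
>       }
>   }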
>
> XFS does not have an xattr size limit and thus does not have this problem.
>
> 2- The other problem is with OSD journal replay of non-idempotent
> transactions.  On non-btrfs backends, the Ceph OSDs use a write-ahead
> journal.  After restart, the OSD does not know exactly which transactions
> in the journal may have already been committed to disk, and may reapply a
> transaction again during replay.  For most operations (write, delete,
> truncate) this is fine.
>
> Some operations, though, are non-idempotent.  The simplest example is
> CLONE, which copies (efficiently, on btrfs) data from one object to
> another.  If the source object is modified, the osd restarts, and then
> the clone is replayed, the target will get incorrect (newer) data.  For
> example,
>
> 1- clone A -> B
> 2- modify A
>   <osd crash, replay from 1>
>
> B will get new instead of old contents.
>
> (This doesn't happen on btrfs because the snapshots allow us to replay
> from a known consistent point in time.)
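>
> To make the hazard concrete, here is a tiny stand-alone simulation of
> the sequence above (not Ceph code; the "objects" are just strings and
> the "journal" is the two calls replayed by hand):
>
>   /* replay_sim.c -- naive replay of a non-idempotent CLONE. */
>   #include <stdio.h>
>   #include <string.h>
>
>   static char obj_a[16] = "old";
>   static char obj_b[16] = "";
>
>   static void clone_a_to_b(void) { strcpy(obj_b, obj_a); }
>   static void modify_a(void)     { strcpy(obj_a, "new"); }
>
>   int main(void)
>   {
>       clone_a_to_b();     /* 1- clone A -> B   (B = "old") */
>       modify_a();         /* 2- modify A       (A = "new") */
>
>       /* <osd crash>: both ops already reached disk, but the OSD can't
>        * tell how far it got, so on restart it replays from op 1. */
>       clone_a_to_b();     /* replayed clone: B becomes "new" */
>       modify_a();
>
>       printf("B = \"%s\" (should still be \"old\")\n", obj_b);
>       return 0;
>   }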
>
> For things like clone, skipping the operation if the target exists almost
> works, except for cases like
>
> 1- clone A -> B
> 2- modify A
> ...
> 3- delete B
>   <osd crash, replay from 1>
>
> (Although in that example who cares if B had bad data; it was removed
> anyway.)  The larger problem, though, is that this doesn't always work:
> CLONERANGE copies a range of a file from A to B, where B may already
> exist.
>
> In practice, the higher level interfaces don't make full use of the
> low-level interface, so it's possible some solution exists that carefully
> avoids the problem with a partial solution in the lower layer.  This makes
> me nervous, though, as it is easy to break.
>
> Another possibility (a rough code sketch follows the list):
>
>  - on non-btrfs, we set a xattr on every modified object with the
>   op_seq, the unique sequence number for the transaction.
>  - for any (potentially) non-idempotent operation, we fsync() before
>   continuing to the next transaction, to ensure that xattr hits disk.
>  - on replay, we skip a transaction if the xattr indicates we already
>   performed this transaction.
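>
> In rough C terms the guard could look like this (a sketch only; the
> xattr key, the function names, and the skipped error handling are all
> made up for illustration):
>
>   #include <fcntl.h>
>   #include <stdint.h>
>   #include <sys/xattr.h>
>   #include <unistd.h>
>
>   /* Tag each modified object with the op_seq that last touched it. */
>   static int already_applied(const char *path, uint64_t op_seq)
>   {
>       uint64_t stored = 0;
>       if (getxattr(path, "user.osd.op_seq", &stored, sizeof(stored)) < 0)
>           return 0;               /* no tag yet: not applied */
>       return stored >= op_seq;    /* replay can skip this one */
>   }
>
>   static void mark_applied(const char *path, uint64_t op_seq,
>                            int non_idempotent)
>   {
>       setxattr(path, "user.osd.op_seq", &op_seq, sizeof(op_seq), 0);
>       if (non_idempotent) {
>           int fd = open(path, O_RDONLY);
>           if (fd >= 0) {
>               fsync(fd);          /* tag must hit disk before we move on */
>               close(fd);
>           }
>       }
>   }
>
> On replay, each transaction would check already_applied() on its target
> object first and be skipped if the tag shows it has already run.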
>
> Because every 'transaction' only modifies a single object (file),
> this ought to work.  It'll make things like clone slow, but let's face it:
> they're already slow on non-btrfs file systems because they actually copy
> the data (instead of duplicating the extent refs in btrfs).  And it should
> make the full ObjectStore interface safe, without upper layers having to
> worry about the kinds and orders of transactions they perform.
>
> Other ideas?
>
> This issue is tracked at http://tracker.newdream.net/issues/213.
>
> sage
>
>