
List:       ceph-users
Subject:    Re: [ceph-users] RGW - Multisite setup -> question about Bucket - Sharding, limitations and synchron
From:       Eric Ivancich <ivancich@redhat.com>
Date:       2019-07-31 14:11:17
Message-ID: 5E49EC26-95B5-4641-B9FF-62F6B75434C0@redhat.com


> On Jul 30, 2019, at 7:49 AM, Mainor Daly <ceph@renitent-und-betrunken.de> wrote:
> 
> Hello,
> 
> (everything in context of S3)
> 
> 
> I'm currently trying to better understand bucket sharding in combination with a
> multisite RGW setup and its possible limitations.
> At the moment I understand that a bucket has a bucket index, which is a list of
> objects within the bucket.
> There are also indexless buckets, but those are not usable for cases like a
> multisite RGW bucket, where you need a [delayed] consistent relation/state
> between bucket n [zone a] and bucket n [zone b].
> Those bucket indexes are stored in "shards", and shards get distributed over the
> whole zone cluster for scaling purposes. Red Hat recommends a maximum of 102,400
> objects per shard and recommends this formula to determine the right shard count
> for a bucket:
> number of objects expected in a bucket / 100,000
> The max number of supported shards (or tested limit) is 7877 shards.

Back in 2017 this maximum number of shards changed to 65521. This change is in
luminous, mimic, and nautilus.
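
As a quick illustration (the bucket name "mybucket" below is just a placeholder, and
the output format varies a bit by release), you can check how full each bucket's
index shards are and queue a manual reshard if a bucket is getting close to the
100,000-objects-per-shard guideline:

    # shows num_objects, num_shards and a fill_status per bucket
    radosgw-admin bucket limit check

    # queue a manual reshard to, e.g., 23 shards, then process the queue
    radosgw-admin reshard add --bucket=mybucket --num-shards=23
    radosgw-admin reshard list
    radosgw-admin reshard process

Note that resharding in a multisite setup has extra caveats depending on your
release, so check the documentation for your version before doing it there.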

> That results in a total limit of 787,700,000 objects, as long as you want to stay
> in known and tested waters.
> Now some of the things I did not 100% understand:
> 
> = QUESTION 1 =
> 
> Does each bucket have its own shards? E.g.
> 
> Bucket 1 reaches its shard limit at 7877 shards; can I then create other buckets
> which start with their own fresh sets of shards? Or is it the other way around,
> which would mean all buckets save their index in the same shards, and if I reach
> the shard limit I need to create a second cluster?

Correct, each bucket has its own bucket index. And each bucket index can be sharded.
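
You can see this per-bucket layout directly (again, "mybucket" is a placeholder and
the exact naming can differ slightly between releases): the bucket's ID from
"bucket stats" is embedded in the names of that bucket's index shard objects,
roughly .dir.<bucket_id>.<shard_number>:

    # find the bucket's ID / marker
    radosgw-admin bucket stats --bucket=mybucket | grep -E '"id"|"marker"'

    # list only that bucket's index shard objects
    rados ls -p a.rgw.buckets.index | grep <bucket_id>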

> = QUESTION 2 =
> How are these shards distributed over the cluster? I expect they are just objects
> in the rgw.buckets.index pool, is that correct? So, these ones:
> rados ls -p a.rgw.buckets.index 
> .dir.3638e3a4-8dde-42ee-812a-f98e266548a4.274451.1
> .dir.3638e3a4-8dde-42ee-812a-f98e266548a4.87683.1
> .dir.3638e3a4-8dde-42ee-812a-f98e266548a4.64716.1
> .dir.3638e3a4-8dde-42ee-812a-f98e266548a4.78046.2

They are just objects and distributed via the CRUSH algorithm.
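
If you want to see where a given index shard object actually lands, something like
the following works (the object name is taken from your "rados ls" output above);
the index entries themselves are stored in the object's omap rather than its data:

    # show the PG and OSDs that CRUSH maps this shard object to
    ceph osd map a.rgw.buckets.index .dir.3638e3a4-8dde-42ee-812a-f98e266548a4.274451.1

    # list the index entries (object names) held in this shard
    rados -p a.rgw.buckets.index listomapkeys .dir.3638e3a4-8dde-42ee-812a-f98e266548a4.274451.1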

> = QUESTION 3 = 
> 
> 
> Do these bucket index shards have any relation to the RGW sync shards in an RGW
> multisite setup? E.g. if I have a ton of bucket index shards or buckets, does it
> have any impact on the sync shards?

They're separate.
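
The sync-related logs live in the zone's log pool rather than the bucket index
pool. With your zone names, something like this should show them on the cluster
hosting zone a (object names can vary a bit by release):

    rados ls -p a.rgw.log | grep -E '^(data_log|meta\.log)'

The data log shards show up as data_log.0 ... data_log.127 and the metadata log
shards as meta.log.<period_id>.0 ... meta.log.<period_id>.63, matching the 128 and
64 shard counts in your sync status output.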

> radosgw-admin sync status
>           realm f0019e09-c830-4fe8-a992-435e6f463b7c (mumu_1)
>       zonegroup 307a1bb5-4d93-4a01-af21-0d8467b9bdfe (EU_1)
>            zone 5a9c4d16-27a6-4721-aeda-b1a539b3d73a (b)
>   metadata sync syncing
>                 full sync: 0/64 shards              <= these ones I mean
>                 incremental sync: 64/64 shards
>                 metadata is caught up with master
>       data sync source: 3638e3a4-8dde-42ee-812a-f98e266548a4 (a)
>                 syncing
>                 full sync: 0/128 shards             <= and these ones
>                 incremental sync: 128/128 shards    <= and these ones
>                 data is caught up with source
> 
> 
> = QUESTION 4 = 
> (switching to sync shard related topics)
> 
> 
> What is the exact function and purpose of the sync shards? Do they implement any
> limit? E.g. maybe a maximum number of object entries waiting for synchronization
> to zone b.

They contain logs of items that need to be synced between zones. RGWs will look at
them and sync objects. These logs are sharded so different RGWs can take on
different shards and work on syncing in parallel.
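
If you're curious what those log entries look like, you can peek at a single shard
(shard 0 here is just an arbitrary example, and exact options may vary by release):

    # entries in data log shard 0, on the source zone's cluster
    radosgw-admin datalog list --shard-id=0

    # same idea for the metadata log
    radosgw-admin mdlog list --shard-id=0

Each entry essentially names a bucket shard or metadata object that changed, not
the object data itself; the actual objects are fetched from the source zone during
sync.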

> = QUESTION 5 = 
> Are those sync shards processed in parallel or sequentially? And where are those
> shards stored?

They're sharded to allow parallelism. At any given moment, each shard is claimed by
(locked by) one RGW. And each RGW may be claiming multiple shards. Collectively, all
RGWs are claiming all shards. Each RGW is syncing multiple shards in parallel and
all RGWs are doing this in parallel. So in some sense there are two levels of
parallelism.
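
The shard counts in your sync status output come from the defaults for these
settings (shown here only to illustrate where the 64 and 128 come from; changing
them on an established multisite setup is not something to do casually):

    # ceph.conf defaults relevant to sync log sharding
    rgw_md_log_max_shards = 64      # metadata log shards
    rgw_data_log_num_shards = 128   # data log shards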

> = QUESTION 6 = 
> As far as I have experienced, the sync process pretty much works like this:
> 
> 1.) The client sends an object or an operation to rados gateway A (RGW A)
> 2.) RGW A logs this operation into one of its sync shards and executes the
>     operation on its local storage pool
> 3.) RGW B checks via GET requests at a regular interval whether any new entries
>     have appeared in the RGW A log
> 4.) If a new entry exists, RGW B executes the operation on its local pool or
>     pulls the new object from RGW A
> 
> Did I understand that correctly? (For my rough description of this functionality,
> I want to apologize to the developers, who for sure invested much time and effort
> into the design and building of that sync process.)

That's about right.
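
If you want to watch that pull-based flow from the secondary side (zone b in your
case), these status commands show per-source progress; the zone name "a" is taken
from your output:

    radosgw-admin data sync status --source-zone=a
    radosgw-admin metadata sync status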

> And if I understood it correctly, what would the exact strategy look like in a
> multisite setup to resync e.g. a single bucket where one zone got corrupted and
> must be brought back into a synchronous state?

Be aware that there are full syncs and incremental syncs. Full syncs just copy
every object. Incremental syncs use logs to sync selectively. Perhaps Casey will
weigh in and discuss the state transitions.
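
For a single bucket, a starting point (bucket name is a placeholder, and the exact
recovery procedure depends on your release, so please verify it for your version
before running anything on production) would be to check that bucket's sync state
and, if necessary, re-initialize its sync so it goes through full sync again:

    # per-bucket view of full/incremental sync progress against each source zone
    radosgw-admin bucket sync status --bucket=mybucket

    # reset this bucket's sync state so it restarts with a full sync
    radosgw-admin bucket sync init --bucket=mybucket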

> Hope that's the correct place to ask such questions.
> 
> Best Regards,
> Daly


--
J. Eric Ivancich
he/him/his
Red Hat Storage
Ann Arbor, Michigan, USA
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


