
List:       ceph-users
Subject:    [ceph-users] RGW Multisite metadata sync init
From:       drakonstein at gmail.com (David Turner)
Date:       2017-08-31 16:46:00
Message-ID: <CAN-Gep+RaB=jmshJ32q3=tyxz6r_JJ3Y9CDGWBSA+0KtvUw+_g@mail.gmail.com>

All of the messages from the sync error list are listed below.  The number
on the left is how many times each error message appears.

   1811   "message": "failed to sync bucket instance: (16) Device or resource busy"
      7   "message": "failed to sync bucket instance: (5) Input\/output error"
     65   "message": "failed to sync object"
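
(For anyone wanting to reproduce a summary like this, a minimal sketch; the
grep/uniq pipeline is just one way to count the messages, and the output file
name is arbitrary:

    radosgw-admin sync error list > sync-errors.json
    grep '"message"' sync-errors.json | sort | uniq -c | sort -rn
)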

On Tue, Aug 29, 2017 at 10:00 AM Orit Wasserman <owasserm at redhat.com> wrote:

> 
> Hi David,
> 
> On Mon, Aug 28, 2017 at 8:33 PM, David Turner <drakonstein at gmail.com>
> wrote:
> 
> > The vast majority of the sync error list is "failed to sync bucket
> > instance: (16) Device or resource busy".  I can't find anything on Google
> > about this error message in relation to Ceph.  Does anyone have any idea
> > what this means and/or how to fix it?
> > 
> 
> Those are intermediate errors resulting from several radosgw instances
> trying to acquire the same sync log shard lease. They don't affect the sync
> progress.
> Are there any other errors?
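> 
> (Side note: if those benign lease-contention entries are drowning out
> anything real, the error log can be cleared so that new failures stand out.
> A sketch only; check `radosgw-admin help` for the exact arguments your
> version of the subcommand expects:
> 
>     radosgw-admin sync error trim
> )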
> 
> Orit
> 
> > 
> > On Fri, Aug 25, 2017 at 2:48 PM Casey Bodley <cbodley at redhat.com> wrote:
> > 
> > > Hi David,
> > > 
> > > The 'data sync init' command won't touch any actual object data, no.
> > > Resetting the data sync status will just cause a zone to restart a full
> > > sync of the --source-zone's data changes log. This log only lists which
> > > buckets/shards have changes in them, which causes radosgw to consider them
> > > for bucket sync. So while the command may silence the warnings about data
> > > shards being behind, it's unlikely to resolve the issue with missing
> > > objects in those buckets.
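> > > 
> > > If you do go that route, the sequence is roughly the following (a sketch
> > > only; the zone name is a placeholder, and the restart step assumes a
> > > systemd-based install):
> > > 
> > >     # run on the zone whose data sync status you want to reset
> > >     radosgw-admin data sync init --source-zone=<other-zone>
> > >     # the gateways usually need a restart before they act on the reset
> > >     systemctl restart ceph-radosgw.target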
> > > 
> > > When data sync is behind for an extended period of time, it's usually
> > > because it's stuck retrying previous bucket sync failures. The 'sync error
> > > list' may help narrow down where those failures are.
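> > > 
> > > Once the error list points at a particular bucket, its individual sync
> > > state can be checked directly.  A sketch, with "mybucket" and the zone
> > > name standing in for whatever the errors point at:
> > > 
> > >     radosgw-admin bucket sync status --bucket=mybucket --source-zone=<other-zone>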
> > > 
> > > There is also a 'bucket sync init' command to clear the bucket sync
> > > status. Following that with a 'bucket sync run' should restart a full sync
> > > on the bucket, pulling in any new objects that are present on the
> > > source-zone. I'm afraid that those commands haven't seen a lot of polish or
> > > testing, however.
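> > > 
> > > The invocations would look roughly like this (a sketch; "mybucket" and
> > > the zone name are placeholders):
> > > 
> > >     radosgw-admin bucket sync init --bucket=mybucket --source-zone=<other-zone>
> > >     radosgw-admin bucket sync run --bucket=mybucket --source-zone=<other-zone>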
> > > 
> > > Casey
> > > 
> > > On 08/24/2017 04:15 PM, David Turner wrote:
> > > 
> > > Apparently the data shards that are behind go in both directions, but
> > > only one zone is aware of the problem.  Each cluster has objects in their
> > > data pool that the other doesn't have.  I'm thinking about initiating a
> > > `data sync init` on both sides (one at a time) to get them back on the same
> > > page.  Does anyone know whether running `data sync init` on a zone will
> > > overwrite any local data that zone has but the other zone doesn't?
> > > 
> > > On Thu, Aug 24, 2017 at 1:51 PM David Turner <drakonstein at gmail.com>
> > > wrote:
> > > 
> > > > After restarting the 2 RGW daemons on the second site again, everything
> > > > caught up on the metadata sync.  Is there something about having 2 RGW
> > > > daemons on each side of the multisite that might be causing an issue with
> > > > the sync getting stale?  I have another realm set up the same way that is
> > > > having a hard time with its data shards being behind.  I haven't told them
> > > > to resync, but yesterday I noticed 90 shards were behind.  It's caught back
> > > > up to only 17 shards behind, but the oldest change not applied is 2 months
> > > > old and no order of restarting RGW daemons is helping to resolve this.
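> > > > 
> > > > (For anyone following along, the shard counts and the oldest-change
> > > > timestamp are the sort of thing reported by the commands below, run on
> > > > the zone that is behind; a sketch, with the source zone name as a
> > > > placeholder:
> > > > 
> > > >     radosgw-admin sync status
> > > >     radosgw-admin data sync status --source-zone=<master-zone>
> > > > )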
> > > > 
> > > > On Thu, Aug 24, 2017 at 10:59 AM David Turner <drakonstein at gmail.com>
> > > > wrote:
> > > > 
> > > > > I have an RGW multisite setup on 10.2.7 configured for bi-directional
> > > > > syncing.  It has been operational for 5 months and working fine.  I
> > > > > recently created a new user on the master zone, used that user to create
> > > > > a bucket, and uploaded an object with a public ACL into it.  The bucket
> > > > > was created on the second site, but the user was not, and the object
> > > > > errors out complaining that the access_key does not exist.
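> > > > > 
> > > > > (To confirm the user metadata never made it across, I would compare the
> > > > > two zones along these lines; a sketch, with <uid> standing in for the
> > > > > new user's id:
> > > > > 
> > > > >     # run against each zone's cluster
> > > > >     radosgw-admin metadata list user
> > > > >     radosgw-admin metadata get user:<uid>
> > > > > )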
> > > > > 
> > > > > That led me to think that the metadata isn't syncing, while bucket and
> > > > > data both are.  I've also confirmed that data is syncing for other buckets
> > > > > as well in both directions. The sync status from the second site was this.
> > > > > 
> > > > > 
> > > > >   metadata sync syncing
> > > > >                 full sync: 0/64 shards
> > > > >                 incremental sync: 64/64 shards
> > > > >                 metadata is caught up with master
> > > > >       data sync source: f4c12327-4721-47c9-a365-86332d84c227 (public-atl01)
> > > > >                         syncing
> > > > >                         full sync: 0/128 shards
> > > > >                         incremental sync: 128/128 shards
> > > > >                         data is caught up with source
> > > > > 
> > > > > 
> > > > > 
> > > > > The sync status leads me to think that the second site believes it is
> > > > > up to date, even though it is missing a freshly created user.  I
> > > > > restarted all of the rgw daemons for the zonegroup, but that didn't
> > > > > trigger anything to fix the missing user on the second site.  I did some
> > > > > googling, found the sync init commands mentioned in a few ML posts, ran
> > > > > metadata sync init, and now have this as the sync status.
> > > > > 
> > > > > 
> > > > >   metadata sync preparing for full sync
> > > > >                 full sync: 64/64 shards
> > > > >                 full sync: 0 entries to sync
> > > > >                 incremental sync: 0/64 shards
> > > > >                 metadata is behind on 70 shards
> > > > >                 oldest incremental change not applied: 2017-03-01 21:13:43.0.126971s
> > > > >       data sync source: f4c12327-4721-47c9-a365-86332d84c227 (public-atl01)
> > > > >                         syncing
> > > > >                         full sync: 0/128 shards
> > > > >                         incremental sync: 128/128 shards
> > > > >                         data is caught up with source
> > > > > 
> > > > > 
> > > > > 
> > > > > It definitely triggered a fresh sync and told it to forget about what
> > > > > it had previously applied, since the date of the oldest change not
> > > > > applied is the day we initially set up multisite for this zone.  The
> > > > > problem is that this was over 12 hours ago and the sync status hasn't
> > > > > caught up on any shards yet.
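> > > > > 
> > > > > (For reference, the per-shard metadata state can be inspected with
> > > > > `radosgw-admin metadata sync status`.  There is also a `metadata sync
> > > > > run` subcommand that drives the sync from the admin tool in the
> > > > > foreground; a sketch, to be used cautiously:
> > > > > 
> > > > >     radosgw-admin metadata sync status
> > > > >     radosgw-admin metadata sync run
> > > > > )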
> > > > > 
> > > > > Does anyone have any suggestions other than blasting the second site
> > > > > and setting it back up from scratch (the only option I can think of at
> > > > > this point)?
> > > > > 
> > > > > Thank you,
> > > > > David Turner
> > > > > 
> > > > 
> > > 
> > 
> > _______________________________________________
> > ceph-users mailing list
> > ceph-users at lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > 
> > 