List: ceph-users
Subject: [ceph-users] Re: Scrub stuck and 'pg has invalid (post-split) stat'
From: Eugen Block <eblock () nde ! ag>
Date: 2024-02-28 20:50:17
Message-ID: 20240228205017.Horde.zSuOnOBgAhT_BkDkjx9rQQ5 () webmail ! nde ! ag
Hi,
great that you found a solution. Maybe that will also help you get rid of
the cache tier entirely?
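For reference, decommissioning a cache tier generally means flushing it empty and then detaching it from the base pool. A minimal sketch, assuming the pool names `vms_cache` (cache) and `vms` (base) used here are only placeholders, not names confirmed by the thread:

```shell
# Stop admitting new objects, then flush/evict whatever is left
# (pool names vms_cache / vms are assumptions for illustration):
ceph osd tier cache-mode vms_cache proxy
rados -p vms_cache cache-flush-evict-all

# Once the cache pool is empty, disable it and detach it:
ceph osd tier cache-mode vms_cache none
ceph osd tier remove-overlay vms
ceph osd tier remove vms vms_cache
```

The eviction step is exactly where the original poster got stuck earlier, so this only works once clients no longer pin objects in the cache.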
Zitat von Cedric <yipikai7@gmail.com>:
> Hello,
>
> Sorry for the late reply. Yes, we finally found a solution, which
> was to split the cache_pool off onto dedicated OSDs. This cleared
> the slow ops and allowed the cluster to serve clients again after
> 5 days of lockdown. Fortunately, most of the VMs resumed fine,
> thanks to the virtio driver, which does not seem to have any
> timeout.
>
> It seems that at least one of the main culprits was storing both
> the cold and hot data pools on the same OSDs (which in the end
> makes total sense). Some other actions we took may also have had
> an effect; we are still trying to troubleshoot the root cause of
> the slow ops. Oddly, this was the 5th cluster we upgraded, and all
> of them have almost the same configuration, but this one handles
> 5x more workload.
>
> In the hope it could help.
>
> Cédric
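Splitting a cache pool onto dedicated OSDs, as described above, is typically done with a CRUSH device class and a matching rule. A hedged sketch, where the OSD ids, the class name `cache`, and the pool name `vms_cache` are all illustrative assumptions:

```shell
# Tag the dedicated OSDs with a custom device class
# (osd.70/osd.71 and the class name "cache" are hypothetical):
ceph osd crush rm-device-class osd.70 osd.71
ceph osd crush set-device-class cache osd.70 osd.71

# Create a replicated CRUSH rule restricted to that class:
ceph osd crush rule create-replicated cache_rule default host cache

# Point the cache pool at the new rule; Ceph then migrates its
# PGs off the OSDs shared with the cold data pool:
ceph osd pool set vms_cache crush_rule cache_rule
```

Separating the rules this way removes the hot/cold I/O contention on the shared OSDs, which matches the effect the poster reports.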
>
>> On 26 Feb 2024, at 10:57, Eugen Block <eblock@nde.ag> wrote:
>>
>> Hi,
>>
>> thanks for the context. Was there any progress over the weekend?
>> The hanging commands seem to be MGR related, and there's only one
>> in your cluster according to your output. Can you deploy a second
>> one manually, then adopt it with cephadm? Can you add 'ceph
>> versions' as well?
>>
>>
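Deploying a second MGR by hand and then adopting it could look roughly like the following. This is a sketch only; the daemon name `mon2` and paths are assumptions, and the exact steps depend on how far the cephadm conversion has progressed:

```shell
# Create a keyring and data dir for a second, manually run MGR
# (daemon name "mon2" is a placeholder):
mkdir -p /var/lib/ceph/mgr/ceph-mon2
ceph auth get-or-create mgr.mon2 \
    mon 'allow profile mgr' osd 'allow *' mds 'allow *' \
    -o /var/lib/ceph/mgr/ceph-mon2/keyring
ceph-mgr -i mon2

# Once it is up, convert the legacy daemon to a cephadm-managed one:
cephadm adopt --style legacy --name mgr.mon2

# Report the daemon versions across the cluster:
ceph versions
```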
>> Zitat von florian.leduc@socgen.com:
>>
>>> Hi,
>>> A bit of history might help to understand why we have the cache tier.
>>>
>>> We have been running OpenStack on top of Ceph for many years now
>>> (we started with Mimic, upgraded to Nautilus two years ago, and
>>> today upgraded to Pacific). At the beginning of the setup we had
>>> a mix of HDD+SSD devices in HCI mode for OpenStack Nova. After
>>> the upgrade to Nautilus, we did a hardware refresh with brand-new
>>> NVMe devices and transitioned from mixed devices to NVMe. But we
>>> were never able to evict all the data from the vms_cache pool
>>> (even with aggressive eviction; the last resort would have been
>>> to stop all the virtual instances, and that was not an option for
>>> our customers), so we decided to move on, set the cache-mode to
>>> proxy, and serve data from NVMe only. It has been like this for a
>>> year and a half.
>>>
>>> But today, after the upgrade, the situation is that we cannot
>>> query any stats ('ceph pg x.x query' and rados queries hang), and
>>> scrubs hang even though all PGs are "active+clean". The cluster
>>> reports no client activity, recovery, or rebalance. Some other
>>> commands hang as well, e.g. "ceph balancer status".
>>>
>>> --------------
>>> bash-4.2$ ceph -s
>>>   cluster:
>>>     id:     <fsid>
>>>     health: HEALTH_WARN
>>>             mon is allowing insecure global_id reclaim
>>>             noscrub,nodeep-scrub,nosnaptrim flag(s) set
>>>             18432 slow ops, oldest one blocked for 7626 sec, daemons
>>> [osd.0,osd.1,osd.10,osd.11,osd.112,osd.113,osd.118,osd.119,osd.120,osd.122]... have slow ops.
>>>
>>>   services:
>>>     mon: 3 daemons, quorum mon1,mon2,mon3 (age 36m)
>>>     mgr: bm9612541(active, since 39m)
>>>     osd: 72 osds: 72 up (since 97m), 72 in (since 9h)
>>>          flags noscrub,nodeep-scrub,nosnaptrim
>>>
>>>   data:
>>>     pools:   8 pools, 2409 pgs
>>>     objects: 14.64M objects, 92 TiB
>>>     usage:   276 TiB used, 143 TiB / 419 TiB avail
>>>     pgs:     2409 active+clean
>>> _______________________________________________
>>> ceph-users mailing list -- ceph-users@ceph.io
>>> To unsubscribe send an email to ceph-users-leave@ceph.io
>>
>>