List: ceph-users
Subject: [ceph-users] Re: ceph orch commands stuck
From: Burkhard Linke <Burkhard.Linke@computational.bio.uni-giessen.de>
Date: 2021-08-30 14:20:51
Message-ID: 48a07b81-941b-8e89-281e-351872416573@computational.bio.uni-giessen.de
Hi,
On 30.08.21 15:36, Oliver Weinmann wrote:
> Hi,
>
>
>
> we had one failed osd in our cluster that we have replaced. Since then
> the cluster is behaving very strange and some ceph commands like ceph
> crash or ceph orch are stuck.
Just two unrelated thoughts:
- Never run just two mons. If either of them fails, for whatever reason,
the whole cluster stops working: a quorum always requires _more_ than
half of the members. Use at least three mons for anything productive (or
five for the paranoid).
- This may be debatable, but do not use a separate cluster network in
such a tiny cluster. It makes the deployment considerably more complex
without a significant advantage. Keep it simple; use an LACP bond across
both 10G interfaces if possible.
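For the bond, something along these lines would replace the two
separate 10G links (nmcli syntax; the interface names and the address
are placeholders, adapt them to your hosts):

```shell
# Hypothetical LACP (802.3ad) bond over the two 10G ports.
# eth0/eth1 and 192.168.30.200/24 are placeholders -- adapt to your
# hosts, and note the switch side must be configured for LACP as well.
nmcli con add type bond ifname bond0 con-name bond0 \
    bond.options "mode=802.3ad,miimon=100"
nmcli con add type ethernet ifname eth0 master bond0
nmcli con add type ethernet ifname eth1 master bond0
nmcli con mod bond0 ipv4.method manual ipv4.addresses 192.168.30.200/24
nmcli con up bond0
```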
And on topic:
- Find out which daemons have crashed.
- You can try to reduce the size of the mon stores with a manual
compaction (I don't know how to do this in a container setup...).
- Consult the mon logs for hints as to why the store is growing.
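An untested sketch of those steps on the command line; the mon name is
taken from your `ceph -s` output:

```shell
# 1. Find out which daemons crashed and why:
ceph crash ls
ceph crash info <crash-id>    # details for one entry from the list
ceph crash archive-all        # clears the health warning once reviewed

# 2. Trigger a manual compaction of a mon's store:
ceph tell mon.gedasvl98 compact

# 3. In a cephadm/container setup the mon logs go to journald,
#    which cephadm can read for you on the mon's host:
cephadm logs --name mon.gedasvl98
```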
Regards,
Burkhard
>
>
>
> Cluster health:
>
>
>
> [root@gedasvl98 ~]# ceph -s
>   cluster:
>     id:     ec9e031a-cd10-11eb-a3c3-005056b7db1f
>     health: HEALTH_WARN
>             mons gedaopl03,gedasvl98 are using a lot of disk space
>             mon gedasvl98 is low on available space
>             2 daemons have recently crashed
>             911 slow ops, oldest one blocked for 62 sec, daemons [mon.gedaopl03,mon.gedasvl98] have slow ops.
>
>   services:
>     mon: 2 daemons, quorum gedasvl98,gedaopl03 (age 27m)
>     mgr: gedaopl01.fjpsnc(active, since 44m), standbys: gedaopl03.japugq
>     mds: 1/1 daemons up, 1 standby
>     osd: 9 osds: 9 up (since 27m), 9 in (since 2h)
>
>   data:
>     volumes: 1/1 healthy
>     pools:   10 pools, 289 pgs
>     objects: 7.19k objects, 39 GiB
>     usage:   118 GiB used, 7.7 TiB / 7.8 TiB avail
>     pgs:     289 active+clean
>
>   io:
>     client: 170 B/s rd, 170 B/s wr, 0 op/s rd, 0 op/s wr
>
>
>
> If I understand correctly, the mon containers using a lot of disk
> space could be due to the failed OSD and unclean PGs. The PGs are
> clean now, so I would expect the mons to free up disk space again. I
> have also restarted the active and standby mons, but nothing changed.
> Then I remembered that I recently changed the IPs of the ceph nodes
> using:
>
>
>
> ceph orch host set-addr gedaopl01 192.168.30.200
> ceph orch host set-addr gedaopl02 192.168.30.201
> ceph orch host set-addr gedaopl03 192.168.30.202
>
>
>
> This was mainly because I think I got it all wrong in the first place
> deploying the cluster using cephadm. Our nodes have 3 network ports:
>
>
>
> 1 x 1 GbE public network 172.28.4.x (used for OS deployment etc.)
>
> 1 x 10 GbE ceph cluster network 192.168.41.x
>
> 1 x 10 GbE ceph public network 192.168.30.x
>
>
>
> If I understood correctly, the IPs of the mons should be in the
> public network (192.168.30.x). Maybe the changes I made caused this
> trouble?
>
>
>
> Best Regards,
>
> Oliver
>
>
>
>
>
>
> _______________________________________________
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-leave@ceph.io