
List:       ceph-users
Subject:    [ceph-users] Re: ceph orch commands stuck
From:       Burkhard Linke <Burkhard.Linke@computational.bio.uni-giessen.de>
Date:       2021-08-30 14:20:51
Message-ID: 48a07b81-941b-8e89-281e-351872416573@computational.bio.uni-giessen.de

Hi,

On 30.08.21 15:36, Oliver Weinmann wrote:
> Hi,
>
>
>
> we had one failed osd in our cluster that we have replaced. Since then 
> the cluster has been behaving very strangely, and some ceph commands 
> like ceph crash or ceph orch are stuck.


Just two unrelated thoughts:


- never use two mons. If one of them fails for whatever reason, your 
whole cluster will stop working: a quorum always requires _more_ than 
half of the members, so with two mons both have to be up. Use at least 
three mons for anything productive (or five for the paranoid ones).


- this might be debatable, but do not use a separate cluster network in 
such a tiny cluster. It makes deployment a lot more complex without a 
significant advantage. Keep it simple: use a LACP bond covering both 
interfaces if possible.


And on topic:

- find out which daemons have crashed
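The crash module is the usual way to do that, if it responds at all (you wrote that ceph crash hangs, so the health output is a fallback). Standard Ceph CLI, sketched from memory:

```shell
# List recorded daemon crashes and inspect one of them
ceph crash ls
ceph crash info <crash-id>   # <crash-id> taken from the ls output

# If the crash module itself hangs, the health detail output
# also names the recently crashed daemons
ceph health detail
```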

- you can try to reduce the size of the mon stores by triggering a 
manual compaction (I don't know offhand how to do this in a container 
setup...)
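That said, it can at least be attempted from inside a cephadm shell, which drops you into a container with the ceph CLI. A sketch, using the mon names from your ceph -s output below; not verified on a containerized setup:

```shell
# Enter a container with the ceph CLI (cephadm deployments)
cephadm shell

# Ask each mon to compact its store
ceph tell mon.gedasvl98 compact
ceph tell mon.gedaopl03 compact
```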

- consult the mon logs for hints as to why the store is growing
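Under cephadm the logs usually end up in journald on the host running the daemon. Something along these lines (fsid taken from your ceph -s output below; the exact systemd unit name is an assumption about your layout):

```shell
# Logs of a specific daemon, via cephadm
cephadm logs --name mon.gedasvl98

# Or directly through journald (unit name includes the cluster fsid)
journalctl -u ceph-ec9e031a-cd10-11eb-a3c3-005056b7db1f@mon.gedasvl98
```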


Regards,

Burkhard

>
>
>
> Cluster health:
>
>
>
> [root@gedasvl98 ~]# ceph -s
>   cluster:
>     id:     ec9e031a-cd10-11eb-a3c3-005056b7db1f
>     health: HEALTH_WARN
>             mons gedaopl03,gedasvl98 are using a lot of disk space
>             mon gedasvl98 is low on available space
>             2 daemons have recently crashed
>             911 slow ops, oldest one blocked for 62 sec, daemons 
> [mon.gedaopl03,mon.gedasvl98] have slow ops.
>
>   services:
>     mon: 2 daemons, quorum gedasvl98,gedaopl03 (age 27m)
>     mgr: gedaopl01.fjpsnc(active, since 44m), standbys: gedaopl03.japugq
>     mds: 1/1 daemons up, 1 standby
>     osd: 9 osds: 9 up (since 27m), 9 in (since 2h)
>
>   data:
>     volumes: 1/1 healthy
>     pools:   10 pools, 289 pgs
>     objects: 7.19k objects, 39 GiB
>     usage:   118 GiB used, 7.7 TiB / 7.8 TiB avail
>     pgs:     289 active+clean
>
>   io:
>     client:   170 B/s rd, 170 B/s wr, 0 op/s rd, 0 op/s wr
>
>
>
> If I understand correctly, the reason for the mon containers using a 
> lot of disk space could be the failed osd and unclean pgs. But the 
> pgs are clean now, so I would expect the mons to free up disk space 
> again. I have also restarted the active and standby mons, but no 
> change. Then I remembered that I recently changed the IPs of the 
> ceph nodes using:
>
>
>
> ceph orch host set-addr gedaopl01 192.168.30.200
> ceph orch host set-addr gedaopl02 192.168.30.201
> ceph orch host set-addr gedaopl03 192.168.30.202
>
>
>
> This was mainly because I think I got it all wrong when initially 
> deploying the cluster using cephadm. Our nodes have 3 network ports:
>
>
>
> 1 x 1GB public network 172.28.4.x (used for OS deployment etc.)
>
> 1 x 10GB ceph cluster network 192.168.41.x
>
> 1 x 10GB ceph public network 192.168.30.x
>
>
>
> If I understood correctly, the mons' IPs should be on the public 
> network (192.168.30.x). Maybe the changes I made have caused this 
> trouble?
>
>
>
> Best Regards,
>
> Oliver
>
>
>
>
>
>
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-leave@ceph.io

