
List:       ceph-users
Subject:    [ceph-users] Re: Limiting osd or buffer/cache memory with Pacific/cephadm?
From:       Manuel Holtgrewe <zyklenfrei@gmail.com>
Date:       2021-09-29 5:45:20
Message-ID: <CACRvWAv8-RTUyfM_yFCKb6ZCFzujv+m6aHqLhzuNdoebhwpjZw@mail.gmail.com>

Dear Christian,

thank you for your reply. I think I found the cause of my problem (a leader
election loop, with the leader being evicted from quorum and re-joining
every ~5s) and could resolve it by rebooting the host. I don't know whether
that was the root cause, but I'm happy enough with the situation and can
provide more logs if that's helpful for others.

A bit of the history of the resolution:

I checked, and indeed memory pressure does not appear to be a problem; see
below for the output of a `tell ... heap stats` command. I'm running a
stock CentOS 7 kernel (3.10). I have 35 OSDs in this server with 250 GB of
RAM in total, and ~150 G is displayed as used, so the ~100 G of cache
should not hurt me, I understand now.

I had a closer look, including at the log output of the monitors. It
appears that the mon on my osd-2 server is regularly thrown out of quorum
and then joins back in as the leader. Leader election is started every 5
seconds or so. I'm seeing 100% CPU load for the mon processes. I dug around
the mailing list and the issue tracker, and it appears that similar
problems have occurred in the past on different versions and apparently for
different reasons.
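
For anyone hitting something similar: watching quorum status while tailing
the mon log on the affected host makes the loop quite visible. Something
like the following should work (mon.osd-2 is the affected daemon here; run
the cephadm command on that host):

# ceph mon stat
# ceph quorum_status --format json-pretty
# cephadm logs --name mon.osd-2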

I'm not seeing this problem on an identically built "mirror" cluster, which
also contradicts my original assumption that memory usage is the cause. The
mirror system does not have its file system mounted and no sync process is
running at the moment.

In the spirit of "have you tried turning it off and on again?", I rebooted
my osd-2 host and - the leader election loop disappeared. There is only one
leader election when the host comes back and that's it. Wonderful.
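
As an aside, for anyone wanting to do the same: setting the noout flag
around the reboot avoids unnecessary rebalancing while the host's OSDs are
down. Roughly:

# ceph osd set noout
... reboot the host and wait for its OSDs to come back up ...
# ceph osd unset noout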

Best wishes,
Manuel

--8<-- --8<-- mon.osd-2 log output --8<-- --8<--

Sep 29 07:12:56 osd-2 bash[13839]: debug 2021-09-29T05:12:56.868+0000
7f25cdf36700  1 heartbeat_map reset_timeout 'Monitor::cpu_tp thread
0x7f25cdf36700' had timed out after 0.000000000s
<repeats several times>
Sep 29 07:13:11 osd-2 bash[13839]: debug 2021-09-29T05:13:11.900+0000
7f25cff3a700  0 log_channel(cluster) log [INF] : mon.osd-2 is new leader,
mons osd-2,osd-5,osd-4,osd-3 in quorum (ranks 1,2,3,4)
Sep 29 07:13:12 osd-2 bash[13839]: debug 2021-09-29T05:13:12.291+0000
7f25cff3a700  0 mon.osd-2@1(leader) e11 handle_command
mon_command([{prefix=config-key set,
key=device/Micron_5300_MTFDDAK1T9TDS_202528E4AA91}] v 0) v1
Sep 29 07:13:12 osd-2 bash[13839]: debug 2021-09-29T05:13:12.291+0000
7f25cff3a700  0 mon.osd-2@1(leader) e11 handle_command
mon_command({"prefix":"config
rm","who":"mgr","name":"mgr/rbd_support/osd-2/mirror_snapshot_schedule"} v
0) v1
Sep 29 07:13:12 osd-2 bash[13839]: debug 2021-09-29T05:13:12.291+0000
7f25cff3a700  0 log_channel(audit) log [INF] : from='mgr.66140493
172.16.62.11:0/3686442004' entity='mgr.osd-2' cmd=[{"prefix":"config
rm","who":"mgr","name":"mgr/rbd_support/osd-2/mirror_snapshot_schedule"}]:
dispatch
Sep 29 07:13:12 osd-2 bash[13839]: debug 2021-09-29T05:13:12.292+0000
7f25cff3a700  0 mon.osd-2@1(leader) e11 handle_command
mon_command({"prefix":"config
rm","who":"mgr","name":"mgr/rbd_support/osd-2/trash_purge_schedule"} v 0) v1
Sep 29 07:13:12 osd-2 bash[13839]: debug 2021-09-29T05:13:12.292+0000
7f25cff3a700  0 log_channel(audit) log [INF] : from='mgr.66140493
172.16.62.11:0/3686442004' entity='mgr.osd-2' cmd=[{"prefix":"config
rm","who":"mgr","name":"mgr/rbd_support/osd-2/trash_purge_schedule"}]:
dispatch
Sep 29 07:13:12 osd-2 bash[13839]: debug 2021-09-29T05:13:12.546+0000
7f25cff3a700  0 mon.osd-2@1(leader) e11 handle_command
mon_command([{prefix=config-key set, key=mgr/telemetry/last_upload}] v 0) v1
Sep 29 07:13:13 osd-2 bash[13839]: debug 2021-09-29T05:13:13.414+0000
7f25d173d700  1 mon.osd-2@1(leader) e11 handle_auth_request failed to
assign global_id
<repeats several times>
Sep 29 07:13:21 osd-2 bash[13839]: debug 2021-09-29T05:13:17.596+0000
7f25cff3a700  0 mon.osd-2@1(leader) e11 handle_command
mon_command([{prefix=config-key set, key=mgr/cephadm/host.osd-5}] v 0) v1
Sep 29 07:13:21 osd-2 bash[13839]: debug 2021-09-29T05:13:21.612+0000
7f25cff3a700 -1 mon.osd-2@1(leader) e11 get_health_metrics reporting 49385
slow ops, oldest is mon_command([{prefix=config-key set,
key=mgr/cephadm/host.osd-5}] v 0)
Sep 29 07:13:21 osd-2 bash[13839]: debug 2021-09-29T05:13:21.634+0000
7f25cd735700  0 log_channel(cluster) log [WRN] : Health detail: HEALTH_WARN
1 clients failing to respond to capability release; 1 MDSs report slow
requests; 1/5 mons down, quorum osd-1,osd-5,osd-4,osd-3
Sep 29 07:13:21 osd-2 bash[13839]: debug 2021-09-29T05:13:21.635+0000
7f25cd735700  0 log_channel(cluster) log [WRN] : [WRN]
MDS_CLIENT_LATE_RELEASE: 1 clients failing to respond to capability release
Sep 29 07:13:21 osd-2 bash[13839]: debug 2021-09-29T05:13:21.635+0000
7f25cd735700  0 log_channel(cluster) log [WRN] :
mds.cephfs.osd-1.qkzuas(mds.0): Client gw-1 failing to respond to
capability release client_id: 66086277
Sep 29 07:13:21 osd-2 bash[13839]: debug 2021-09-29T05:13:21.635+0000
7f25cd735700  0 log_channel(cluster) log [WRN] : [WRN] MDS_SLOW_REQUEST: 1
MDSs report slow requests
Sep 29 07:13:21 osd-2 bash[13839]: debug 2021-09-29T05:13:21.635+0000
7f25cd735700  0 log_channel(cluster) log [WRN] :
mds.cephfs.osd-1.qkzuas(mds.0): 8 slow requests are blocked > 30 secs
Sep 29 07:13:21 osd-2 bash[13839]: debug 2021-09-29T05:13:21.636+0000
7f25cd735700  0 log_channel(cluster) log [WRN] : [WRN] MON_DOWN: 1/5 mons
down, quorum osd-1,osd-5,osd-4,osd-3
Sep 29 07:13:21 osd-2 bash[13839]: debug 2021-09-29T05:13:21.636+0000
7f25cd735700  0 log_channel(cluster) log [WRN] :     mon.osd-2 (rank 1)
addr [v2:172.16.62.11:3300/0,v1:172.16.62.11:6789/0] is down (out of quorum)
Sep 29 07:13:21 osd-2 bash[13839]: debug 2021-09-29T05:13:21.636+0000
7f25cd735700  0 log_channel(cluster) log [INF] : mon.osd-2 calling monitor
election
Sep 29 07:13:21 osd-2 bash[13839]: debug 2021-09-29T05:13:21.636+0000
7f25cd735700  1 paxos.1).electionLogic(21993) init, last seen epoch 21993,
mid-election, bumping
Sep 29 07:13:21 osd-2 bash[13839]: debug 2021-09-29T05:13:21.646+0000
7f25cd735700  1 mon.osd-2@1(electing) e11 collect_metadata md127:  no
unique device id for md127: fallback method has no model nor serial'
Sep 29 07:13:21 osd-2 bash[13839]: debug 2021-09-29T05:13:21.651+0000
7f25d173d700  1 mon.osd-2@1(electing) e11 handle_auth_request failed to
assign global_id
Sep 29 07:13:21 osd-2 bash[13839]: debug 2021-09-29T05:13:21.683+0000
7f25cd735700  1 mon.osd-2@1(electing) e11 collect_metadata md127:  no
unique device id for md127: fallback method has no model nor serial'
Sep 29 07:13:21 osd-2 bash[13839]: debug 2021-09-29T05:13:21.725+0000
7f25ce737700  1 heartbeat_map reset_timeout 'Monitor::cpu_tp thread
0x7f25ce737700' had timed out after 0.000000000s
<repeats several times>
Sep 29 07:13:26 osd-2 bash[13839]: debug 2021-09-29T05:13:26.865+0000
7f25cff3a700  1 mon.osd-2@1(peon).osd e491238 _set_new_cache_sizes
cache_size:134217728 inc_alloc: 67108864 full_alloc: 67108864 kv_alloc:
67108864
Sep 29 07:13:26 osd-2 bash[13839]: debug 2021-09-29T05:13:26.875+0000
7f25cff3a700 -1 mon.osd-2@1(peon) e11 get_health_metrics reporting 49411
slow ops, oldest is mon_command([{prefix=config-key set,
key=mgr/cephadm/host.osd-5}] v 0)

--8<-- --8<-- ceph -s output --8<-- --8<--

# ceph -s
  cluster:
    id:     55633ec3-6c0c-4a02-990c-0f87e0f7a01f
    health: HEALTH_WARN
            1 clients failing to respond to capability release
            1 MDSs report slow requests
            1/5 mons down, quorum osd-1,osd-5,osd-4,osd-3

  services:
    mon: 5 daemons, quorum osd-1,osd-2,osd-5,osd-4,osd-3 (age 2s)
    mgr: osd-2(active, since 13h), standbys: osd-3, osd-5.jcfyqe, osd-1,
osd-4.oylrhe
    mds: 1/1 daemons up, 1 standby
    osd: 180 osds: 180 up (since 19h), 164 in (since 4d)
    rgw: 12 daemons active (6 hosts, 2 zones)

  data:
    volumes: 1/1 healthy
    pools:   14 pools, 5504 pgs
    objects: 263.11M objects, 944 TiB
    usage:   1.4 PiB used, 659 TiB / 2.0 PiB avail
    pgs:     5492 active+clean
             11   active+clean+scrubbing+deep
             1    active+clean+scrubbing

  io:
    client:   39 KiB/s rd, 7.0 MiB/s wr, 52 op/s rd, 57 op/s wr

--8<-- --8<-- heap stats output --8<-- --8<--

# uname -a
Linux osd-2 3.10.0-1160.42.2.el7.x86_64 #1 SMP Tue Sep 7 14:49:57 UTC 2021
x86_64 x86_64 x86_64 GNU/Linux
# ceph tell osd.6 heap stats
osd.6 tcmalloc heap stats:------------------------------------------------
MALLOC:     2482344872 ( 2367.3 MiB) Bytes in use by application
MALLOC: +            0 (    0.0 MiB) Bytes in page heap freelist
MALLOC: +    201900496 (  192.5 MiB) Bytes in central cache freelist
MALLOC: +      8619008 (    8.2 MiB) Bytes in transfer cache freelist
MALLOC: +     42731144 (   40.8 MiB) Bytes in thread cache freelists
MALLOC: +     15335424 (   14.6 MiB) Bytes in malloc metadata
MALLOC:   ------------
MALLOC: =   2750930944 ( 2623.5 MiB) Actual memory used (physical + swap)
MALLOC: +    601866240 (  574.0 MiB) Bytes released to OS (aka unmapped)
MALLOC:   ------------
MALLOC: =   3352797184 ( 3197.5 MiB) Virtual address space used
MALLOC:
MALLOC:         165291              Spans in use
MALLOC:             40              Thread heaps in use
MALLOC:           8192              Tcmalloc page size
------------------------------------------------
Call ReleaseFreeMemory() to release freelist memory to the OS (via
madvise()).
Bytes released to the OS take up virtual address space but no physical
memory.
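
If the tcmalloc freelists ever do grow large, the release can also be
triggered via the CLI; as far as I know this corresponds to the
ReleaseFreeMemory() call mentioned above:

# ceph tell osd.6 heap release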

On Wed, Sep 29, 2021 at 12:16 AM Christian Wuerdig <
christian.wuerdig@gmail.com> wrote:

> buff/cache is the Linux kernel buffer and page cache which is
> unrelated to the ceph bluestore cache. Check the memory consumption of
> your individual OSD processes to confirm. Top also says 132GB
> available (since buffers and page cache entries will be dropped
> automatically if processes need more RAM) so your output certainly
> does not indicate memory pressure (quite the opposite actually).
> You can also inspect the memory buffers for individual OSDs with
>
> ceph daemon osd.X dump_mempools
> and
> ceph tell osd.X heap stats
>
> On Wed, 29 Sept 2021 at 09:47, Manuel Holtgrewe <zyklenfrei@gmail.com>
> wrote:
> >
> > Hello,
> >
> > I'm running Ceph Pacific OSD servers that are orchestrated by cephadm
> > (on docker.io v20.10.8 on CentOS 7.9). The servers are a bit ... lower
> > equipped than others when it comes to memory per OSD.
> >
> > Earlier, we were able to accommodate for this by using the following
> > /etc/ceph/ceph.conf setting:
> >
> > [osd]
> > osd memory target = 2147483648
> >
> > However, since I have switched to cephadm and Pacific (I guess this issue
> > is more related to the former than the latter), I'm seeing top/htop
> > output that indicates that 50% of my memory is used by processes and the
> > other 50% is used by "buff/cache".
> >
> > # free -h
> >               total        used        free      shared  buff/cache   available
> > Mem:           251G        105G         24G        1.5G        121G        132G
> > Swap:          7.8G         40M        7.7G
> >
> > I'm seeing issues such as mons slowing down and going out of quorum, which
> > I saw earlier when memory was tight. Thus, I'm assuming that memory is the
> > issue here again...
> >
> > Thanks,
> > Manuel
> > _______________________________________________
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-leave@ceph.io
>
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-leave@ceph.io