List:       ceph-users
Subject:    [ceph-users] Re: [ext] Re: Re: kernel client osdc ops stuck and mds slow reqs
From:       Ilya Dryomov <idryomov@gmail.com>
Date:       2023-02-23 16:03:46
Message-ID: CAOi1vP-kVzD_XUREae3huHnbvLYmNnh_ku9Qn7Y8dccxoQ0dAQ@mail.gmail.com

On Thu, Feb 23, 2023 at 3:31 PM Kuhring, Mathias
<mathias.kuhring@bih-charite.de> wrote:
> 
> Hey Ilya,
> 
> I'm not sure if the things I find in the logs are actually related or useful.
> But I'm not really sure if I'm looking in the right places.
> 
> I enabled "debug_ms 1" for the OSDs as suggested above.
> But this filled up our host disks pretty fast, leading to e.g. monitors crashing.
> I disabled the debug messages again and trimmed the logs to free up space.
> But I made copies of the log files of two OSDs which were involved in another
> capability release / slow requests issue. They are quite big now (~3 GB), and
> even if I remove things like the ping stuff, I have more than 1 million lines
> just for the morning until the disk space was full (around 7 hours). So now
> I'm wondering how to filter for / look for the right things here.
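
For what it's worth, a minimal sketch of how this can be kept manageable,
assuming a containerized deployment where the OSD log is the JSON-wrapped
container log (the OSD id, file names and filter patterns below are only
illustrative):

    # Toggle the debug level per OSD at runtime, no restart needed:
    ceph tell osd.90 config set debug_ms 1
    ceph tell osd.90 config set debug_ms 0   # revert as soon as the interesting window has passed

    # Each container log line is JSON; extract the message text, drop the
    # osd_ping chatter and keep only suspicious lines before digging in:
    jq -r '.log' ceph-osd.90.log | grep -v osd_ping | grep -E 'error|failed|slow' > osd.90.filtered.log
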
> When I grep for "error", I get a few of these messages:
> {"log":"debug 2023-02-22T06:18:08.113+0000 7f15c5fff700  1 -- \
> [v2:192.168.1.13:6881/4149819408,v1:192.168.1.13:6884/4149819408] \u003c== osd.161 \
> v2:192.168.1.31:6835/1012436344 182573 ==== pg_update_log_missing(3.1a6s2 epoch \
> 646235/644895 rep_tid 1014320 entries 646235'7672108 (0'0) error    \
> 3:65836dde:::10016e9b7c8.00000000:head by mds.0.1221974:8515830 0.000000 -2 \
> ObjectCleanRegions clean_offsets: [0~18446744073709551615], clean_omap: 1, \
> new_object: 0 trim_to 646178'7662340 roll_forward_to 646192'7672106) v3 ==== \
> 261+0+0 (crc 0 0 0) 0x562d55e52380 con \
> 0x562d8a2de400\n","stream":"stderr","time":"2023-02-22T06:18:08.115002765Z"} 
> And if I grep for "failed", I get a couple of those:
> {"log":"debug 2023-02-22T06:15:25.242+0000 7f58bbf7c700  1 -- \
> [v2:172.16.62.11:6829/3509070161,v1:172.16.62.11:6832/3509070161] \u003e\u003e \
> 172.16.62.10:0/3127362489 conn(0x55ba06bf3c00 msgr2=0x55b9ce07e580 crc :-1 \
> s=STATE_CONNECTION_ESTABLISHED l=1).read_until read \
> failed\n","stream":"stderr","time":"2023-02-22T06:15:25.243808392Z"} {"log":"debug \
> 2023-02-22T06:15:25.242+0000 7f58bbf7c700  1 --2- \
> [v2:172.16.62.11:6829/3509070161,v1:172.16.62.11:6832/3509070161] \u003e\u003e \
> 172.16.62.10:0/3127362489 conn(0x55ba06bf3c00 0x55b9ce07e580 crc :-1 s=READY \
> pgs=2096664 cs=0 l=1 rev1=1 crypto rx=0 tx=0 comp rx=0 \
> tx=0).handle_read_frame_preamble_main read frame preamble failed r=-1 ((1) \
> Operation not permitted)\n","stream":"stderr","time":"2023-02-22T06:15:25.243813528Z"}
>  
> Not sure if they are related to the issue.
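
If it helps to check whether that pg_update_log_missing entry points at the
same PG or objects as the stuck requests, the pg id and object name from the
log line can be looked up directly; a rough sketch (the pool name is a
placeholder):

    # PG state for pool 3, pg 1a6 (the EC shard suffix is dropped):
    ceph pg 3.1a6 query | jq -r '.state'

    # Which OSDs currently serve the object named in the error
    # (<cephfs-data-pool> is a placeholder for the actual data pool name):
    ceph osd map <cephfs-data-pool> 10016e9b7c8.00000000
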
> 
> In the kernel logs of the client (dmesg, journalctl or /var/log/messages),
> there seem to be no errors or stack traces in the relevant time periods.

Hi Mathias,

Then it's very unlikely to be a kernel client issue, meaning that you
don't need to worry about your kernel versions.
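
If you still want to cross-check from the client side, the in-flight requests
are visible under debugfs; a rough sketch, assuming debugfs is mounted (the
directory name is <fsid>.client<id>, and the ids below are made up):

    # On the client node, as root:
    cat /sys/kernel/debug/ceph/*/osdc   # in-flight OSD requests: tid, target osd, pg, object
    cat /sys/kernel/debug/ceph/*/mdsc   # in-flight MDS requests

    # The client id from the directory name plus a tid from osdc form the
    # request id that appears in the OSD's debug_ms log, e.g.:
    jq -r '.log' ceph-osd.90.log | grep 'client\.88441657\.0:123456'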

> The only thing I can see is our restart of the relevant OSDs:
> [Mi Feb 22 07:29:59 2023] libceph: osd90 down
> [Mi Feb 22 07:30:34 2023] libceph: osd90 up
> [Mi Feb 22 07:31:55 2023] libceph: osd93 down
> [Mi Feb 22 08:37:50 2023] libceph: osd93 up
> 
> I noticed a socket closed for another client, but I assume that's more related to
> monitors failing due to full disks:
> [Mi Feb 22 05:59:52 2023] libceph: mon2 (1)172.16.62.12:6789 socket closed (con state OPEN)
> [Mi Feb 22 05:59:52 2023] libceph: mon2 (1)172.16.62.12:6789 session lost, hunting for new mon
> [Mi Feb 22 05:59:52 2023] libceph: mon3 (1)172.16.62.13:6789 session established

Yeah, these are expected when a monitor or an OSD goes down.
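
If you ever want to confirm that, the client's monitor session state is in the
same debugfs directory, and quorum can be checked from any admin node (purely
illustrative):

    cat /sys/kernel/debug/ceph/*/monc   # monmap epoch and pending monitor requests
    ceph mon stat                       # current quorum members and leader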

Thanks,

                Ilya
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-leave@ceph.io

