
List:       ceph-users
Subject:    [ceph-users] osd out, can't bring it back online
From:       Oliver Weinmann <oliver.weinmann () me ! com>
Date:       2020-11-30 14:55:19
Message-ID: af8a224f-c208-482a-aade-967d6718f7c7 () me ! com

Hi,



I'm still evaluating Ceph 15.2.5 in a lab, so the problem is not really hurting me, but I want to understand it and hopefully fix it. It is good practice. To test the resilience of the cluster I try to break it by doing all kinds of things. Today I powered off (clean shutdown) one osd node and powered it back on. Last time I tried this there was no problem getting it back online; after a few minutes the cluster health was back to OK. This time it stayed degraded forever. I checked and noticed that the service osd.0 on the osd node was failing. So I searched around and people recommended simply deleting the osd and re-creating it. I tried that and still can't get the osd back in service.
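For reference, this is roughly how I looked at the failing service on the osd node (gedaopl02); the systemd unit name just follows the usual cephadm ceph-<fsid>@osd.0 pattern, so adjust if yours differs:

# check how cephadm sees the daemon on the host
cephadm ls | grep -A 5 '"osd.0"'
# check the systemd unit and its recent logs (unit name = ceph-<fsid>@osd.0)
systemctl status ceph-d0920c36-2368-11eb-a5de-005056b703af@osd.0.service
journalctl -u ceph-d0920c36-2368-11eb-a5de-005056b703af@osd.0.service -n 100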



First I removed the osd:



[root@gedasvl02 ~]# ceph osd out 0
INFO:cephadm:Inferring fsid d0920c36-2368-11eb-a5de-005056b703af
INFO:cephadm:Inferring config /var/lib/ceph/d0920c36-2368-11eb-a5de-005056b703af/mon.gedasvl02/config
INFO:cephadm:Using recent ceph image docker.io/ceph/ceph:v15
osd.0 is already out.
[root@gedasvl02 ~]# ceph auth del 0
INFO:cephadm:Inferring fsid d0920c36-2368-11eb-a5de-005056b703af
INFO:cephadm:Inferring config /var/lib/ceph/d0920c36-2368-11eb-a5de-005056b703af/mon.gedasvl02/config
INFO:cephadm:Using recent ceph image docker.io/ceph/ceph:v15
Error EINVAL: bad entity name
[root@gedasvl02 ~]# ceph auth del osd.0
INFO:cephadm:Inferring fsid d0920c36-2368-11eb-a5de-005056b703af
INFO:cephadm:Inferring config /var/lib/ceph/d0920c36-2368-11eb-a5de-005056b703af/mon.gedasvl02/config
INFO:cephadm:Using recent ceph image docker.io/ceph/ceph:v15
updated
[root@gedasvl02 ~]# ceph osd rm 0
INFO:cephadm:Inferring fsid d0920c36-2368-11eb-a5de-005056b703af
INFO:cephadm:Inferring config /var/lib/ceph/d0920c36-2368-11eb-a5de-005056b703af/mon.gedasvl02/config
INFO:cephadm:Using recent ceph image docker.io/ceph/ceph:v15
removed osd.0
[root@gedasvl02 ~]# ceph osd tree
INFO:cephadm:Inferring fsid d0920c36-2368-11eb-a5de-005056b703af
INFO:cephadm:Inferring config /var/lib/ceph/d0920c36-2368-11eb-a5de-005056b703af/mon.gedasvl02/config
INFO:cephadm:Using recent ceph image docker.io/ceph/ceph:v15
ID  CLASS  WEIGHT   TYPE NAME           STATUS  REWEIGHT  PRI-AFF
-1         0.43658  root default
-7         0.21829      host gedaopl01
 2    ssd  0.21829          osd.2           up   1.00000  1.00000
-3               0      host gedaopl02
-5         0.21829      host gedaopl03
 3    ssd  0.21829          osd.3           up   1.00000  1.00000



Looks ok, it's gone...
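(As an aside, I believe the same removal can be done in one step with purge, which combines the crush remove, auth del and osd rm; I only found that afterwards:)

# purge removes the osd from the crush map, deletes its auth key and removes it from the osd map
ceph osd purge 0 --yes-i-really-mean-it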



Then I zapped it:



[root@gedasvl02 ~]# ceph orch device zap gedaopl02 /dev/sdb --force
INFO:cephadm:Inferring fsid d0920c36-2368-11eb-a5de-005056b703af
INFO:cephadm:Inferring config /var/lib/ceph/d0920c36-2368-11eb-a5de-005056b703af/mon.gedasvl02/config
INFO:cephadm:Using recent ceph image docker.io/ceph/ceph:v15
INFO:cephadm:/usr/bin/podman:stderr WARNING: The same type, major and minor should not be used for multiple devices.
INFO:cephadm:/usr/bin/podman:stderr --> Zapping: /dev/sdb
INFO:cephadm:/usr/bin/podman:stderr --> Zapping lvm member /dev/sdb. lv_path is /dev/ceph-3bf1bb28-0858-4464-a848-d7f56319b40a/osd-block-3a79800d-2a19-45d8-a850-82c6a8113323
INFO:cephadm:/usr/bin/podman:stderr Running command: /usr/bin/dd if=/dev/zero of=/dev/ceph-3bf1bb28-0858-4464-a848-d7f56319b40a/osd-block-3a79800d-2a19-45d8-a850-82c6a8113323 bs=1M count=10 conv=fsync
INFO:cephadm:/usr/bin/podman:stderr  stderr: 10+0 records in
INFO:cephadm:/usr/bin/podman:stderr 10+0 records out
INFO:cephadm:/usr/bin/podman:stderr 10485760 bytes (10 MB, 10 MiB) copied, 0.0314447 s, 333 MB/s
INFO:cephadm:/usr/bin/podman:stderr  stderr:
INFO:cephadm:/usr/bin/podman:stderr --> Only 1 LV left in VG, will proceed to destroy volume group ceph-3bf1bb28-0858-4464-a848-d7f56319b40a
INFO:cephadm:/usr/bin/podman:stderr Running command: /usr/sbin/vgremove -v -f ceph-3bf1bb28-0858-4464-a848-d7f56319b40a
INFO:cephadm:/usr/bin/podman:stderr  stderr: Removing ceph--3bf1bb28--0858--4464--a848--d7f56319b40a-osd--block--3a79800d--2a19--45d8--a850--82c6a8113323 (253:0)
INFO:cephadm:/usr/bin/podman:stderr  stderr: Archiving volume group "ceph-3bf1bb28-0858-4464-a848-d7f56319b40a" metadata (seqno 5).
INFO:cephadm:/usr/bin/podman:stderr  stderr: Releasing logical volume "osd-block-3a79800d-2a19-45d8-a850-82c6a8113323"
INFO:cephadm:/usr/bin/podman:stderr  stderr: Creating volume group backup "/etc/lvm/backup/ceph-3bf1bb28-0858-4464-a848-d7f56319b40a" (seqno 6).
INFO:cephadm:/usr/bin/podman:stderr  stdout: Logical volume "osd-block-3a79800d-2a19-45d8-a850-82c6a8113323" successfully removed
INFO:cephadm:/usr/bin/podman:stderr  stderr: Removing physical volume "/dev/sdb" from volume group "ceph-3bf1bb28-0858-4464-a848-d7f56319b40a"
INFO:cephadm:/usr/bin/podman:stderr  stdout: Volume group "ceph-3bf1bb28-0858-4464-a848-d7f56319b40a" successfully removed
INFO:cephadm:/usr/bin/podman:stderr Running command: /usr/bin/dd if=/dev/zero of=/dev/sdb bs=1M count=10 conv=fsync
INFO:cephadm:/usr/bin/podman:stderr  stderr: 10+0 records in
INFO:cephadm:/usr/bin/podman:stderr 10+0 records out
INFO:cephadm:/usr/bin/podman:stderr  stderr: 10485760 bytes (10 MB, 10 MiB) copied, 0.0355641 s, 295 MB/s
INFO:cephadm:/usr/bin/podman:stderr --> Zapping successful for: <Raw Device: /dev/sdb>
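Before re-adding it, I suppose one can double-check that the orchestrator now sees the disk as available again by refreshing the device inventory:

# refresh the device inventory for the host and check that /dev/sdb shows up as available
ceph orch device ls gedaopl02 --refresh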



And re-added it:



[root@gedasvl02 ~]# ceph orch daemon add osd gedaopl02:/dev/sdb
INFO:cephadm:Inferring fsid d0920c36-2368-11eb-a5de-005056b703af
INFO:cephadm:Inferring config /var/lib/ceph/d0920c36-2368-11eb-a5de-005056b703af/mon.gedasvl02/config
INFO:cephadm:Using recent ceph image docker.io/ceph/ceph:v15
Created osd(s) 0 on host 'gedaopl02'



But the osd is still out...



[root@gedasvl02 ~]# ceph osd tree
INFO:cephadm:Inferring fsid d0920c36-2368-11eb-a5de-005056b703af
INFO:cephadm:Inferring config /var/lib/ceph/d0920c36-2368-11eb-a5de-005056b703af/mon.gedasvl02/config
INFO:cephadm:Using recent ceph image docker.io/ceph/ceph:v15
ID  CLASS  WEIGHT   TYPE NAME           STATUS  REWEIGHT  PRI-AFF
-1         0.43658  root default
-7         0.21829      host gedaopl01
 2    ssd  0.21829          osd.2           up   1.00000  1.00000
-3               0      host gedaopl02
-5         0.21829      host gedaopl03
 3    ssd  0.21829          osd.3           up   1.00000  1.00000
 0               0  osd.0                 down         0  1.00000
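What strikes me is that the re-created osd.0 sits outside the gedaopl02 host bucket with crush weight 0, so even if it came up it wouldn't receive any data. If the daemon simply isn't running, I guess the next things to try would be along these lines (the 0.21829 weight just mirrors the other SSDs, and the crush set command is my assumption about how to put it back under the host):

# see whether cephadm even deployed/started the daemon on the host
ceph orch ps gedaopl02
# try restarting the osd daemon
ceph orch daemon restart osd.0
# put osd.0 back under its host bucket with the same weight as the other SSDs
ceph osd crush set osd.0 0.21829 host=gedaopl02
# mark it in again once it is up
ceph osd in 0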



Looking at the cluster log in the web UI I see the following error:



Failed to apply osd.dashboard-admin-1606745745154 spec DriveGroupSpec(name=dashboard-admin-1606745745154->placement=PlacementSpec(host_pattern='*'), service_id='dashboard-admin-1606745745154', service_type='osd', data_devices=DeviceSelection(size='223.6GB', rotational=False, all=False), osd_id_claims={}, unmanaged=False, filter_logic='AND', preview_only=False): No filters applied
Traceback (most recent call last):
  File "/usr/share/ceph/mgr/cephadm/module.py", line 2108, in _apply_all_services
    if self._apply_service(spec):
  File "/usr/share/ceph/mgr/cephadm/module.py", line 2005, in _apply_service
    self.osd_service.create_from_spec(cast(DriveGroupSpec, spec))
  File "/usr/share/ceph/mgr/cephadm/services/osd.py", line 43, in create_from_spec
    ret = create_from_spec_one(self.prepare_drivegroup(drive_group))
  File "/usr/share/ceph/mgr/cephadm/services/osd.py", line 127, in prepare_drivegroup
    drive_selection = DriveSelection(drive_group, inventory_for_host)
  File "/lib/python3.6/site-packages/ceph/deployment/drive_selection/selector.py", line 32, in __init__
    self._data = self.assign_devices(self.spec.data_devices)
  File "/lib/python3.6/site-packages/ceph/deployment/drive_selection/selector.py", line 138, in assign_devices
    if not all(m.compare(disk) for m in FilterGenerator(device_filter)):
  File "/lib/python3.6/site-packages/ceph/deployment/drive_selection/selector.py", line 138, in <genexpr>
    if not all(m.compare(disk) for m in FilterGenerator(device_filter)):
  File "/lib/python3.6/site-packages/ceph/deployment/drive_selection/matchers.py", line 410, in compare
    raise Exception("No filters applied")
Exception: No filters applied
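If I read that right, the spec it trips over is the OSD service the dashboard created (dashboard-admin-1606745745154), whose size='223.6GB' filter apparently no longer matches anything. My guess would be to inspect that service spec and, if it is no longer wanted, remove it, roughly like this:

# dump the osd service specs cephadm is trying to apply
ceph orch ls osd --export
# remove the stale dashboard-created spec so cephadm stops retrying it
ceph orch rm osd.dashboard-admin-1606745745154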


I also have a "pgs undersized" warning; maybe this is causing trouble too?



[root@gedasvl02 ~]# ceph -s
INFO:cephadm:Inferring fsid d0920c36-2368-11eb-a5de-005056b703af
INFO:cephadm:Inferring config /var/lib/ceph/d0920c36-2368-11eb-a5de-005056b703af/mon.gedasvl02/config
INFO:cephadm:Using recent ceph image docker.io/ceph/ceph:v15
  cluster:
    id:     d0920c36-2368-11eb-a5de-005056b703af
    health: HEALTH_WARN
            Degraded data redundancy: 13142/39426 objects degraded (33.333%), 176 pgs degraded, 225 pgs undersized

  services:
    mon: 1 daemons, quorum gedasvl02 (age 2w)
    mgr: gedasvl02.vqswxg(active, since 2w), standbys: gedaopl02.yrwzqh
    mds: cephfs:1 {0=cephfs.gedaopl01.zjuhem=up:active} 1 up:standby
    osd: 3 osds: 2 up (since 4d), 2 in (since 94m)

  task status:
    scrub status:
        mds.cephfs.gedaopl01.zjuhem: idle

  data:
    pools:   7 pools, 225 pgs
    objects: 13.14k objects, 77 GiB
    usage:   148 GiB used, 299 GiB / 447 GiB avail
    pgs:     13142/39426 objects degraded (33.333%)
             176 active+undersized+degraded
             49  active+undersized

  io:
    client:   0 B/s rd, 6.1 KiB/s wr, 0 op/s rd, 0 op/s wr
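
As far as I understand, the undersized PGs are just the expected consequence of the down osd: with only three osds and (presumably) size=3 pools, every PG is missing its third replica until osd.0 rejoins, which matches the 33.333% degraded figure. The replica counts can be confirmed with:

# show size/min_size for each pool
ceph osd pool ls detail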



Best Regards,

Oliver
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-leave@ceph.io

