
List:       ceph-users
Subject:    [ceph-users] Re: [External Email] Re: Hardware for new OSD nodes.
From:       Eneko Lacunza <elacunza@binovo.es>
Date:       2020-10-26 13:51:01
Message-ID: 08aea60b-119d-0455-14ce-25b5d8cd5b4a@binovo.es

Hi Dave,

El 23/10/20 a las 22:28, Dave Hall escribió:
> 
> Eneko,
> 
> # ceph health detail
> HEALTH_WARN BlueFS spillover detected on 7 OSD(s)
> BLUEFS_SPILLOVER BlueFS spillover detected on 7 OSD(s)
> osd.1 spilled over 648 MiB metadata from 'db' device (28 GiB used of 124 GiB) to slow device
> osd.3 spilled over 613 MiB metadata from 'db' device (28 GiB used of 124 GiB) to slow device
> osd.4 spilled over 485 MiB metadata from 'db' device (28 GiB used of 124 GiB) to slow device
> osd.10 spilled over 1008 MiB metadata from 'db' device (28 GiB used of 124 GiB) to slow device
> osd.17 spilled over 808 MiB metadata from 'db' device (28 GiB used of 124 GiB) to slow device
> osd.18 spilled over 2.5 GiB metadata from 'db' device (28 GiB used of 124 GiB) to slow device
> osd.20 spilled over 1.5 GiB metadata from 'db' device (28 GiB used of 124 GiB) to slow device
> 
> nvme0n1                              259:1    0   1.5T  0 disk
> ├─ceph--block--dbs--a2b7a161--d4da--4b86--a191--37564008adca-osd--block--db--6dcbb748--13f5--45cb--9d49--6c78d6589a71  253:1    0   124G  0 lvm
> ├─ceph--block--dbs--a2b7a161--d4da--4b86--a191--37564008adca-osd--block--db--736a22a8--e4aa--4da9--b63b--295d8f5f2a3d  253:3    0   124G  0 lvm
> ├─ceph--block--dbs--a2b7a161--d4da--4b86--a191--37564008adca-osd--block--db--751c6623--9870--4123--b551--1fd7fc837341  253:5    0   124G  0 lvm
> ├─ceph--block--dbs--a2b7a161--d4da--4b86--a191--37564008adca-osd--block--db--2a376e8d--abb1--42af--a4bd--4ae8734d703e  253:7    0   124G  0 lvm
> ├─ceph--block--dbs--a2b7a161--d4da--4b86--a191--37564008adca-osd--block--db--54fbe282--9b29--422b--bdb2--d7ed730bc589  253:9    0   124G  0 lvm
> ├─ceph--block--dbs--a2b7a161--d4da--4b86--a191--37564008adca-osd--block--db--c1153cd2--2ec0--4e7f--a3d7--91dac92560ad  253:11   0   124G  0 lvm
> ├─ceph--block--dbs--a2b7a161--d4da--4b86--a191--37564008adca-osd--block--db--d613f4eb--6ddc--4dd5--a2b5--cb520b6ba922  253:13   0   124G  0 lvm
> └─ceph--block--dbs--a2b7a161--d4da--4b86--a191--37564008adca-osd--block--db--41f75c25--67db--46e8--a3fb--ddee9e7f7fc4  253:15   0   124G  0 lvm
> 

So this means that if you use 300GB WAL/DB partitions, your spillovers 
will be gone (BlueStore is only using 28 GiB, as you can see).

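Since you say you won't be adding the extra 4 drives, one option - please 
double-check the docs for your exact release before trying it, and the LV 
name below is just a placeholder - is to grow the existing DB LVs in place 
and then tell BlueFS to pick up the extra space, roughly per OSD:

    systemctl stop ceph-osd@1
    lvextend -L 300G <vg>/<db-lv-backing-osd.1>
    ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-1
    systemctl start ceph-osd@1
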
I don't know what the performance penalty of the current spillover is, but 
at least you know those 300GB will be put to use :)

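If you want exact numbers per OSD (how much of the DB device is in use and 
how much has spilled to the slow device), the BlueFS perf counters show 
them. On each OSD host, something like (osd.1 is just an example id):

    ceph daemon osd.1 perf dump bluefs

and look at db_total_bytes, db_used_bytes and slow_used_bytes in the output.
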
Cheers


> Dave Hall
> Binghamton University
> kdhall@binghamton.edu  <mailto:kdhall@binghamton.edu>
> 607-760-2328 (Cell)
> 607-777-4641 (Office)
> On 10/23/2020 6:00 AM, Eneko Lacunza wrote:
> > Hi Dave,
> > 
> > El 22/10/20 a las 19:43, Dave Hall escribió:
> > > 
> > > > El 22/10/20 a las 16:48, Dave Hall escribió:
> > > > > 
> > > > > (BTW, Nautilus 14.2.7 on Debian non-container.)
> > > > > 
> > > > > We're about to purchase more OSD nodes for our cluster, but I have 
> > > > > a couple questions about hardware choices.  Our original nodes 
> > > > > were 8 x 12TB SAS drives and a 1.6TB Samsung NVMe card for WAL, 
> > > > > DB, etc.
> > > > > 
> > > > > We chose the NVMe card for performance since it has an 8 lane PCIe 
> > > > > interface.  However, we're currently seeing BlueFS spillovers.
> > > > > 
> > > > > The Tyan chassis we are considering has the option of 4 x U.2 NVMe 
> > > > > bays - each with 4 PCIe lanes (and 8 SAS bays).  It has occurred 
> > > > > to me that I might stripe 4 1TB NVMe drives together to get much 
> > > > > more space for WAL/DB and a net performance of 16 PCIe lanes.
> > > > > 
> > > > > Any thoughts on this approach?
> > > > Don't stripe them; if one NVMe fails you'll lose all OSDs. Just use 
> > > > 1 NVMe drive for 2 SAS drives and provision 300GB for WAL/DB for 
> > > > each OSD (see related threads on this mailing list about why that 
> > > > exact size).
> > > > 
> > > > This way, if an NVMe fails, you'll only lose 2 OSDs.
> > > I was under the impression that everything that BlueStore puts on 
> > > the SSD/NVMe could be reconstructed from information on the OSD. Am 
> > > I mistaken about this?  If so, my single 1.6TB NVMe card is equally 
> > > vulnerable.
> > 
> > I don't think so; that info only exists on that partition, as was the 
> > case with the filestore journal. Your single 1.6TB NVMe is vulnerable, yes.
> > 
> > > > 
> > > > Also, what size of WAL/DB partitions do you have now, and what 
> > > > spillover size?
> > > 
> > > I recently posted another question to the list on this topic, since 
> > > I now have spillover on 7 of 24 OSDs.  Since the data layout on the 
> > > NVMe for BlueStore is not traditional, I've never quite figured out 
> > > how to get this information.  The current partition size is 1.6TB 
> > > / 12 since we had the possibility to add four more drives to each 
> > > node.  How that was divided between WAL, DB, etc. is something I'd 
> > > like to be able to understand.  However, we're not going to add the 
> > > extra 4 drives, so expanding the LVM partitions is now a possibility.
> > Can you paste the warning message? It shows the spillover size. What 
> > size are the partitions on the NVMe disk (lsblk)?
> > 
> > 
> > Cheers
> > 


-- 
Eneko Lacunza                | +34 943 569 206
                              | elacunza@binovo.es
Zuzendari teknikoa           | https://www.binovo.es
Director técnico             | Astigarragako Bidea, 2 - 2º izda.
BINOVO IT HUMAN PROJECT S.L  | oficina 10-11, 20180 Oiartzun

_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-leave@ceph.io

