
List:       ceph-users
Subject:    [ceph-users] Re: The feasibility of mixed SSD and HDD replicated pool
From:       Frank Schilder <frans@dtu.dk>
Date:       2020-10-27 6:57:20
Message-ID: 91b1c565ef854c62a513b07402bf8d7c@dtu.dk

Thanks for digging this out. I thought I remembered exactly this method (I don't
know from where), but I couldn't find it in the documentation and started doubting
it. Yes, this would be very useful information to add to the documentation, and it
also confirms that your simpler setup with just a specialized crush rule will work
exactly as intended and is long-term stable.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: 胡 玮文 <huww98@outlook.com>
Sent: 26 October 2020 17:19
To: Frank Schilder
Cc: Anthony D'Atri; ceph-users@ceph.io
Subject: Re: [ceph-users] Re: The feasibility of mixed SSD and HDD replicated pool

> On 26 Oct 2020, at 15:43, Frank Schilder <frans@dtu.dk> wrote:
> 
> 
> > I've never seen anything that implies that lead OSDs within an acting set are a
> > function of CRUSH rule ordering.
> 
> This is actually a good question. I believed that I had seen/heard that somewhere,
> but I might be wrong.
> Looking at the definition of a PG, it states that a PG is an ordered set of OSD
> IDs and that the first up OSD will be the primary. In other words, it seems that
> the lowest OSD ID is decisive. If the SSDs were deployed before the HDDs, they
> have the smallest IDs and, hence, will be preferred as primary OSDs.

I don't think this is correct. From my experiments with the previously mentioned
CRUSH rule, no matter what the IDs of the SSD OSDs are, the primary OSDs are always
the SSDs.
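
For reference, the kind of rule I am talking about is the usual "SSD for the first
replica, HDD for the rest" pattern, roughly as sketched below (rule name and id are
illustrative, not the exact rule from my earlier mail):

    rule ssd_primary {
        id 5
        type replicated
        min_size 1
        max_size 10
        # first replica (the primary candidate) from SSDs
        step take default class ssd
        step chooseleaf firstn 1 type host
        step emit
        # remaining replicas from HDDs
        step take default class hdd
        step chooseleaf firstn -1 type host
        step emit
    }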

I also had a look at the code. If I understand it correctly:

* If the default primary affinity is not changed, the primary-affinity logic is
  skipped, and the primary is simply the first OSD returned by the CRUSH
  algorithm [1].

* The order of OSDs returned by CRUSH still matters if you change the primary
  affinity. The affinity is the probability that a test succeeds; the OSDs are
  tested in order, so earlier OSDs have a higher probability of becoming
  primary. [2]
  * If any OSD has primary affinity = 1.0, its test always succeeds, and any OSD
    after it will never be primary.
  * Suppose CRUSH returns 3 OSDs, each with primary affinity 0.5. Then the 2nd OSD
    becomes primary with probability 0.25 and the 3rd with probability 0.125;
    otherwise, the 1st is primary.
  * If no test succeeds (e.g. all OSDs have affinity 0), the 1st OSD becomes
    primary as a fallback.

[1]: https://github.com/ceph/ceph/blob/6dc03460ffa1315e91ea21b1125200d3d5a01253/src/osd/OSDMap.cc#L2456
[2]: https://github.com/ceph/ceph/blob/6dc03460ffa1315e91ea21b1125200d3d5a01253/src/osd/OSDMap.cc#L2561
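
As a quick sanity check (this is just how I verified the behaviour, not part of the
argument above):

    # per-OSD primary affinity is shown in the PRI-AFF column
    ceph osd tree

    # UP_PRIMARY / ACTING_PRIMARY show which OSD is currently primary for each PG
    ceph pg dump pgs_brief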


So, setting the primary affinity of all SSD OSDs to 1.0 (which is already the
default) should be sufficient for them to become the primaries in my case.
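
For completeness, that is just the following, with osd.3 standing in for each SSD
OSD (1.0 is the default anyway; setting it explicitly only documents the intent):

    ceph osd primary-affinity osd.3 1.0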

Do you think I should contribute these findings to the documentation?

> This, however, is not a sustainable situation. Any addition of OSDs will mess this
> up and the distribution scheme will fail in the future. A way out seems to be:
> - subdivide your HDD storage using device classes:
>   * define a device class for HDDs with primary affinity = 0; for example, pick 5
>     HDDs and change their device class to hdd_np (for "no primary")
>   * set the primary affinity of these HDD OSDs to 0
>   * modify your crush rule to use "step take default class hdd_np"
>   * this will create a pool with primaries on SSD and balanced storage
>     distribution between SSD and HDD
>   * all-HDD pools are deployed as usual on class hdd
>   * when increasing capacity, one needs to take care to add disks to the hdd_np
>     class and set their primary affinity to 0
>   * somewhat increased admin effort, but a fully working solution
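
The scheme quoted above would translate to roughly the following commands, with
osd.12 standing in for each of the chosen HDDs and hdd_np being the class name
suggested above:

    # move the HDD OSD into the dedicated "no primary" device class
    ceph osd crush rm-device-class osd.12
    ceph osd crush set-device-class hdd_np osd.12

    # make sure it is never chosen as primary
    ceph osd primary-affinity osd.12 0

    # and in the mixed pool's crush rule, the HDD step becomes:
    #   step take default class hdd_np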
> 
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-leave@ceph.io

