
List:       ceph-devel
Subject:    Re: Adding Data-At-Rest compression support to Ceph
From:       Igor Fedotov <ifedotov@mirantis.com>
Date:       2015-09-28 16:56:29
Message-ID: 5609713D.7050607@mirantis.com


On 25.09.2015 17:14, Sage Weil wrote:
> On Fri, 25 Sep 2015, Igor Fedotov wrote:
>> Another thing to note is that we don't have the whole object ready for
>> compression. We just have some new data block written (appended) to the
>> object. So we should either compress that block and save the mapping data
>> mentioned above, or decompress the existing object data and do a full
>> compression again.
>> And IMO introducing seek points is largely similar to what we were talking
>> about - it requires a sort of offset mapping as well.
>>
>> Compression at the OSD probably has some pros as well, but it wouldn't
>> eliminate the need to "muck with stripe sizes or anything".
> I think the best option here is going to be to compress the "stripe unit".
> I.e., if you have a stripe_size of 64K, and are doing k=4 m=2, then the
> stripe unit is 16K (64/4).  Then each shard has an independent unit it can
> compress/decompress and we don't break the ability to read a small extent
> by talking to only a single shard.
Sage, are you considering compression applied after erasure coding here?
Please note that this way one needs to compress an additional 50% of data
(m/k = 2/4 in your example), since the generated 'm' chunks have to be
processed as well.
And you lose the ability to perform recovery on an OSD failure without
applying decompression (and probably recompression) to the remaining shards.

Conversely, doing compression before EC gives a reduced data set for EC
(saving some CPU cycles) and allows a recovery procedure that doesn't
involve an additional decompression/compression pair.
But I suppose the 'stripe unit' approach from above wouldn't work in this
case - the compression entity would have to produce blocks of exactly
"stripe unit" size so that each compressed block fits into a single shard,
and that's hard to achieve...

Thus, as usual, we should decide which drawbacks (and benefits) matter more
here: the ability to read a small extent from a single shard, at the cost of
an increased data set for compression, vs. the ability to skip full
decompression on recovery, plus a reduced data set for EC.
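
For illustration only, a rough back-of-the-envelope sketch (Python, made-up
names, not Ceph code) of how much data each ordering feeds to the compressor,
using the k=4, m=2, 64K-stripe example from above:

    # Rough arithmetic for the two orderings discussed above.
    K, M = 4, 2                       # data / coding chunks
    STRIPE_SIZE = 64 * 1024           # logical stripe width
    STRIPE_UNIT = STRIPE_SIZE // K    # 16K per shard

    def bytes_to_compress(object_size, compress_after_ec):
        """Total bytes the compressor has to process for one object."""
        if compress_after_ec:
            # every shard, including the m coding shards, gets compressed
            return object_size * (K + M) // K    # +50% for k=4, m=2
        # compress before EC: only the original client data is compressed
        return object_size

    print(bytes_to_compress(4 * 1024 * 1024, True))    # 6291456 (6 MB)
    print(bytes_to_compress(4 * 1024 * 1024, False))   # 4194304 (4 MB)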



>
> *Maybe* the shard could compress contiguous stripe units if multiple
> stripes are written together..
>
> In any case, though, there will be some metadata it has to track with the
> object, because the stripe units are no longer fixed size, and there will
> be object_size/stripe_size of them.  I forget if we are already storing a
> CRC for each stripe unit or if it is for the entire shard... if it's the
> former then this won't be a huge change, I think.
>
> sage
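
For illustration only - a minimal sketch (Python, hypothetical names, not
actual Ceph structures) of the kind of per-stripe-unit metadata this implies
on each shard, plus the lookup that turns a logical offset into a compressed
extent:

    from dataclasses import dataclass

    @dataclass
    class StripeUnitInfo:
        logical_len: int      # uncompressed size of this unit (<= stripe unit)
        compressed_len: int   # size actually stored on the shard
        crc: int              # per-unit checksum, as mentioned above

    def compressed_offsets(units):
        """Prefix sums of compressed sizes: where each unit starts on disk."""
        offs, pos = [], 0
        for u in units:
            offs.append(pos)
            pos += u.compressed_len
        return offs

    def locate(units, offsets, stripe_unit, logical_off):
        """Map a logical offset within a shard to the compressed extent that
        has to be read and decompressed (one stripe unit at most)."""
        idx = logical_off // stripe_unit
        return offsets[idx], units[idx].compressed_len

    def append_unit(units, info):
        """Appending new (compressed) data just extends the per-unit list."""
        units.append(info)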
>
>
>
>> On 24.09.2015 20:53, Samuel Just wrote:
>>> The catch is that currently accessing 4k in the middle of a 4MB object
>>> does not require reading the whole object, so you'd need some kind of
>>> logical offset -> compressed offset mapping.
>>> -Sam
>>>
>>> On Thu, Sep 24, 2015 at 10:36 AM, Robert LeBlanc <robert@leblancnet.us>
>>> wrote:
>>>>
>>>> I'm probably missing something, but since we are talking about data at
>>>> rest, can't we just have the OSD compress the object as it goes to
>>>> disk? Instead of
>>>> rbd\udata.1ba49c10d9b00c.0000000000006859__head_2AD1002B__11 it would
>>>> be
>>>> rbd\udata.1ba49c10d9b00c.0000000000006859__head_2AD1002B__11.{gz,xz,bz2,lzo,etc}.
>>>> Then it seems that you don't have to muck with stripe sizes or
>>>> anything. For compressible objects the files would simply be less than
>>>> 4MB, and some of these algorithms already decide that if the data is
>>>> not compressible enough, they just store it as-is.
>>>>
>>>> Something like zlib Z_FULL_FLUSH may help provide some seek points
>>>> within an archive to prevent decompressing the whole object for reads?
>>>>
>>>> ----------------
>>>> Robert LeBlanc
>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
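
For illustration only, a rough sketch (using Python's zlib, helper names made
up) of the Z_FULL_FLUSH idea: record (logical offset, compressed offset)
pairs at the flush points, then inflate from the nearest preceding point
instead of from the start of the object:

    import zlib

    SEEK_EVERY = 64 * 1024   # illustrative seek-point granularity

    def compress_with_seek_points(data):
        """Compress one object, emitting a Z_FULL_FLUSH after every
        SEEK_EVERY bytes of input and recording the seek points."""
        comp = zlib.compressobj()
        out = bytearray()
        seek_points = []                 # (logical_offset, compressed_offset)
        for off in range(0, len(data), SEEK_EVERY):
            seek_points.append((off, len(out)))
            out += comp.compress(data[off:off + SEEK_EVERY])
            out += comp.flush(zlib.Z_FULL_FLUSH)   # resets the dictionary
        out += comp.flush()                        # Z_FINISH
        return bytes(out), seek_points

    def read_extent(blob, seek_points, logical_off, length):
        """Start inflating at the nearest preceding seek point."""
        base_logical, base_comp = max(p for p in seek_points
                                      if p[0] <= logical_off)
        if base_comp == 0:
            d = zlib.decompressobj()                       # zlib header at start
        else:
            d = zlib.decompressobj(wbits=-zlib.MAX_WBITS)  # mid-stream: raw deflate
        plain = d.decompress(blob[base_comp:])
        return plain[logical_off - base_logical:][:length]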
>>>>
>>>>
>>>> On Thu, Sep 24, 2015 at 10:25 AM, Igor Fedotov  wrote:
>>>>> On 24.09.2015 19:03, Sage Weil wrote:
>>>>>> On Thu, 24 Sep 2015, Igor Fedotov wrote:
>>>>>>
>>>>>> Dynamic stripe sizes are possible but it's a significant change from
>>>>>> the way the EC pool currently works. I would make that a separate
>>>>>> project (as it's useful in its own right) and not complicate the
>>>>>> compression situation. Or, if it simplifies the compression approach,
>>>>>> then I'd make that change first.
>>>>>>
>>>>>> sage
>>>>> Just to clarify a bit - this is what I saw when I played with Ceph;
>>>>> please correct me if I'm wrong...
>>>>>
>>>>> For low-level RADOS access, client data written to an EC pool has to
>>>>> be aligned with the stripe size. The last block can be unaligned, but
>>>>> no more appends are permitted in that case.
>>>>> Data copied from the cache goes in blocks of up to 8MB; in the general
>>>>> case the last block seems to have an unaligned size too.
>>>>>
>>>>> The EC pool additionally aligns the incoming blocks to the stripe
>>>>> boundary internally, so the blocks going to the EC library are always
>>>>> aligned. We should probably perform compression prior to this
>>>>> alignment. Thus some dependency on the stripe size is present in EC
>>>>> pools, but it's not that strict.
>>>>>
>>>>> Thanks,
>>>>> Igor
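
For illustration only - a minimal sketch (Python, hypothetical names, not
actual Ceph code) of the ordering suggested above: hook compression in before
the EC pool's internal alignment, so that the (usually smaller) compressed
block is what gets padded to the stripe boundary:

    STRIPE_WIDTH = 64 * 1024   # illustrative stripe size (k * stripe unit)

    def pad_to_stripe(buf):
        """Roughly what the EC pool does internally: zero-pad the incoming
        block up to the next stripe boundary before the EC library sees it."""
        pad = (-len(buf)) % STRIPE_WIDTH
        return buf + b"\x00" * pad

    def prepare_for_ec(client_block, compress):
        """Compress first, then align: the EC library only ever sees the
        compressed block, already padded to the stripe bound."""
        return pad_to_stripe(compress(client_block))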
>>>>>

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
