[prev in list] [next in list] [prev in thread] [next in thread] 

List:       cassandra-dev
Subject:    Re: CASSANDRA-14227 removing the 2038 limit
From:       Berenguer Blasi <berenguerblasi () gmail ! com>
Date:       2022-11-15 6:06:56
Message-ID: adcb4028-4ed0-2e55-0913-b53319195bd6 () gmail ! com
[Download RAW message or body]

Hi all,

thanks for your answers!.

To Benedict's point: In terms of the uvint enconding of deletionTime 
i.e. it is true it happens here 
https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/db/SerializationHeader.java#L170. \
 But we also have a DeletionTime serializer here 
https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/db/DeletionTime.java#L166 \
 that is writing an int and a long that would now write 2 longs.

TTL itself (the delta) remains an int in the new PR so it should have no 
effect in size.

Did I reference the correct parts of the codebase? No sstable expert here.

On 14/11/22 19:28, Josh McKenzie wrote:
> > in 2035 we'd hit the same problem again.
> In terms of "kicking a can down the road", this would be a pretty 
> vigorous kick. I wouldn't push back against this deferral. :)
> 
> On Mon, Nov 14, 2022, at 9:28 AM, Benedict wrote:
> > 
> > I'm confused why we see *any* increase in sstable size - TTLs and 
> > deletion times are already written as unsigned vints as offsets from 
> > an sstable epoch for each value.
> > 
> > I would dig in more carefully to explore why you're seeing this 
> > increase? For the same data there should be no change to size on disk.
> > 
> > 
> > > On 14 Nov 2022, at 06:36, C. Scott Andreas <scott@paradoxica.net> wrote:
> > > A 2-3% increase in storage volume is roughly equivalent to giving 
> > > up the gain from LZ4 -> LZ4HC, or a one to two-level bump in 
> > > Zstandard compression levels. This regression could be very 
> > > expensive for storage-bound use cases.
> > > 
> > > From the perspective of storage overhead, the unsigned int approach 
> > > sounds preferable.
> > > 
> > > > On Nov 13, 2022, at 10:13 PM, Berenguer Blasi 
> > > > <berenguerblasi@gmail.com> wrote:
> > > > 
> > > > 
> > > > Hi all,
> > > > 
> > > > We have done some more research on c14227. The current patch for 
> > > > CASSANDRA-14227 solves the TTL limit issue by switching TTL to long 
> > > > instead of int. This approach does not have a negative impact on 
> > > > memtable memory usage, as C* controles the memory used by the 
> > > > Memtable, but based on our testing it increases the bytes flushed 
> > > > by 4 to 7% and the byte on disk by 2 to 3%.
> > > > 
> > > > As a mitigation to this problem it is possible to encode 
> > > > /localDeletionTime/ as a vint. It results in a 1% improvement but 
> > > > might cause additional computations during compaction or some other 
> > > > operations.
> > > > 
> > > > Benedict's proposal to keep on using ints for TTL but as a delta to 
> > > > nowInSecond would work for memtables but not for work in the 
> > > > SSTable where nowInSecond does not exist. By consequence we would 
> > > > still suffer from the impact on byte flushed and bytes on disk.
> > > > 
> > > > Another approach that was suggested is the use of unsigned integer. 
> > > > Java 8 has an unsigned integer API that would allow us to use 
> > > > unsigned int for TTLs. Based on computation unsigned ints would 
> > > > give us a maximum time of 136 years since the Unix Epoch and 
> > > > therefore a maximum expiration timestamp in 2106. We would have to 
> > > > keep TTL at 20y instead of 68y to give us enough breathing room 
> > > > though, otherwise in 2035 we'd hit the same problem again.
> > > > 
> > > > Happy to hear opinions.
> > > > 
> > > > On 18/10/22 10:56, Berenguer Blasi wrote:
> > > > > 
> > > > > Hi,
> > > > > 
> > > > > apologies for the late reply as I have been OOO. I have done some 
> > > > > profiling and results look virtually identical on trunk and 14227. 
> > > > > I have attached some screenshots to the ticket 
> > > > > https://issues.apache.org/jira/browse/CASSANDRA-14227 
> > > > > <https://issues.apache.org/jira/browse/CASSANDRA-14227>. Unless my 
> > > > > eyes are fooling me everything in the jfrs look the same.
> > > > > 
> > > > > Regards
> > > > > 
> > > > > On 30/9/22 9:44, Berenguer Blasi wrote:
> > > > > > 
> > > > > > Hi Benedict,
> > > > > > 
> > > > > > thanks for the reply! Yes some profiling is probably needed, then 
> > > > > > we can see if going down the delta encoding big refactor rabbit 
> > > > > > hole is worth it?
> > > > > > 
> > > > > > Let's see what other concerns people bring up.
> > > > > > 
> > > > > > Thx.
> > > > > > 
> > > > > > On 29/9/22 11:12, Benedict Elliott Smith wrote:
> > > > > > > My only slight concern with this approach is the additional 
> > > > > > > memory pressure. Since 64yrs should be plenty at any moment in 
> > > > > > > time, I wonder if it wouldn't be better to represent these times 
> > > > > > > as deltas from the nowInSec being used to process the query. So, 
> > > > > > > long math would only be used to normalise the times to this 
> > > > > > > nowInSec (from whatever is stored in the sstable) within a 
> > > > > > > method, and ints would be stored in memtables and any objects 
> > > > > > > used for processing.
> > > > > > > 
> > > > > > > This might admittedly be more work, but I don't believe it 
> > > > > > > should be too challenging - we can introduce a method 
> > > > > > > deletionTime(int nowInSec) that returns a long value by adding 
> > > > > > > nowInSec to the deletionTime, and make the underlying value 
> > > > > > > private, refactoring call sites?
> > > > > > > 
> > > > > > > > On 29 Sep 2022, at 09:37, Berenguer Blasi 
> > > > > > > > <berenguerblasi@gmail.com> <mailto:berenguerblasi@gmail.com> wrote:
> > > > > > > > 
> > > > > > > > Hi all,
> > > > > > > > 
> > > > > > > > I have taken a stab in a PR you can find attached in the 
> > > > > > > > ticket. Mainly:
> > > > > > > > 
> > > > > > > > - I have moved deletion times, gc and nowInSec timestamps to 
> > > > > > > > long. That should get us past the 2038 limit.
> > > > > > > > 
> > > > > > > > - TTL is maxed now to 68y. Think CQL API compatibility and a 
> > > > > > > > sort of a 'free' guardrail.
> > > > > > > > 
> > > > > > > > - A new NONE overflow policy is the default but everything is 
> > > > > > > > backwards compatible by keeping the previous ones in place. 
> > > > > > > > Think upgrade scenarios or apps relying on the previous behavior.
> > > > > > > > 
> > > > > > > > - The new limit is around year 292,471,208,677 which sounds ok 
> > > > > > > > given the Sun will start collapsing in 3 to 5 billion years :-)
> > > > > > > > 
> > > > > > > > - Please feel free to drop by the ticket and take a look at the 
> > > > > > > > PR even if it's cursory
> > > > > > > > 
> > > > > > > > Thx in advance.
> > > > > > > > 
> 


[Attachment #3 (text/html)]

<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
  </head>
  <body>
    <p>Hi all,</p>
    <p>thanks for your answers!.</p>
    <p>To Benedict's point: In terms of the uvint enconding of
      deletionTime i.e. it is true it happens here
<a class="moz-txt-link-freetext" \
href="https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/db/ \
SerializationHeader.java#L170">https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/db/SerializationHeader.java#L170</a>.
  But we also have a DeletionTime serializer here
<a class="moz-txt-link-freetext" \
href="https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/db/ \
DeletionTime.java#L166">https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/db/DeletionTime.java#L166</a>
  that is writing an int and a long that would now write 2 longs.</p>
    <p>TTL itself (the delta) remains an int in the new PR so it should
      have no effect in size.<br>
    </p>
    <p>Did I reference the correct parts of the codebase? No sstable
      expert here.<br>
    </p>
    <div class="moz-cite-prefix">On 14/11/22 19:28, Josh McKenzie wrote:<br>
    </div>
    <blockquote type="cite"
      cite="mid:c14b98fc-ecf8-40f1-84c2-9dd4d47c565e@app.fastmail.com">
      <meta http-equiv="content-type" content="text/html; charset=UTF-8">
      <title></title>
      <style type="text/css">p.MsoNormal,p.MsoNoSpacing{margin:0}</style>
      <blockquote type="cite">
        <div>in 2035 we'd hit the same problem again.<br>
        </div>
      </blockquote>
      <div>In terms of "kicking a can down the road", this would be a
        pretty vigorous kick. I wouldn't push back against this
        deferral. :)</div>
      <div><br>
      </div>
      <div>On Mon, Nov 14, 2022, at 9:28 AM, Benedict wrote:<br>
      </div>
      <blockquote type="cite" id="qt" style="">
        <div dir="ltr"><br>
        </div>
        <div dir="ltr">I'm confused why we see *any* increase in sstable
          size - TTLs and deletion times are already written as unsigned
          vints as offsets from an sstable epoch for each value.<br>
        </div>
        <div dir="ltr"><br>
        </div>
        <div dir="ltr">I would dig in more carefully to explore why
          you're seeing this increase? For the same data there should be
          no change to size on disk.<br>
        </div>
        <div dir="ltr">
          <div><br>
          </div>
          <div dir="ltr"><br>
          </div>
          <blockquote type="cite">
            <div>On 14 Nov 2022, at 06:36, C. Scott Andreas
              <a class="moz-txt-link-rfc2396E" \
href="mailto:scott@paradoxica.net">&lt;scott@paradoxica.net&gt;</a> wrote:<br>  \
</div>  </blockquote>
        </div>
        <blockquote type="cite">
          <div dir="ltr">
            <div>A 2-3% increase in storage volume is roughly
              equivalent to giving up the gain from LZ4 -&gt; LZ4HC, or
              a one to two-level bump in Zstandard compression levels.
              This regression could be very expensive for storage-bound
              use cases.<br>
            </div>
            <div><br>
            </div>
            <div>
              <div>
                <div>
                  <div>
                    <div dir="ltr">From the perspective of storage
                      overhead, the unsigned int approach sounds
                      preferable.<br>
                    </div>
                    <div dir="ltr">
                      <div><br>
                      </div>
                      <blockquote type="cite">
                        <div>On Nov 13, 2022, at 10:13 PM, Berenguer
                          Blasi <a class="moz-txt-link-rfc2396E" \
href="mailto:berenguerblasi@gmail.com">&lt;berenguerblasi@gmail.com&gt;</a> \
wrote:<br>  </div>
                      </blockquote>
                    </div>
                    <blockquote type="cite">
                      <div dir="ltr">
                        <div> <br>
                        </div>
                        <p>Hi all,<br>
                        </p>
                        <p>We have done some more research on c14227.
                          The current patch for CASSANDRA-14227 solves
                          the TTL limit issue by switching TTL to long
                          instead of int. This approach does not have a
                          negative impact on memtable memory usage, as
                          C* controles the memory used by the Memtable,
                          but based on our testing it increases the
                          bytes flushed by 4 to 7% and the byte on disk
                          by 2 to 3%.<br>
                        </p>
                        <p>As a mitigation to this problem it is
                          possible to encode <i>localDeletionTime</i>
                          as a vint. It results in a 1% improvement but
                          might cause additional computations during
                          compaction or some other operations.<span
                            aria-label="" class="qt-c-mrkdwn__br"></span><br>
                        </p>
                        <p>Benedict's proposal to keep on using ints for
                          TTL but as a delta to nowInSecond would work
                          for memtables but not for work in the SSTable
                          where nowInSecond does not exist. By
                          consequence we would still suffer from the
                          impact on byte flushed and bytes on disk.<br>
                        </p>
                        <p>Another approach that was suggested is the
                          use of unsigned integer. Java 8 has an
                          unsigned integer API that would allow us to
                          use unsigned int for TTLs. Based on
                          computation unsigned ints would give us a
                          maximum time of 136 years since the Unix Epoch
                          and therefore a maximum expiration timestamp
                          in 2106. We would have to keep TTL at 20y
                          instead of 68y to give us enough breathing
                          room though, otherwise in 2035 we'd hit the
                          same problem again.<br>
                        </p>
                        <p>Happy to hear opinions.<br>
                        </p>
                        <div class="qt-moz-cite-prefix">On 18/10/22
                          10:56, Berenguer Blasi wrote:<br>
                        </div>
                        <blockquote type="cite"
                          cite="mid:c52e1a63-5344-5f6f-659a-4932aba0aba5@gmail.com">
                          <p>Hi,<br>
                          </p>
                          <p>apologies for the late reply as I have been
                            OOO. I have done some profiling and results
                            look virtually identical on trunk and 14227.
                            I have attached some screenshots to the
                            ticket <a class="qt-moz-txt-link-freetext
                              moz-txt-link-freetext"
                              \
                href="https://issues.apache.org/jira/browse/CASSANDRA-14227"
                              \
moz-do-not-send="true">https://issues.apache.org/jira/browse/CASSANDRA-14227</a>.  \
Unless my eyes are fooling me everything in  the jfrs look the same.<br>
                          </p>
                          <p>Regards<br>
                          </p>
                          <div class="qt-moz-cite-prefix">On 30/9/22
                            9:44, Berenguer Blasi wrote:<br>
                          </div>
                          <blockquote type="cite"
                            \
cite="mid:4d20b63e-4281-4db5-3e83-c0304875de93@gmail.com">  <p>Hi Benedict,<br>
                            </p>
                            <p>thanks for the reply! Yes some profiling
                              is probably needed, then we can see if
                              going down the delta encoding big refactor
                              rabbit hole is worth it?<br>
                            </p>
                            <p>Let's see what other concerns people
                              bring up.<br>
                            </p>
                            <p>Thx.<br>
                            </p>
                            <div class="qt-moz-cite-prefix">On 29/9/22
                              11:12, Benedict Elliott Smith wrote:<br>
                            </div>
                            <blockquote type="cite"
                              \
cite="mid:C00589E9-9014-4537-9538-A1368E75A7EA@apache.org">  <div dir="ltr">
                                <div>My only slight concern with this
                                  approach is the additional memory
                                  pressure. Since 64yrs should be plenty
                                  at any moment in time, I wonder if it
                                  wouldn't be better to represent these
                                  times as deltas from the nowInSec
                                  being used to process the query. So,
                                  long math would only be used to
                                  normalise the times to this nowInSec
                                  (from whatever is stored in the
                                  sstable) within a method, and ints
                                  would be stored in memtables and any
                                  objects used for processing. <br>
                                </div>
                                <div class="qt-AppleOriginalContents"
                                  style="direction:ltr;"><br>
                                </div>
                                <div class="qt-AppleOriginalContents"
                                  style="direction:ltr;">This might
                                  admittedly be more work, but I don't
                                  believe it should be too challenging -
                                  we can introduce a method
                                  deletionTime(int nowInSec) that
                                  returns a long value by adding
                                  nowInSec to the deletionTime, and make
                                  the underlying value private,
                                  refactoring call sites?<br>
                                </div>
                                <div class="qt-AppleOriginalContents"
                                  style="direction:ltr;">
                                  <div><br>
                                  </div>
                                  <blockquote type="cite">
                                    <div>On 29 Sep 2022, at 09:37,
                                      Berenguer Blasi <a
                                        class="qt-moz-txt-link-rfc2396E"
href="mailto:berenguerblasi@gmail.com" \
moz-do-not-send="true">&lt;berenguerblasi@gmail.com&gt;</a>  wrote:<br>
                                    </div>
                                    <div><br>
                                    </div>
                                    <div>
                                      <div>
                                        <div>Hi all,<br>
                                        </div>
                                        <div> <br>
                                        </div>
                                        <div> I have taken a stab in a
                                          PR you can find attached in
                                          the ticket. Mainly:<br>
                                        </div>
                                        <div> <br>
                                        </div>
                                        <div> - I have moved deletion
                                          times, gc and nowInSec
                                          timestamps to long. That
                                          should get us past the 2038
                                          limit.<br>
                                        </div>
                                        <div> <br>
                                        </div>
                                        <div> - TTL is maxed now to 68y.
                                          Think CQL API compatibility
                                          and a sort of a 'free'
                                          guardrail.<br>
                                        </div>
                                        <div> <br>
                                        </div>
                                        <div> - A new NONE overflow
                                          policy is the default but
                                          everything is backwards
                                          compatible by keeping the
                                          previous ones in place. Think
                                          upgrade scenarios or apps
                                          relying on the previous
                                          behavior.<br>
                                        </div>
                                        <div> <br>
                                        </div>
                                        <div> - The new limit is around
                                          year 292,471,208,677 which
                                          sounds ok given the Sun will
                                          start collapsing in 3 to 5
                                          billion years :-)<br>
                                        </div>
                                        <div> <br>
                                        </div>
                                        <div> - Please feel free to drop
                                          by the ticket and take a look
                                          at the PR even if it's cursory<br>
                                        </div>
                                        <div> <br>
                                        </div>
                                        <div> Thx in advance.<br>
                                        </div>
                                        <div> <br>
                                        </div>
                                      </div>
                                    </div>
                                  </blockquote>
                                </div>
                              </div>
                            </blockquote>
                          </blockquote>
                        </blockquote>
                      </div>
                    </blockquote>
                  </div>
                </div>
              </div>
            </div>
          </div>
        </blockquote>
      </blockquote>
      <div><br>
      </div>
    </blockquote>
  </body>
</html>



[prev in list] [next in list] [prev in thread] [next in thread] 

Configure | About | News | Add a list | Sponsored by KoreLogic