
List:       cassandra-user
Subject:    Re: strange behavior of counter tables after losing a node
From:       Attila Wind <attilaw () swf ! technology>
Date:       2021-01-27 10:16:03
Message-ID: 106a5559-87e6-cafb-e6b0-14d6dfb8973e () swf ! technology
[Download RAW message or body]

Thanks Elliott, yep! This is exactly what we also figured out as a next
step: upgrade our TEST env to beta4 so we can re-evaluate the test we did.
Makes 100% sense.

Attila Wind

http://www.linkedin.com/in/attilaw <http://www.linkedin.com/in/attilaw>
Mobile: +49 176 43556932


On 27.01.2021 10:18, Elliott Sims wrote:
> To start with, maybe update to beta4.  There's an absolutely massive 
> list of fixes since alpha4.  I don't think the alphas are necessarily 
> expected to be in a usable/low-bug state, whereas beta4 is approaching 
> RC status.
>
> On Tue, Jan 26, 2021, 10:44 PM Attila Wind <attilaw@swf.technology> wrote:
>
>     Hey All,
>
>     I'm coming back to my own question (see below), as this happened
>     to us again 2 days later, so we took the time to further analyse
>     the issue. I'd like to share our experiences and the workaround we
>     figured out.
>
>     So to just quickly sum up the most important details again:
>
>       * we have a 3-node cluster - Cassandra 4-alpha4 and RF=2 - in
>         one DC
>       * we use consistency level ONE in all queries
>       * if we lose one node from the cluster, then
>           o non-counter table writes are fine, the remaining 2 nodes
>             take over everything
>           o but counter table writes start to fail with the exception
>             "com.datastax.driver.core.exceptions.WriteTimeoutException:
>             Cassandra timeout during COUNTER write query at
>             consistency ONE (1 replica were required but only 0
>             acknowledged the write)"
>           o the two remaining nodes both produce hint files for the
>             fallen node
>       * just a note: counter_write_request_timeout_in_ms = 10000,
>         write_request_timeout_in_ms = 5000 in our cassandra.yaml
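>
>     For reference, this is how the same two settings look in
>     cassandra.yaml syntax (everything else left at defaults):
>
>         counter_write_request_timeout_in_ms: 10000
>         write_request_timeout_in_ms: 5000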
>
>     To test this a bit further, we did the following:
>
>       * we shut down one of the nodes normally
>         In this case we do not see the above behavior - everything
>         happens as it should, with no failures on counter table writes,
>         so this is good
>       * we reproduced the issue in our TEST env by hard-killing one of
>         the nodes instead of a normal shutdown (simulating a hardware
>         failure like the one we had in PROD)
>         Bingo, the issue starts immediately!
>
>     Based on the above observations, the "normal shutdown - no problem"
>     case gave us an idea - so now we have a workaround to get the
>     cluster back into a working state in case we lose a node
>     permanently (or at least for a long time):
>
>      1. (in our case) we stop the App to stop all Cassandra operations
>      2. stop all remaining nodes in the cluster normally
>      3. restart them normally
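>
>     As a concrete sketch of steps 2 and 3 (the service name and the
>     use of systemd are assumptions - adapt to your setup):
>
>         # 2. on every remaining node: flush memtables and stop
>         #    accepting requests, then shut the daemon down cleanly
>         nodetool drain
>         sudo systemctl stop cassandra
>
>         # 3. start the nodes again and wait until all remaining
>         #    nodes show UN (Up/Normal)
>         sudo systemctl start cassandra
>         nodetool status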
>
>     This way the remaining nodes realize the failed node is down and
>     switch over to the expected processing - everything works,
>     including counter table writes.
>
>     If anyone has any idea what to check / change / do in our cluster
>     I'm all ears! :-)
>
>     thanks
>
>     Attila Wind
>
>     http://www.linkedin.com/in/attilaw
>     <http://www.linkedin.com/in/attilaw>
>     Mobile: +49 176 43556932
>
>
>     22.01.2021 07:35, Attila Wind wrote:
>>
>>     Hey guys,
>>
>>     Yesterday we had an outage after we lost a node, and we saw
>>     behavior that we cannot explain.
>>
>>     Our data schema has both counter and normal tables. And we have
>>     replication factor = 2 and consistency level LOCAL_ONE (explicitly
>>     set)
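>>
>>     For context, a minimal sketch of such a schema (the keyspace and
>>     table names are made up):
>>
>>         CREATE KEYSPACE app WITH replication =
>>             {'class': 'SimpleStrategy', 'replication_factor': 2};
>>
>>         -- in a counter table every non-key column must be a counter
>>         CREATE TABLE app.page_views (
>>             page  text PRIMARY KEY,
>>             views counter
>>         );
>>
>>         -- the kind of update that later slowed down:
>>         UPDATE app.page_views SET views = views + 1
>>             WHERE page = 'home';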
>>
>>     What we saw:
>>     After a node went down, the updates of the counter tables slowed
>>     down. A lot! These updates normally take only a few milliseconds,
>>     but now started to take 30-60 seconds(!)
>>     At the same time, the write ops against non-counter tables did not
>>     show any difference. The app log was silent in terms of errors.
>>     So the queries - including the counter table updates - were not
>>     failing at all (otherwise we would see exceptions coming from the
>>     DAO layer, originating from the Cassandra driver).
>>     One more thing: only those updates where the lost node was
>>     involved (due to the partition key) suffered from the above huge
>>     wait time. Other updates went just fine.
>>
>>     The whole thing looks as if Cassandra internally started to wait -
>>     a lot - for the lost node. Updates finally succeeded without
>>     failure - at least from the App's (the client's) point of view.
>>
>>     Has anyone ever experienced similar behavior?
>>     What could explain the above?
>>
>>     Some more details: the App is implemented in Java 8, we are using
>>     the Datastax driver 3.7.1, and the server cluster is running
>>     Cassandra 4.0 alpha 4. Cluster size is 3 nodes.
>>
>>     Any feedback is appreciated! :-)
>>
>>     thanks
>>
>>     -- 
>>     Attila Wind
>>
>>     http://www.linkedin.com/in/attilaw
>>     <http://www.linkedin.com/in/attilaw>
>>     Mobile: +49 176 43556932
>>
>>



