List: cassandra-user
Subject: Re: strange behavior of counter tables after losing a node
From: Attila Wind <attilaw () swf ! technology>
Date: 2021-01-27 10:16:03
Message-ID: 106a5559-87e6-cafb-e6b0-14d6dfb8973e () swf ! technology
Thanks Elliott, yep! That is exactly what we figured out as a next
step too: upgrade our TEST env to beta4 so we can re-run the test we did.
Makes 100% sense
Attila Wind
http://www.linkedin.com/in/attilaw
Mobile: +49 176 43556932
On 27.01.2021 10:18, Elliott Sims wrote:
> To start with, maybe update to beta4. There's an absolutely massive
> list of fixes since alpha4. I don't think the alphas are necessarily
> expected to be in a usable/low-bug state, whereas beta4 is
> approaching RC status.
>
> On Tue, Jan 26, 2021, 10:44 PM Attila Wind <attilaw@swf.technology> wrote:
>
> Hey All,
>
> I'm coming back to my own question (see below), as this happened to
> us again 2 days later, so we took the time to analyse the issue
> further. I'd like to share our experiences and the workaround we
> figured out.
>
> So, to quickly sum up the most important details again:
>
>   * we have a 3-node cluster - Cassandra 4.0-alpha4 and RF=2 - in one DC
>   * we are using consistency level ONE in all queries (see the sketch
>     after this list)
>   * if we lose one node from the cluster, then
>       o non-counter table writes are fine, the remaining 2 nodes take
>         over everything
>       o but counter table writes start to fail with the exception
>         "com.datastax.driver.core.exceptions.WriteTimeoutException:
>         Cassandra timeout during COUNTER write query at consistency ONE
>         (1 replica were required but only 0 acknowledged the write)"
>       o the two remaining nodes are both producing hints files for the
>         fallen one
>   * just a note: counter_write_request_timeout_in_ms = 10000,
>     write_request_timeout_in_ms = 5000 in our cassandra.yaml
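>
> To make the setup concrete, here is a minimal sketch of the kind of
> counter update our app issues (keyspace, table, and column names are
> made up for illustration; Datastax Java driver 3.x API):
>
>     import com.datastax.driver.core.Cluster;
>     import com.datastax.driver.core.ConsistencyLevel;
>     import com.datastax.driver.core.Session;
>     import com.datastax.driver.core.SimpleStatement;
>     import com.datastax.driver.core.exceptions.WriteTimeoutException;
>
>     public class CounterWriteDemo {
>         public static void main(String[] args) {
>             try (Cluster cluster = Cluster.builder()
>                     .addContactPoint("127.0.0.1") // made-up contact point
>                     .build();
>                  Session session = cluster.connect("demo_ks")) { // made-up keyspace
>
>                 // counter update at consistency ONE, as described above
>                 SimpleStatement update = new SimpleStatement(
>                         "UPDATE page_views SET views = views + 1 WHERE page_id = ?",
>                         "home");
>                 update.setConsistencyLevel(ConsistencyLevel.ONE);
>
>                 try {
>                     session.execute(update);
>                 } catch (WriteTimeoutException e) {
>                     // with one replica hard-down this is exactly what we see:
>                     // "1 replica were required but only 0 acknowledged the write"
>                     System.err.println("counter write timed out: " + e.getMessage());
>                 }
>             }
>         }
>     }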
>
> To test this a bit further, we did the following:
>
>   * we shut down one of the nodes gracefully (a normal process stop,
>     so the node can announce its shutdown to the others via gossip)
>     In this case we do not see the above behavior - everything
>     happens as it should, no failures on counter table writes -
>     so this is good
>   * we reproduced the issue in our TEST env by hard-killing one of
>     the nodes instead of a normal shutdown (simulating a hardware
>     failure like the one we had in PROD)
>     Bingo, the issue starts immediately!
>
> Based on the above observations, the "normal shutdown - no problem"
> case gave us an idea - so now we have a workaround for getting the
> cluster back into a working state in case we lose a node permanently
> (or at least for a long time):
>
> 1. (in our case) we stop the App to stop all Cassandra operations
> 2. stop all remaining nodes in the cluster normally
> 3. restart them normally
>
> This way the remaining nodes realize the failed node is down and fall
> back into the expected processing - everything works, including
> counter table writes (see the probe sketched below)
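>
> For illustration, a probe along the following lines can confirm that
> counter writes work again after the restart (hypothetical helper, not
> our actual code; same made-up schema as in the earlier sketch). It
> simply retries a counter update a few times and reports whether the
> cluster accepts counter writes again:
>
>     import com.datastax.driver.core.ConsistencyLevel;
>     import com.datastax.driver.core.Session;
>     import com.datastax.driver.core.SimpleStatement;
>     import com.datastax.driver.core.exceptions.DriverException;
>
>     public class CounterWriteProbe {
>
>         /** Returns true as soon as one counter write succeeds. */
>         static boolean counterWritesRecovered(Session session, int attempts) {
>             SimpleStatement probe = new SimpleStatement(
>                     "UPDATE page_views SET views = views + 1 WHERE page_id = ?",
>                     "probe");
>             probe.setConsistencyLevel(ConsistencyLevel.ONE);
>             for (int i = 0; i < attempts; i++) {
>                 try {
>                     session.execute(probe);
>                     return true; // the cluster accepted a counter write
>                 } catch (DriverException e) {
>                     System.err.println("attempt " + (i + 1) + " failed: " + e);
>                 }
>             }
>             return false;
>         }
>     }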
>
> If anyone has any idea what to check / change / do in our cluster
> I'm all ears! :-)
>
> thanks
>
> Attila Wind
>
> http://www.linkedin.com/in/attilaw
> Mobile: +49 176 43556932
>
>
> On 22.01.2021 07:35, Attila Wind wrote:
>>
>> Hey guys,
>>
>> Yesterday we had an outage after we lost a node, and we saw behavior
>> we cannot explain.
>>
>> Our data schema has both counter and normal tables. And we have
>> replicationFactor = 2 and consistency level LOCAL_ONE (explicitly
>> set)
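>>
>> For reference, explicitly setting LOCAL_ONE as the driver-wide
>> default looks roughly like this with the Datastax 3.x driver
>> (contact point made up; individual statements can still override it):
>>
>>     import com.datastax.driver.core.Cluster;
>>     import com.datastax.driver.core.ConsistencyLevel;
>>     import com.datastax.driver.core.QueryOptions;
>>
>>     public class ClusterFactory {
>>         static Cluster build() {
>>             // LOCAL_ONE becomes the default for every statement
>>             // that does not set a consistency level itself
>>             return Cluster.builder()
>>                     .addContactPoint("127.0.0.1") // made-up contact point
>>                     .withQueryOptions(new QueryOptions()
>>                             .setConsistencyLevel(ConsistencyLevel.LOCAL_ONE))
>>                     .build();
>>         }
>>     }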
>>
>> What we saw:
>> After a node went down, the updates of the counter tables slowed
>> down. A lot! These updates normally take only a few millisecs, but
>> now they started to take 30-60 seconds(!)
>> At the same time the write ops against non-counter tables did not
>> show any difference. The app log was silent in terms of errors.
>> So the queries - including the counter table updates - were not
>> failing at all (otherwise we would see exceptions coming from the
>> DAO layer, originating from the Cassandra driver).
>> One more thing: only those updates where the lost node was involved
>> (due to the partition key) suffered from the above huge wait time.
>> Other updates went through just fine (see the replica-lookup sketch
>> below)
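>>
>> A minimal sketch (made-up keyspace; a single text partition key
>> column assumed) of how the driver's token metadata can tell whether
>> a given partition key is replicated on the lost node:
>>
>>     import java.nio.ByteBuffer;
>>     import java.nio.charset.StandardCharsets;
>>     import java.util.Set;
>>
>>     import com.datastax.driver.core.Cluster;
>>     import com.datastax.driver.core.Host;
>>
>>     public class ReplicaLookup {
>>
>>         /** Prints the replicas that own the given text partition key. */
>>         static void printReplicas(Cluster cluster, String keyspace, String key) {
>>             // for a single text partition key column the routing key
>>             // is simply the UTF-8 encoding of the value
>>             ByteBuffer routingKey =
>>                     ByteBuffer.wrap(key.getBytes(StandardCharsets.UTF_8));
>>             Set<Host> replicas =
>>                     cluster.getMetadata().getReplicas(keyspace, routingKey);
>>             for (Host host : replicas) {
>>                 System.out.println(key + " -> " + host.getAddress());
>>             }
>>         }
>>     }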
>>
>> The whole thing looks like Cassandra internally started to wait - a
>> lot - for the lost node. The updates finally succeeded without
>> failure - at least from the App's (the client's) point of view
>>
>> Has anyone ever experienced similar behavior?
>> What could explain the above?
>>
>> Some more details: the App is implemented in Java 8, we are using
>> the Datastax driver 3.7.1, and the server cluster is running
>> Cassandra 4.0 alpha 4. Cluster size is 3 nodes.
>>
>> Any feedback is appreciated! :-)
>>
>> thanks
>>
>> --
>> Attila Wind
>>
>> http://www.linkedin.com/in/attilaw
>> Mobile: +49 176 43556932
>>
>>