
List:       cassandra-user
Subject:    Re: Nodes go down periodically
From:       Joel Samuelsson <samuelsson.joel () gmail ! com>
Date:       2016-02-24 6:04:28
Message-ID: CAEMs6zSX2Kr-rJgi_XR0h1n+AHJKX8w2sKi+eDgYX--sHe0EWw () mail ! gmail ! com

"Is it only one node at a time that goes down, and at widely dispersed
times?"
It is a two node cluster so both nodes consider the other node down at the
same time.

These are the times from the last few days:
INFO [GossipTasks:1] 2016-02-19 05:06:21,087 Gossiper.java (line 992)
InetAddress /x.x.x.x is now DOWN
INFO [GossipTasks:1] 2016-02-19 14:33:38,424 Gossiper.java (line 992)
InetAddress /x.x.x.x is now DOWN
INFO [GossipTasks:1] 2016-02-20 07:21:25,626 Gossiper.java (line 992)
InetAddress /x.x.x.x is now DOWN
INFO [GossipTasks:1] 2016-02-20 11:34:46,766 Gossiper.java (line 992)
InetAddress /x.x.x.x is now DOWN
INFO [GossipTasks:1] 2016-02-21 08:00:07,518 Gossiper.java (line 992)
InetAddress /x.x.x.x is now DOWN
INFO [GossipTasks:1] 2016-02-21 10:36:58,788 Gossiper.java (line 992)
InetAddress /x.x.x.x is now DOWN
INFO [GossipTasks:1] 2016-02-22 07:10:40,304 Gossiper.java (line 992)
InetAddress /x.x.x.x is now DOWN
INFO [GossipTasks:1] 2016-02-22 10:05:14,896 Gossiper.java (line 992)
InetAddress /x.x.x.x is now DOWN
INFO [GossipTasks:1] 2016-02-23 08:59:05,392 Gossiper.java (line 992)
InetAddress /x.x.x.x is now DOWN
INFO [GossipTasks:1] 2016-02-23 12:22:59,562 Gossiper.java (line 992)
InetAddress /x.x.x.x is now DOWN
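
To check whether the DOWN events cluster at particular hours or recur at a fixed interval, the timestamps can be pulled out of the Cassandra log with a short script. This is only a minimal sketch: it assumes each event sits on a single log line in the 2.0.x format shown above, and the helper names (down_times, gaps_in_hours) are made up for illustration.

```python
import re
from datetime import datetime

# Assumed format, taken from the Gossiper lines quoted in this thread:
# "INFO [GossipTasks:1] 2016-02-19 05:06:21,087 Gossiper.java (line 992) InetAddress /x.x.x.x is now DOWN"
LINE_RE = re.compile(
    r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3} Gossiper\.java .* is now DOWN"
)


def down_times(lines):
    """Return the timestamp of every 'is now DOWN' line, in input order."""
    times = []
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            times.append(datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S"))
    return times


def gaps_in_hours(times):
    """Return the gap between consecutive events, in hours (one decimal)."""
    return [
        round((later - earlier).total_seconds() / 3600, 1)
        for earlier, later in zip(times, times[1:])
    ]


if __name__ == "__main__":
    # Two of the events listed above, joined onto single lines.
    sample = [
        "INFO [GossipTasks:1] 2016-02-19 05:06:21,087 Gossiper.java (line 992) InetAddress /x.x.x.x is now DOWN",
        "INFO [GossipTasks:1] 2016-02-19 14:33:38,424 Gossiper.java (line 992) InetAddress /x.x.x.x is now DOWN",
    ]
    events = down_times(sample)
    print("hours of day:", [t.hour for t in events])
    print("gaps (h):", gaps_in_hours(events))
```

Roughly constant gaps, or the same hour of day recurring, would hint at something scheduled; irregular gaps would not.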


2016-02-23 18:01 GMT+01:00 daemeon reiydelle <daemeonr@gmail.com>:

> If you can, do a few short runs (maybe 10m records each; delete the default
> schema between executions) of the Cassandra stress test against your
> production cluster (replication=3, force quorum to 3). Look for max latency
> in the tens of SECONDS. If your devops team is running a monitoring tool
> that looks at the network, look for timeouts/retries/errors/lost packets,
> etc. during the run. Worst case, you need to do netstat runs against the
> relevant NIC, e.g. every 10 seconds on the CassStress node, and look for
> jumps in these counts. If monitoring is enabled, look at the monitor's
> results for ALL of your nodes. At least one is having some issues.
>
>
> *.......*
>
>
>
> *Daemeon C.M. Reiydelle*
> *USA (+1) 415.501.0198*
> *London (+44) (0) 20 8144 9872*
>
> On Tue, Feb 23, 2016 at 8:43 AM, Jack Krupansky <jack.krupansky@gmail.com>
> wrote:
>
>> The reality of modern distributed systems is that connectivity between
>> nodes is never guaranteed and distributed software must be able to cope
>> with occasional absence of connectivity. GC and network connectivity are
>> the two issues that a lot of us are most familiar with. There may be others
>> - but most technical problems on a node would be clearly logged on that
>> node. If you see a lapse of connectivity no more than once or twice a day,
>> consider yourselves lucky.
>>
>> Is it only one node at a time that goes down, and at widely dispersed
>> times?
>>
>> How many nodes?
>>
>> -- Jack Krupansky
>>
>> On Tue, Feb 23, 2016 at 11:01 AM, Joel Samuelsson <
>> samuelsson.joel@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> Version is 2.0.17.
>>> Yes, these are VMs in the cloud though I'm fairly certain they are on a
>>> LAN rather than WAN. They are both in the same data centre physically. The
>>> phi_convict_threshold is set to default. I'd rather find the root cause of
>>> the problem than just hiding it by not convicting a node if it isn't
>>> responding though. If pings are <2 ms without a single ping missed in
>>> several days, I highly doubt that network is the reason for the downtime.
>>>
>>> Best regards,
>>> Joel
>>>
>>> 2016-02-23 16:39 GMT+01:00 <SEAN_R_DURITY@homedepot.com>:
>>>
>>>> You didn't mention version, but I saw this kind of thing very often in
>>>> the 1.1 line. Often this is connected to network flakiness. Are these VMs?
>>>> In the cloud? Connected over a WAN? You mention that ping seems fine. Take
>>>> a look at the phi_convict_threshold in cassandra.yaml. You may need to
>>>> increase it to reduce the UP/DOWN flapping behavior.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> Sean Durity
>>>>
>>>>
>>>>
>>>> *From:* Joel Samuelsson [mailto:samuelsson.joel@gmail.com]
>>>> *Sent:* Tuesday, February 23, 2016 9:41 AM
>>>> *To:* user@cassandra.apache.org
>>>> *Subject:* Re: Nodes go down periodically
>>>>
>>>>
>>>>
>>>> Hi,
>>>>
>>>>
>>>>
>>>> Thanks for your reply.
>>>>
>>>>
>>>>
>>>> I have debug logging on and see no GC pauses that are that long. GC
>>>> pauses are all well below 1s and 99 times out of 100 below 100ms.
>>>>
>>>> Do I need to enable GC log options to see the pauses?
>>>>
>>>> I see plenty of these lines:
>>>> DEBUG [ScheduledTasks:1] 2016-02-22 10:43:02,891 GCInspector.java (line
>>>> 118) GC for ParNew: 24 ms for 1 collections
>>>>
>>>> as well as a few CMS GC log lines.
>>>>
>>>>
>>>>
>>>> Best regards,
>>>>
>>>> Joel
>>>>
>>>>
>>>>
>>>> 2016-02-23 15:14 GMT+01:00 Hannu Kröger <hkroger@gmail.com>:
>>>>
>>>> Hi,
>>>>
>>>>
>>>>
>>>> Those are probably GC pauses. Memory tuning is probably needed. Check
>>>> whether the parameters that you have already customised make sense.
>>>>
>>>>
>>>>
>>>> http://blog.mikiobraun.de/2010/08/cassandra-gc-tuning.html
>>>>
>>>>
>>>>
>>>> Hannu
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On 23 Feb 2016, at 16:08, Joel Samuelsson <samuelsson.joel@gmail.com>
>>>> wrote:
>>>>
>>>>
>>>>
>>>> Our nodes go down periodically, around 1-2 times each day. Downtime is
>>>> from <1 second to 30 or so seconds.
>>>>
>>>>
>>>>
>>>> INFO [GossipTasks:1] 2016-02-22 10:05:14,896 Gossiper.java (line 992)
>>>> InetAddress /109.74.13.67 is now DOWN
>>>>
>>>> INFO [RequestResponseStage:8844] 2016-02-22 10:05:38,331 Gossiper.java
>>>> (line 978) InetAddress /109.74.13.67 is now UP
>>>>
>>>>
>>>>
>>>> I find nothing odd in the logs around the same time. I logged a ping
>>>> with timestamp and checked during the same time and saw nothing weird (ping
>>>> is less than 2ms at all times).
>>>>
>>>>
>>>>
>>>> Does anyone have any suggestions as to why this might happen?
>>>>
>>>>
>>>>
>>>> Best regards,
>>>> Joel
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>
>



