
List:       cassandra-user
Subject:    Re: Incremental repairs getting stuck a lot
From:       James Brown <jbrown () easypost ! com>
Date:       2021-11-27 4:54:37
Message-ID: CALgeuKzDwu-qUrPuqg3UybThRqUvq4QPnPESxH6qOXkgJKSAog () mail ! gmail ! com

I filed this as CASSANDRA-17172
<https://issues.apache.org/jira/browse/CASSANDRA-17172>
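
In the meantime, the workaround from the quoted report below (spot a session sitting in REPAIRING with no recent activity, then cancel it) could be sketched roughly like this in Python. This is only a sketch under assumptions: the function names and the one-hour threshold are made up for illustration, and the parsing assumes the pipe-separated `nodetool repair_admin` layout and the `6771 (s)` style of the "last activity" column shown in the quoted output. It only builds the cancel commands; it does not run them.

```python
import re

# Hypothetical threshold (not from the report): treat a REPAIRING session
# as stuck once it has shown no activity for an hour.
STALE_AFTER_SECONDS = 3600


def find_stale_sessions(repair_admin_output, stale_after=STALE_AFTER_SECONDS):
    """Return (session_id, idle_seconds) for REPAIRING sessions idle too long.

    Assumes the column order seen in the quoted output:
    id | state | last activity | coordinator | participants | participants_wp
    """
    stale = []
    for line in repair_admin_output.splitlines():
        fields = [field.strip() for field in line.split("|")]
        if len(fields) < 3 or fields[1] != "REPAIRING":
            continue  # header, blank line, or a session in another state
        # Assumed format of the "last activity" column, e.g. "6771 (s)".
        match = re.fullmatch(r"(\d+)\s*\(s\)", fields[2])
        if match and int(match.group(1)) > stale_after:
            stale.append((fields[0], int(match.group(1))))
    return stale


def cancel_commands(stale_sessions):
    """Build the cancel command quoted in the report for each stale session."""
    return [
        f"nodetool repair_admin cancel -s {session_id}"
        for session_id, _idle in stale_sessions
    ]
```

Feeding it the captured output of `nodetool repair_admin` and printing the result of `cancel_commands(find_stale_sessions(output))` gives one command per stuck session, which can be reviewed before anything is actually cancelled.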

On Fri, Nov 26, 2021 at 5:33 PM Dinesh Joshi <djoshi@apache.org> wrote:

> Could you file a jira with the details?
> 
> Dinesh
> 
> On Nov 26, 2021, at 2:40 PM, James Brown <jbrown@easypost.com> wrote:
> 
> 
> We're on 4.0.1 and switched to incremental repairs a couple of months ago.
> They work fine about 95% of the time, but once in a while a session will
> get stuck and will have to be cancelled (with
> `nodetool repair_admin cancel -s <uuid>`). Typically the session will be in
> REPAIRING but nothing will actually be happening.
> 
> Output of nodetool repair_admin:
> 
> $ nodetool repair_admin
> id                                   | state     | last activity | coordinator                          | participants | participants_wp
> 3a059b10-4ef6-11ec-925f-8f7bcf0ba035 | REPAIRING | 6771 (s)      | /[fd00:ea51:d057:200:1:0:0:8e]:25472 | fd00:ea51:d057:200:1:0:0:8e,fd00:ea51:d057:200:1:0:0:8f,fd00:ea51:d057:200:1:0:0:92,fd00:ea51:d057:100:1:0:0:571,fd00:ea51:d057:100:1:0:0:570,fd00:ea51:d057:200:1:0:0:93,fd00:ea51:d057:100:1:0:0:573,fd00:ea51:d057:200:1:0:0:90,fd00:ea51:d057:200:1:0:0:91,fd00:ea51:d057:100:1:0:0:572,fd00:ea51:d057:100:1:0:0:575,fd00:ea51:d057:100:1:0:0:574,fd00:ea51:d057:200:1:0:0:94,fd00:ea51:d057:100:1:0:0:577,fd00:ea51:d057:200:1:0:0:95,fd00:ea51:d057:100:1:0:0:576 | [fd00:ea51:d057:200:1:0:0:8e]:25472,[fd00:ea51:d057:200:1:0:0:8f]:25472,[fd00:ea51:d057:200:1:0:0:92]:25472,[fd00:ea51:d057:100:1:0:0:571]:25472,[fd00:ea51:d057:100:1:0:0:570]:25472,[fd00:ea51:d057:200:1:0:0:93]:25472,[fd00:ea51:d057:100:1:0:0:573]:25472,[fd00:ea51:d057:200:1:0:0:90]:25472,[fd00:ea51:d057:200:1:0:0:91]:25472,[fd00:ea51:d057:100:1:0:0:572]:25472,[fd00:ea51:d057:100:1:0:0:575]:25472,[fd00:ea51:d057:100:1:0:0:574]:25472,[fd00:ea51:d057:200:1:0:0:94]:25472,[fd00:ea51:d057:100:1:0:0:577]:25472,[fd00:ea51:d057:200:1:0:0:95]:25472,[fd00:ea51:d057:100:1:0:0:576]:25472
> 
> Running `jstack` on the coordinator shows two repair threads, both idle:
> 
> "Repair#167:1" #602177 daemon prio=5 os_prio=0 cpu=9.60ms elapsed=57359.81s tid=0x00007fa6d1741800 nid=0x18e6c waiting on condition [0x00007fc529f9a000]
>    java.lang.Thread.State: TIMED_WAITING (parking)
>     at jdk.internal.misc.Unsafe.park(java.base@11.0.11/Native Method)
>     - parking to wait for  <0x000000045ba93a18> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>     at java.util.concurrent.locks.LockSupport.parkNanos(java.base@11.0.11/LockSupport.java:234)
>     at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(java.base@11.0.11/AbstractQueuedSynchronizer.java:2123)
>     at java.util.concurrent.LinkedBlockingQueue.poll(java.base@11.0.11/LinkedBlockingQueue.java:458)
>     at java.util.concurrent.ThreadPoolExecutor.getTask(java.base@11.0.11/ThreadPoolExecutor.java:1053)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.11/ThreadPoolExecutor.java:1114)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.11/ThreadPoolExecutor.java:628)
>     at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>     at java.lang.Thread.run(java.base@11.0.11/Thread.java:829)
> 
> "Repair#170:1" #654814 daemon prio=5 os_prio=0 cpu=9.62ms elapsed=7369.98s tid=0x00007fa6aec09000 nid=0x1a96f waiting on condition [0x00007fc535aae000]
>    java.lang.Thread.State: TIMED_WAITING (parking)
>     at jdk.internal.misc.Unsafe.park(java.base@11.0.11/Native Method)
>     - parking to wait for  <0x00000004c45bf7d8> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>     at java.util.concurrent.locks.LockSupport.parkNanos(java.base@11.0.11/LockSupport.java:234)
>     at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(java.base@11.0.11/AbstractQueuedSynchronizer.java:2123)
>     at java.util.concurrent.LinkedBlockingQueue.poll(java.base@11.0.11/LinkedBlockingQueue.java:458)
>     at java.util.concurrent.ThreadPoolExecutor.getTask(java.base@11.0.11/ThreadPoolExecutor.java:1053)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.11/ThreadPoolExecutor.java:1114)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.11/ThreadPoolExecutor.java:628)
>     at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>     at java.lang.Thread.run(java.base@11.0.11/Thread.java:829)
> 
> nodetool netstats says there is nothing happening:
> 
> $ nodetool netstats | head -n 2
> Mode: NORMAL
> Not sending any streams.
> 
> There's nothing interesting in the logs for this repair; the last relevant
> thing was a bunch of "Created 0 sync tasks based on 6 merkle tree
> responses for 3a059b10-4ef6-11ec-925f-8f7bcf0ba035 (took: 0ms)" messages,
> followed by a couple of hours of back-and-forth status checks like
> 
> 2021-11-26T21:33:20Z cassandra10nuq 129529 | INFO  [OptionalTasks:1] LocalSessions.java:938 - Attempting to learn the outcome of unfinished local incremental repair session 3a059b10-4ef6-11ec-925f-8f7bcf0ba035
> 2021-11-26T21:33:20Z cassandra10nuq 129529 | INFO  [AntiEntropyStage:1] LocalSessions.java:987 - Received StatusResponse for repair session 3a059b10-4ef6-11ec-925f-8f7bcf0ba035 with state REPAIRING, which is not actionable. Doing nothing.
> 
> Typically, cancelling the session and rerunning with the exact same
> command line will succeed.
> --
> James Brown
> Engineer
> 
> 

-- 
James Brown
Engineer

