
List:       mesos-user
Subject:    Re: Duplicate task ID for same framework on different agents
From:       Benjamin Mahler <bmahler () apache ! org>
Date:       2017-12-22 3:29:39
Message-ID: CAFp_Nivpp8_Crjh6iwmwPV=D8vAH1Jm4phtqgP2s+hvtFEsPdA () mail ! gmail ! com

It's a known issue:
https://issues.apache.org/jira/browse/MESOS-3070

Putting in place a protection mechanism sounds good, but is rather
complicated. See the comment in this ticket:
https://issues.apache.org/jira/browse/MESOS-6785

On Wed, Dec 20, 2017 at 8:26 PM, Zhitao Li <zhitaoli.cs@gmail.com> wrote:

> Hi all,
>
> We have seen a Mesos master crash loop after a leader failover. After more
> investigation, it seems that the same task ID ended up being created on
> multiple Mesos agents in the cluster.
>
> One possible sequence of events that can lead to this problem:
>
> 1. Task T1 was launched through master M1 on agent A1 for framework F;
> 2. Master M1 failed over to M2;
> 3. Before A1 reregistered with M2, the same T1 was launched on agent A2:
> M2 did not know about the previous T1 yet, so it accepted it and sent it to A2;
> 4. A1 reregistered: this probably crashed M2 (because the same task cannot be
> added twice);
> 5. When M3 tried to come up after M2, it crashed as well because both A1
> and A2 tried to add a T1 to the framework.
>
> (I only have logs to prove the last step right now)
>
> This happened on 1.4.0 masters.
>
> Although this was probably triggered by incorrect retry logic on the framework
> side, I wonder whether the Mesos master should add extra protection to prevent
> such an issue from causing a master crash loop. Some possible ideas are to
> instruct one of the agents carrying tasks with a duplicate ID to terminate the
> corresponding tasks, or to simply refuse to reregister such agents and instruct
> them to shut down.
>
> I also filed MESOS-8353 <https://issues.apache.org/jira/browse/MESOS-8353>
> to track this potential bug. Thanks!
>
>
> --
>
> Cheers,
>
> Zhitao Li
>
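
For readers following the thread, the quoted sequence (steps 1 to 4) and the
"refuse to reregister" idea can be sketched as a toy model. This is only an
illustration under stated assumptions: ToyMaster, DuplicateTaskError and
simulate are hypothetical names, the task table is a plain dictionary, and the
master crash is modeled as an exception. None of this is Mesos code, and it
ignores the complications discussed in MESOS-6785.

```python
class DuplicateTaskError(Exception):
    """Raised when a (framework, task ID) pair is added twice."""


class ToyMaster:
    """Hypothetical in-memory model of a master's task table (not Mesos code)."""

    def __init__(self, name):
        self.name = name
        self.tasks = {}  # (framework_id, task_id) -> agent_id

    def add_task(self, framework_id, task_id, agent_id):
        key = (framework_id, task_id)
        if key in self.tasks:
            # Stands in for the master crash described in steps 4 and 5.
            raise DuplicateTaskError(f"{task_id} already on {self.tasks[key]}")
        self.tasks[key] = agent_id

    def reregister_agent(self, agent_id, running):
        """Unguarded path: add every task the agent reports."""
        for framework_id, task_id in running:
            self.add_task(framework_id, task_id, agent_id)

    def reregister_agent_guarded(self, agent_id, running):
        """Guarded path: refuse the agent if any reported task ID conflicts."""
        conflicts = [key for key in running
                     if self.tasks.get(key, agent_id) != agent_id]
        if conflicts:
            return False, conflicts
        for key in running:
            self.tasks[key] = agent_id
        return True, []


def simulate():
    m1 = ToyMaster("M1")
    m1.add_task("F", "T1", "A1")      # step 1: T1 runs on A1
    m2 = ToyMaster("M2")              # step 2: new leader starts with an empty table
    m2.add_task("F", "T1", "A2")      # step 3: framework launches T1 again on A2
    try:
        m2.reregister_agent("A1", [("F", "T1")])  # step 4: A1 reports its T1
        crashed = False
    except DuplicateTaskError:
        crashed = True

    # Same state, but reregistration goes through the guarded path.
    m2_guarded = ToyMaster("M2")
    m2_guarded.add_task("F", "T1", "A2")
    accepted, conflicts = m2_guarded.reregister_agent_guarded(
        "A1", [("F", "T1")])
    return crashed, accepted, conflicts
```

Calling simulate() exercises both paths: the unguarded reregistration hits the
duplicate, while the guarded one rejects A1 and reports the conflicting task
instead of failing. What a real master should then do with the rejected agent
is the hard part and is left open here.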
