'Re: framework unregistration bug for Java and Python frameworks'

List:       mesos-user
Subject:    Re: framework unregistration bug for Java and Python frameworks
From:       Benjamin Hindman <benjamin.hindman () gmail ! com>
Date:       2014-06-28 3:24:06
Message-ID: CAFeOQnUDyOFBFxsmqKHHMp6wyzcv+9gUvLwORWVrehCAMJkMdg () mail ! gmail ! com

Forgot to include the JIRA link for folks to follow along:
https://issues.apache.org/jira/browse/MESOS-1550


On Fri, Jun 27, 2014 at 8:22 PM, Benjamin Hindman <
benjamin.hindman@gmail.com> wrote:

> If you have written or maintain a Mesos framework please read on.
>
> *What:* Today a long-standing bug was found in the MesosSchedulerDriver
> for Java and Python that causes a framework to be unregistered from Mesos
> without the framework doing so explicitly.
>
> *How:* In the normal lifecycle of a framework, the scheduler calls
> 'stop()' on its instance of MesosSchedulerDriver when it's done using the
> driver. IMPORTANT: If the framework plans to fail over, it must pass 'true'
> to 'stop()'; otherwise it passes 'false' (the default).
>
> Some very old code (from before the introduction of the 'failover' boolean
> argument) that gets invoked when a Java or Python MesosSchedulerDriver gets
> garbage collected was calling 'stop()' with the default value of 'false',
> indicating that the framework would not be failing over and reconnecting
> to Mesos.
>
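
To make the mechanism concrete, here is a minimal sketch in plain Python. It does not use the real Mesos bindings: FakeSchedulerDriver and the events list are hypothetical stand-ins, and the finalizer only models the old behavior described above (calling 'stop()' with the default 'false' when the driver is collected).

```python
import gc

# Hypothetical record of what the master would observe; not a Mesos API.
events = []


class FakeSchedulerDriver:
    """Toy model of the old finalizer behavior; NOT the real Mesos binding."""

    def __init__(self, name):
        self.name = name
        self.stopped = False

    def stop(self, failover=False):
        # A second call is a no-op, matching "stop() has not already been called".
        if self.stopped:
            return
        self.stopped = True
        # failover=False models telling Mesos the framework will not reconnect.
        events.append((self.name, "failover" if failover else "unregistered"))

    def __del__(self):
        # Models the old code path: stop() with the default failover=False.
        self.stop()


def run_and_forget():
    driver = FakeSchedulerDriver("my-framework")
    # Scheduler work would happen here; stop() is never called.
    # The only reference to the driver disappears when this function returns.


run_and_forget()
gc.collect()  # make collection explicit for the demonstration
print(events)
```

In this toy model the framework ends up recorded as unregistered even though the application never asked for that, which is the symptom described above.
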
> *Why:* In particular, why wasn't this bug found before? This behavior
> only occurs when the MesosSchedulerDriver instance actually gets garbage
> collected _AND_ 'stop()' has not already been called. Moreover, in most
> applications that don't call 'stop()', the MesosSchedulerDriver does not
> get garbage collected, either because a reference is maintained for the
> lifetime of the application _OR_ because the application is terminated
> before the garbage collector kicks in! Our best guess as to why this was
> uncovered today is that, for whatever reason, the garbage collector kicked
> in and 'stop()' got invoked.
>
> *Short-term Mitigation:*
>
> (1) Never destroy your reference to MesosSchedulerDriver (so the garbage
> collector never cleans it up).
> (2) Always call 'stop(true)' after you're done with the
> MesosSchedulerDriver if you plan on failing over!
>
> In addition, we'll be releasing a *0.19.1* bug-fix release that fixes
> this issue.
>
> Apologies for any inconvenience this may cause folks. Big thanks to
> Whitney Sorensen for reporting the bug and Vinod Kone for tracking it down.
>
> Ben.
>
