List:       mesos-user
Subject:    Re: Updating FrameworkInfo settings
From:       Thomas Petr <tpetr () hubspot ! com>
Date:       2015-02-25 1:29:45
Message-ID: CAJRB3TGnHV15vGA7bvJ-nmdxKgPkOTkXdjTqJu7jPvcpN-OKUA () mail ! gmail ! com

Aha, thanks! Will give this a shot tomorrow morning.

On Tuesday, February 24, 2015, Vinod Kone <vinodkone@apache.org> wrote:

> Changing FrameworkInfo (while keeping the FrameworkID) is not handled
> correctly by Mesos at the moment. This is what you currently need to do to
> propagate FrameworkInfo.checkpoint throughout the cluster.
>
> --> Update FrameworkInfo inside your framework and re-register with
> master. (Old FrameworkInfo is still cached at master and slaves).
> --> Fail over the leading master. (New FrameworkInfo will be cached by new
> leading master).
> --> Hard restart (kill slave and wipe metadata) your slaves in batches.
>
> The proper fix for this is tracked at:
> https://issues.apache.org/jira/browse/MESOS-703
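
The "in batches" part of the last step can be sketched as below. This is a minimal illustration, not a Mesos tool: the host names and the batch size are hypothetical, and the actual kill, wipe, and restart commands are left out because they depend on how the slaves are deployed.

```python
from typing import Iterator, List, Sequence


def batches(hosts: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive groups of at most `size` hosts."""
    if size < 1:
        raise ValueError("batch size must be at least 1")
    for start in range(0, len(hosts), size):
        yield list(hosts[start:start + size])


if __name__ == "__main__":
    # Hypothetical slave host names, for illustration only.
    slaves = ["slave-01", "slave-02", "slave-03", "slave-04", "slave-05"]
    for number, group in enumerate(batches(slaves, 2), start=1):
        # Hard-restart every slave in `group` here, then wait for the
        # batch to re-register before moving on to the next one.
        print(f"batch {number}: {', '.join(group)}")
```

Keeping the batch size small limits how many tasks are lost at once, since a hard restart discards the tasks running on that slave.
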
>
> On Tue, Feb 24, 2015 at 4:23 PM, Zameer Manji <zmanji@twopensource.com>
> wrote:
>
>> For anyone who is going to read this information in the future, this
>> works because the information in the replicated log can be recovered by the
>> master. In future releases of Mesos the master might store information
>> which cannot be recovered so please take extra care if you are going to do
>> this.
>>
>> On Tue, Feb 24, 2015 at 4:11 PM, Steve Niemitz <steve@tellapart.com>
>> wrote:
>>
>>> Definitely don't change the FrameworkID. We did that once and it was a
>>> disaster, for reasons described already.
>>>
>>> Here's what we did to force it on (as best I can recall):
>>> - Change the startup flags for all masters to use the in memory DB
>>> instead of the replicated log (--registry=in_memory)
>>> - Restart all masters (not all at once, let them fail over)
>>> - Delete the replicated log on all masters
>>> - Ensure the framework is now registered with checkpoint = true (the
>>> slaves won't be yet, however)
>>> - Remove the --registry flag from the masters and do a rolling restart
>>> again
>>> - Do another rolling restart of the masters
>>> *- At this point the framework will be persisted as checkpoint = true*
>>> - Now, restart your slaves. Restarting them should cause them to pick
>>> up the new framework. I'm not 100% sure whether I deleted their state
>>> when I did this part; if it doesn't seem to take, try deleting the slave
>>> info on each one.
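
The "ensure the framework is now registered with checkpoint = true" check can be sketched as below. It assumes the state JSON served by a master or slave has a top-level "frameworks" list whose entries carry "id" and "checkpoint" fields; that shape is an assumption to verify against your Mesos version, and the sample data is hand-written rather than real output.

```python
from typing import Any, Dict, Optional


def framework_checkpoint(state: Dict[str, Any], framework_id: str) -> Optional[bool]:
    """Return the framework's checkpoint flag, or None if it is not listed."""
    # Assumed shape: {"frameworks": [{"id": ..., "checkpoint": ...}, ...]}
    for framework in state.get("frameworks", []):
        if framework.get("id") == framework_id:
            return bool(framework.get("checkpoint", False))
    return None


if __name__ == "__main__":
    # Hand-written sample, not real Mesos output; the ID is hypothetical.
    sample = {"frameworks": [{"id": "example-framework-0000", "checkpoint": True}]}
    print(framework_checkpoint(sample, "example-framework-0000"))
```

Running the same check against the masters first and then against each restarted slave shows how far the new FrameworkInfo has propagated.
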
>>>
>>> On Tue, Feb 24, 2015 at 4:02 PM, Zameer Manji <zmanji@twopensource.com>
>>> wrote:
>>>
>>>> I would like to point out that using a new FrameworkID is not a
>>>> solution to this problem. It means that a cluster operator has to drain
>>>> the entire cluster to enable checkpointing, or lose all previous tasks.
>>>> Neither scenario is desirable.
>>>>
>>>> Fortunately, it is possible to do this without changing the FrameworkID.
>>>> I have CC'd Steve from TellApart, who has enabled checkpointing without
>>>> changing the FrameworkID on a production cluster. I hope he can share his
>>>> process here.
>>>>
>>>> On Tue, Feb 24, 2015 at 3:51 PM, Tim Chen <tim@mesosphere.io> wrote:
>>>>
>>>>> Mesos checkpoints the FrameworkInfo to disk, and recovers it on
>>>>> relaunch.
>>>>>
>>>>> If you really want to keep the FrameworkID, though, I don't think we
>>>>> expose any API to remove the framework manually. If you hit the failover
>>>>> timeout, the framework will get removed from the master and slave.
>>>>>
>>>>> I think for now the best way is just to use a new FrameworkID when you
>>>>> want to change the FrameworkInfo.
>>>>>
>>>>> Tim
>>>>>
>>>>>
>>>>>
>>>>> On Tue, Feb 24, 2015 at 3:32 PM, Thomas Petr <tpetr@hubspot.com> wrote:
>>>>>
>>>>>> Hey folks,
>>>>>>
>>>>>> Is there a best practice for rolling out FrameworkInfo changes? We
>>>>>> need to set checkpoint to true, so I redeployed our framework with
>>>>>> the new settings (with tasks still running), but when I hit a slave's
>>>>>> stats.json endpoint, it appears that the old FrameworkInfo data is
>>>>>> still there (which makes sense since there are active executors running). I
>>>>>> then tried draining the tasks and completely restarting a Mesos slave, but
>>>>>> still no luck.
>>>>>>
>>>>>> Is there anything additional / special I need to do here? Is some
>>>>>> part of Mesos caching FrameworkInfo based on the framework ID?
>>>>>>
>>>>>> Another wrinkle with our setup is we have a rather large
>>>>>> failover_timeout set for the framework -- maybe that's affecting
>>>>>> things too?
>>>>>>
>>>>>> Thanks,
>>>>>> Tom
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Zameer Manji
>>>>
>>>
>>>
>>
>>
>> --
>> Zameer Manji
>>
>
>
