'Re: cluster confusion after zookeeper blip'

List:       mesos-user
Subject:    Re: cluster confusion after zookeeper blip
From:       Jeff Schroeder <jeffschroeder () computer ! org>
Date:       2015-05-18 21:19:45
Message-ID: CAMReA90qfaBOWSwmKgbA6uOYQWp4K54XPY+9aUa=8DcHtA7wag () mail ! gmail ! com

Not that this is super helpful for your issue, but I ran into an identical
problem this morning with Aurora on top of Mesos: the scheduler was
inoperable because my ZK ensemble lost quorum and was generally misbehaving.
However, as soon as I fixed the quorum, things recovered immediately. I
believe it had to do with the replicated log that Aurora uses.
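
For anyone wanting to confirm the ensemble really has quorum again, here is a
rough sketch using ZooKeeper's "srvr" four-letter command. It assumes the
four-letter commands are reachable on the client port (newer ZooKeeper
releases require them to be whitelisted), and the host names in the usage
note are placeholders, not anything from this thread.

```python
import socket


def four_letter(host, port, cmd=b"srvr", timeout=2.0):
    """Send a ZooKeeper four-letter command and return the reply as text."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(cmd)
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # server closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def parse_mode(reply):
    """Return the 'Mode:' value from a srvr reply, or None if it is absent.

    A server that is not serving requests (e.g. no quorum) replies without
    a 'Mode:' line, so None is treated as "not serving".
    """
    for line in reply.splitlines():
        if line.startswith("Mode:"):
            return line.split(":", 1)[1].strip()
    return None


def has_quorum(modes, ensemble_size):
    """True if exactly one leader exists and a strict majority is serving."""
    serving = [m for m in modes if m in ("leader", "follower")]
    return modes.count("leader") == 1 and len(serving) > ensemble_size // 2
```

Usage would look something like
`has_quorum([parse_mode(four_letter(h, 2181)) for h in hosts], len(hosts))`
with `hosts` set to your own ensemble members (say, the hypothetical
`["zk1.example.com", "zk2.example.com", "zk3.example.com"]`). An unreachable
node raises `OSError` from `four_letter`, which you would want to catch and
count as not serving.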

On Monday, May 18, 2015, Dick Davies <dick@hellooperator.net> wrote:

> We run a 3 node marathon cluster on top of 3 mesos masters + 6 slaves.
> (mesos 0.21.0, marathon 0.7.5)
>
> This morning we had a network outage long enough for everything to
> lose zookeeper.
> Now our marathon UI is empty (all 3 marathons think someone else is a
> master, and marathon's 'proxy to leader' feature means the REST API is
> toast).
>
> Odd thing is, at the mesos level, the mesos master UI shows no tasks
> running (logs mention orphaned tasks), but if I click into the 'slaves'
> tab and dig down, the slave view details tasks that are in fact active.
>
> Any way to bring order to this without needing to kill those tasks? We
> have no actual outage from a user point of view, but the cluster
> itself is pretty confused and our service discovery relies on the
> marathon API which is timing out.
>
> Although mesos has checkpointing enabled, marathon isn't running with
> checkpointing on (it's the default now but doesn't apply to existing
> frameworks apparently, and we started this around marathon 0.4.x)
>
> Would enabling checkpointing help with this kind of issue? If so, how
> do I enable it for an existing framework?
>


-- 
Text by Jeff, typos by iPhone



