
List:       mesos-user
Subject:    Re: Mesos Master Crashes when Task launched with LAUNCH_GROUP fails
From:       Meng Zhu <mzhu () mesosphere ! com>
Date:       2019-02-28 23:01:57
Message-ID: CAAr6wuwbXCdakdbVVFZtCpaLVP=UgV=-VFB-8oBnZ9wBaujbcg () mail ! gmail ! com

Hi Nimi:

Thanks for reporting this.

From the log snippet, it looks like, when de-allocating resources, the agent
does not have the port resources that are supposed to have been allocated.
Can you provide the master log (covering at least the period from when
the resources on the agent are offered to the crash point)? Also, can you
create a JIRA ticket and upload the log there?
(https://issues.apache.org/jira/projects/MESOS/issues)
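
To illustrate what the failed check is comparing, here is a minimal Python
sketch. It is a simplified, hypothetical model and not Mesos code: the
`contains` helper and the dict layout are made up for illustration, and
role/reservation metadata is ignored. The quantities are copied from the
check-failure message in your log.

```python
# Hypothetical, simplified model of the allocator's bookkeeping; not Mesos code.
from typing import Dict, Set, Union

# A scalar amount (cpus, mem, disk) or a set of individual port numbers.
Quantity = Union[float, Set[int]]


def contains(held: Dict[str, Quantity], to_remove: Dict[str, Quantity]) -> bool:
    """Return True if every resource in to_remove is covered by held."""
    for name, wanted in to_remove.items():
        have = held.get(name)
        if have is None:
            return False  # resource kind is not tracked at all
        if isinstance(wanted, set):
            if not isinstance(have, set) or not wanted <= have:
                return False  # some requested ports are not tracked
        elif isinstance(have, set) or have < wanted:
            return False  # not enough of a scalar resource
    return True


# What the sorter tracks as allocated on the agent, per the log message.
tracked = {"disk": 1.0, "cpus": 0.1, "mem": 64.0}
# What the master asks the allocator to recover, per the log message.
to_remove = {"cpus": 0.1, "mem": 64.0, "disk": 1.0, "ports": {7777}}

missing = sorted(set(to_remove) - set(tracked))
print(contains(tracked, to_remove))  # False
print(missing)  # ['ports']
```

In this model the only mismatch is the `ports` entry, which is why the
master log from the offer onwards should show where the port went missing.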

-Meng

On Thu, Feb 28, 2019 at 1:58 PM Nimi W <psnim2000@gmail.com> wrote:

> Hi,
>
> Mesos: 1.7.1
>
> I'm trying to debug an issue where, if I launch a task using the
> LAUNCH_GROUP method and the task fails to start, the Mesos master
> crashes. I am using a custom framework that I've built on the HTTP
> Scheduler API.
>
> When my framework receives an offer, it responds with an ACCEPT call
> containing this JSON:
>
> https://gist.github.com/nemosupremo/3b23c4e1ca0ab241376aa5b975993270
>
> I then receive the following UPDATE events:
>
> TASK_STARTING
> TASK_RUNNING
> TASK_FAILED
>
> My framework then immediately tries to relaunch the task on the next
> OFFERS event:
>
> https://gist.github.com/nemosupremo/2b02443241c3bd002f04be034d8e64f7
>
> But sometime between receiving that event and trying to acknowledge the
> TASK_FAILED update, the Mesos master crashes with:
>
> Feb 28 21:34:02 master03 mesos-master[7124]: F0228 21:34:02.118693  7142
> sorter.hpp:357] Check failed: resources.at(slaveId).contains(toRemove)
> Resources disk(allocated: faust)(reservations: [(STATIC,faust)]):1;
> cpus(allocated: faust)(reservations: [(STATIC,faust)]):0.1; mem(allocated:
> faust)(reservations: [(STATIC,faust)]):64 at agent
> 643078ba-8cb8-4582-b9c3-345d602506c8-S0 does not contain cpus(allocated:
> faust)(reservations: [(STATIC,faust)]):0.1; mem(allocated:
> faust)(reservations: [(STATIC,faust)]):64; disk(allocated:
> faust)(reservations: [(STATIC,faust)]):1; ports(allocated:
> faust)(reservations: [(STATIC,faust)]):[7777-7777]
> Feb 28 21:34:02 master03 mesos-master[7124]: *** Check failure stack
> trace: ***
> Feb 28 21:34:02 master03 mesos-master[7124]:     @     0x7f1fd935e48d
> google::LogMessage::Fail()
> Feb 28 21:34:02 master03 mesos-master[7124]:     @     0x7f1fd9360240
> google::LogMessage::SendToLog()
> Feb 28 21:34:02 master03 mesos-master[7124]:     @     0x7f1fd935e073
> google::LogMessage::Flush()
> Feb 28 21:34:02 master03 mesos-master[7124]:     @     0x7f1fd9360c69
> google::LogMessageFatal::~LogMessageFatal()
> Feb 28 21:34:02 master03 mesos-master[7124]:     @     0x7f1fd83d85f8
> mesos::internal::master::allocator::DRFSorter::unallocated()
> Feb 28 21:34:02 master03 mesos-master[7124]:     @     0x7f1fd83a78af
> mesos::internal::master::allocator::internal::HierarchicalAllocatorProcess::untrackAllocatedResources()
> Feb 28 21:34:02 master03 mesos-master[7124]:     @     0x7f1fd83ba281
> mesos::internal::master::allocator::internal::HierarchicalAllocatorProcess::recoverResources()
> Feb 28 21:34:02 master03 mesos-master[7124]:     @     0x7f1fd92a6631
> process::ProcessBase::consume()
> Feb 28 21:34:02 master03 mesos-master[7124]:     @     0x7f1fd92c878a
> process::ProcessManager::resume()
> Feb 28 21:34:02 master03 mesos-master[7124]:     @     0x7f1fd92cc4d6
> _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv
> Feb 28 21:34:02 master03 mesos-master[7124]:     @     0x7f1fd6289c80
> (unknown)
> Feb 28 21:34:02 master03 mesos-master[7124]:     @     0x7f1fd5da56ba
> start_thread
> Feb 28 21:34:02 master03 mesos-master[7124]:     @     0x7f1fd5adb41d
> (unknown)
> Feb 28 21:34:02 master03 systemd[1]: mesos-master.service: Main process
> exited, code=killed, status=6/ABRT
> Feb 28 21:34:02 master03 systemd[1]: mesos-master.service: Unit entered
> failed state.
> Feb 28 21:34:02 master03 systemd[1]: mesos-master.service: Failed with
> result 'signal'.
>
> The entire process works with the older LAUNCH API (for some reason, the
> Docker task crashes with filesystem permission issues when using
> LAUNCH_GROUP).
>

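For readers without access to the gists, the general shape of an ACCEPT call
carrying a LAUNCH_GROUP operation in the v1 HTTP Scheduler API can be
sketched as below. This is not the JSON from the gists: all IDs are
placeholders, the helper names are hypothetical, and the task's command or
container settings as well as role and reservation details are left out.

```python
import json
from typing import Any, Dict, List


def scalar(name: str, value: float) -> Dict[str, Any]:
    """Build a SCALAR resource such as cpus, mem or disk."""
    return {"name": name, "type": "SCALAR", "scalar": {"value": value}}


def port_range(begin: int, end: int) -> Dict[str, Any]:
    """Build a RANGES resource describing one contiguous port range."""
    return {"name": "ports", "type": "RANGES",
            "ranges": {"range": [{"begin": begin, "end": end}]}}


def accept_launch_group(framework_id: str, offer_id: str, agent_id: str,
                        executor_id: str, task_id: str,
                        executor_resources: List[Dict[str, Any]],
                        task_resources: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Build an ACCEPT call with a single LAUNCH_GROUP operation."""
    return {
        "framework_id": {"value": framework_id},
        "type": "ACCEPT",
        "accept": {
            "offer_ids": [{"value": offer_id}],
            "operations": [{
                "type": "LAUNCH_GROUP",
                "launch_group": {
                    # Task groups run under an executor shared by the group.
                    "executor": {
                        "type": "DEFAULT",
                        "executor_id": {"value": executor_id},
                        "framework_id": {"value": framework_id},
                        "resources": executor_resources,
                    },
                    "task_group": {"tasks": [{
                        "name": task_id,
                        "task_id": {"value": task_id},
                        "agent_id": {"value": agent_id},
                        "resources": task_resources,
                    }]},
                },
            }],
        },
    }


# Placeholder IDs; resource amounts are illustrative only.
call = accept_launch_group(
    "framework-id", "offer-id", "agent-id", "executor-id", "task-id",
    executor_resources=[scalar("cpus", 0.1), scalar("mem", 32), scalar("disk", 1)],
    task_resources=[scalar("cpus", 0.1), scalar("mem", 64), port_range(7777, 7777)],
)
print(json.dumps(call, indent=2))
```

Note that with LAUNCH_GROUP both the executor and each task consume resources
from the offer, so the two lists together must fit within what was offered.
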


