
List:       mesos-issues
Subject:    [jira] [Commented] (MESOS-3808) slave/containerizer/docker leaves orphan containers on restart of mesos-slave
From:       "Timothy Chen (JIRA)" <jira@apache.org>
Date:       2015-10-30 7:20:27
Message-ID: JIRA.12908349.1445978695000.105169.1446189627753@Atlassian.JIRA


    [ https://issues.apache.org/jira/browse/MESOS-3808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14982034#comment-14982034 ]

Timothy Chen commented on MESOS-3808:
-------------------------------------

Destroy is not where the problem is: destroy is meant to be called on container exit on a live slave. This bug occurs specifically during recovery, which is when recover is called.
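
For context, a rough sketch of the distinction between the two paths (illustrative only; the helpers and the reconciliation logic below are assumptions, not the actual src/slave/containerizer/docker.cpp):

{code}
// Illustrative sketch -- hypothetical code, not the Mesos source.
#include <iostream>
#include <set>
#include <string>

// Live-slave path: called when a container's executor exits; the
// containerizer runs `docker stop` and cleans up.
void destroy(const std::string& containerId)
{
  std::cout << "docker stop " << containerId << std::endl;
}

// Restart path: called once while the slave recovers. It has to
// reconcile what Docker reports as running against the checkpointed
// slave state and destroy anything the slave no longer knows about.
// If this matching goes wrong, nothing else ever stops the container.
void recover(const std::set<std::string>& runningContainerIds,
             const std::set<std::string>& checkpointedContainerIds)
{
  for (const std::string& id : runningContainerIds) {
    if (checkpointedContainerIds.count(id) == 0) {
      destroy(id); // Orphan: only recover() can catch this case.
    }
  }
}

int main()
{
  recover({"a2308dfc-ec2f-4687-ae92-f045dd2d3614",
           "77b1748e-f295-4eb5-9966-d7a3bba2fc31"},
          {"a2308dfc-ec2f-4687-ae92-f045dd2d3614"});
}
{code}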

> slave/containerizer/docker leaves orphan containers on restart of mesos-slave
> -----------------------------------------------------------------------------
> 
> Key: MESOS-3808
> URL: https://issues.apache.org/jira/browse/MESOS-3808
> Project: Mesos
> Issue Type: Bug
> Components: containerization, docker, slave
> Affects Versions: 0.25.0
> Environment: CoreOS. Running mesos-slave in a container.
> Reporter: Chris Fortier
> Assignee: Gilbert Song
> Original Estimate: 4h
> Remaining Estimate: 4h
> 
> We attempted to upgrade from Mesos 0.23 to 0.25, but noticed that Docker containers launched by Mesos were being orphaned rather than destroyed when the Mesos agent was restarted. Relevant log output:
> {noformat}
> I1027 20:36:22.343880 23004 docker.cpp:535] Recovering Docker containers
> I1027 20:36:22.517032 23008 docker.cpp:639] Recovering container 'a2308dfc-ec2f-4687-ae92-f045dd2d3614' for executor 'ubuntu.059ced51-7cea-11e5-a442-1ac2f22f38db' of framework 20151016-161150-1902412554-5050-1-0000
> I1027 20:36:22.517467 23008 docker.cpp:639] Recovering container '77b1748e-f295-4eb5-9966-d7a3bba2fc31' for executor 'ubuntu.059d1462-7cea-11e5-a442-1ac2f22f38db' of framework 20151016-161150-1902412554-5050-1-0000
> I1027 20:36:22.517817 23007 slave.cpp:4051] Sending reconnect request to executor ubuntu.059d1462-7cea-11e5-a442-1ac2f22f38db of framework 20151016-161150-1902412554-5050-1-0000 at executor(1)@10.131.100.57:40596
> I1027 20:36:22.518033 23007 slave.cpp:4051] Sending reconnect request to executor ubuntu.059ced51-7cea-11e5-a442-1ac2f22f38db of framework 20151016-161150-1902412554-5050-1-0000 at executor(1)@10.131.100.57:57469
> I1027 20:36:22.518038 23008 docker.cpp:1592] Executor for container 'a2308dfc-ec2f-4687-ae92-f045dd2d3614' has exited
> E1027 20:36:22.518070 23010 socket.hpp:174] Shutdown failed on fd=13: Transport endpoint is not connected [107]
> I1027 20:36:22.518084 23008 docker.cpp:1390] Destroying container 'a2308dfc-ec2f-4687-ae92-f045dd2d3614'
> I1027 20:36:22.518282 23008 docker.cpp:1592] Executor for container '77b1748e-f295-4eb5-9966-d7a3bba2fc31' has exited
> I1027 20:36:22.518324 23008 docker.cpp:1390] Destroying container '77b1748e-f295-4eb5-9966-d7a3bba2fc31'
> E1027 20:36:22.518357 23010 socket.hpp:174] Shutdown failed on fd=13: Transport endpoint is not connected [107]
> I1027 20:36:22.518360 23008 docker.cpp:1494] Running docker stop on container 'a2308dfc-ec2f-4687-ae92-f045dd2d3614'
> I1027 20:36:22.518489 23008 docker.cpp:1494] Running docker stop on container '77b1748e-f295-4eb5-9966-d7a3bba2fc31'
> I1027 20:36:22.518592 23005 slave.cpp:3433] Executor 'ubuntu.059ced51-7cea-11e5-a442-1ac2f22f38db' of framework 20151016-161150-1902412554-5050-1-0000 has terminated with unknown status
> I1027 20:36:22.519127 23005 slave.cpp:2717] Handling status update TASK_LOST (UUID: b07be363-433f-4a11-8c81-1f5787debc76) for task ubuntu.059ced51-7cea-11e5-a442-1ac2f22f38db of framework 20151016-161150-1902412554-5050-1-0000 from @0.0.0.0:0
> I1027 20:36:22.519263 23005 slave.cpp:3433] Executor 'ubuntu.059d1462-7cea-11e5-a442-1ac2f22f38db' of framework 20151016-161150-1902412554-5050-1-0000 has terminated with unknown status
> I1027 20:36:22.519300 23005 slave.cpp:2717] Handling status update TASK_LOST (UUID: 6a687305-78fc-48ec-b49a-8aeb4b42b3ac) for task ubuntu.059d1462-7cea-11e5-a442-1ac2f22f38db of framework 20151016-161150-1902412554-5050-1-0000 from @0.0.0.0:0
> W1027 20:36:22.519498 23003 docker.cpp:1002] Ignoring updating unknown container: a2308dfc-ec2f-4687-ae92-f045dd2d3614
> W1027 20:36:22.519611 23003 docker.cpp:1002] Ignoring updating unknown container: 77b1748e-f295-4eb5-9966-d7a3bba2fc31
> I1027 20:36:22.519691 23003 status_update_manager.cpp:322] Received status update TASK_LOST (UUID: b07be363-433f-4a11-8c81-1f5787debc76) for task ubuntu.059ced51-7cea-11e5-a442-1ac2f22f38db of framework 20151016-161150-1902412554-5050-1-0000
> I1027 20:36:22.519755 23003 status_update_manager.cpp:826] Checkpointing UPDATE for status update TASK_LOST (UUID: b07be363-433f-4a11-8c81-1f5787debc76) for task ubuntu.059ced51-7cea-11e5-a442-1ac2f22f38db of framework 20151016-161150-1902412554-5050-1-0000
> I1027 20:36:22.525867 23003 status_update_manager.cpp:322] Received status update TASK_LOST (UUID: 6a687305-78fc-48ec-b49a-8aeb4b42b3ac) for task ubuntu.059d1462-7cea-11e5-a442-1ac2f22f38db of framework 20151016-161150-1902412554-5050-1-0000
> I1027 20:36:22.525907 23003 status_update_manager.cpp:826] Checkpointing UPDATE for status update TASK_LOST (UUID: 6a687305-78fc-48ec-b49a-8aeb4b42b3ac) for task ubuntu.059d1462-7cea-11e5-a442-1ac2f22f38db of framework 20151016-161150-1902412554-5050-1-0000
> W1027 20:36:22.526645 23009 slave.cpp:2968] Dropping status update TASK_LOST (UUID: b07be363-433f-4a11-8c81-1f5787debc76) for task ubuntu.059ced51-7cea-11e5-a442-1ac2f22f38db of framework 20151016-161150-1902412554-5050-1-0000 sent by status update manager because the slave is in RECOVERING state
> W1027 20:36:22.529747 23007 slave.cpp:2968] Dropping status update TASK_LOST (UUID: 6a687305-78fc-48ec-b49a-8aeb4b42b3ac) for task ubuntu.059d1462-7cea-11e5-a442-1ac2f22f38db of framework 20151016-161150-1902412554-5050-1-0000 sent by status update manager because the slave is in RECOVERING state
> I1027 20:36:24.518846 23004 slave.cpp:2666] Cleaning up un-reregistered executors
> I1027 20:36:24.519011 23004 slave.cpp:4110] Finished recovery
> {noformat}
> Docker output:
> {noformat}
> CONTAINER ID        IMAGE               COMMAND                  CREATED              STATUS              PORTS   NAMES
> 8d0d69fe34d7        libmesos/ubuntu     "/bin/sh -c 'while s     About a minute ago   Up About a minute           mesos-bc7d28c1-81cd-4dfe-8c53-afa8fdfeb472-S14.a1492e45-2fce-4ca4-bd16-edcef439ca31
> e4344cfbcc6d        libmesos/ubuntu     "/bin/sh -c 'while s     About a minute ago   Up About a minute           mesos-bc7d28c1-81cd-4dfe-8c53-afa8fdfeb472-S14.c3624e67-7a27-4309-8aa4-365d3fd1bfe2
> 3ce690f3b872        libmesos/ubuntu     "/bin/sh -c 'while s     4 minutes ago        Up 4 minutes                mesos-bc7d28c1-81cd-4dfe-8c53-afa8fdfeb472-S14.a2308dfc-ec2f-4687-ae92-f045dd2d3614
> 5b4546d3087a        libmesos/ubuntu     "/bin/sh -c 'while s     4 minutes ago        Up 4 minutes                mesos-bc7d28c1-81cd-4dfe-8c53-afa8fdfeb472-S14.77b1748e-f295-4eb5-9966-d7a3bba2fc31
> {noformat}
> After digging into the issue, it seems the comment linked below might be the problem: https://github.com/apache/mesos/blob/master/src/slave/containerizer/docker.cpp#L97
> It appears that the recovery command is still sending only the containerId, not the frameworkId + containerId (see the sketch below).
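> A minimal sketch of the suspected mismatch (illustrative only; the name format is taken from the {{docker ps}} output above, while the matching logic is an assumption, not the actual Mesos code):
> {code}
> // Illustrative sketch -- hypothetical code, not the Mesos source.
> #include <iostream>
> #include <string>
> #include <vector>
>
> int main()
> {
>   // Names as reported by `docker ps`: the slave embeds more than the
>   // bare containerId after the "mesos-" prefix.
>   std::vector<std::string> names = {
>     "mesos-bc7d28c1-81cd-4dfe-8c53-afa8fdfeb472-S14.a2308dfc-ec2f-4687-ae92-f045dd2d3614",
>     "mesos-bc7d28c1-81cd-4dfe-8c53-afa8fdfeb472-S14.77b1748e-f295-4eb5-9966-d7a3bba2fc31"};
>
>   // If recovery looks a container up by "mesos-" + containerId alone,
>   // the extra identifiers before the '.' never match, so the container
>   // is treated as unknown and is never destroyed.
>   std::string lookup = "mesos-a2308dfc-ec2f-4687-ae92-f045dd2d3614";
>
>   for (const std::string& name : names) {
>     std::cout << name << " -> "
>               << (name == lookup ? "recovered" : "left running (orphan)")
>               << std::endl;
>   }
> }
> {code}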



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

