List: mesos-user
Subject: MPI on Mesos
From: Stratos Dimopoulos <stratos.dimopoulos@gmail.com>
Date: 2014-10-29 5:50:42
Message-ID: CAGWuDaFTgo9A_xpDiQSW4h_JjiXUVziTmYpBEKVkbq4jRuhR9w@mail.gmail.com
Hi,
I am having a couple of issues trying to run MPI over Mesos. I am using
Mesos 0.20.0 on Ubuntu 12.04 and MPICH2.
- I was able to successfully (?) run a helloworld MPI program, but the task
still appears as lost in the GUI. Here is the output from the MPI execution:
>> We've launched all our MPDs; waiting for them to come up
Got 1 mpd(s), running mpiexec
Running mpiexec
*** Hello world from processor euca-10-2-235-206, rank 0 out of 1
processors ***
mpiexec completed, calling mpdallexit euca-10-2-248-74_57995
Task 0 in state 5
A task finished unexpectedly, calling mpdexit on euca-10-2-248-74_57995
mpdroot: perror msg: No such file or directory
mpdroot: cannot connect to local mpd at: /tmp/mpd2.console_root
probable cause: no mpd daemon on this machine
possible cause: unix socket /tmp/mpd2.console_root has been removed
mpdexit (__init__ 1208): forked process failed; status=255
I1028 22:15:04.774554 4859 sched.cpp:747] Stopping framework
'20141028-203440-1257767434-5050-3638-0006'
2014-10-28 22:15:04,795:4819(0x7fd7b1422700):ZOO_INFO@zookeeper_close@2505:
Closing zookeeper sessionId=0x14959388d4e0020
And also in *executor stdout* I get:
sh -c 'mpd --noconsole --ncpus=1 --host=euca-10-2-248-74 --port=39237'
Command exited with status 127 (i.e. command not found)
and on *stderr*:
sh: 1 mpd: not found
I assume these messages appear in the executor's log files because, once
mpiexec completes, the task is finished and the mpd ring is no longer
running, so the executor complains about not finding the mpd command; under
normal operation this works fine.
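To rule out the other possible cause of the "mpd: not found" error, it is worth confirming that mpd is actually on the PATH the executor's shell sees on every slave. A minimal sketch of such a check (the helper name is mine, not from any Mesos or MPICH tool):

```shell
# check_cmd: print whether a command is visible on the current PATH.
# Running "check_cmd mpd" via ssh on each slave tells you whether the
# executor's "mpd: not found" reflects a real missing MPICH2 install.
check_cmd() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "OK: $1"
  else
    echo "MISSING: $1"
  fi
}

check_cmd sh    # sh is always present on a POSIX system
check_cmd mpd   # MISSING here means MPICH2 is not installed/visible locally
```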
- Another thing I would like to ask about is the procedure for running MPI
on Mesos. So far, with Spark and Hadoop on Mesos, I was used to having an
executor shared on HDFS, with no need to distribute the code to the slaves.
With MPI I had to distribute the helloworld executable to the slaves myself,
because having it on HDFS didn't work. Moreover, I was expecting the mpd
ring to be started by Mesos (in the same way that the Hadoop JobTracker is
started by Mesos in the Hadoop-on-Mesos implementations). Instead, I have to
run mpdboot first before I can run MPI on Mesos. Is this the expected
procedure, or am I missing something?
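The manual distribution I ended up doing amounts to something like the sketch below. The hostnames and target path are placeholders, and the copy command is parameterized only so it can be dry-run:

```shell
# distribute: copy an executable to /tmp on each listed slave.
# COPY defaults to scp; override it (e.g. COPY=echo) for a dry run.
COPY=${COPY:-scp}

distribute() {
  src=$1; shift
  for host in "$@"; do
    $COPY "$src" "$host:/tmp/$(basename "$src")"
  done
}

# Example (hostnames from my cluster):
#   distribute ./helloworld euca-10-2-248-74 euca-10-2-235-206
```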
- Finally, in order to make MPI work I had to install mesos.interface with
pip and manually copy the native directory from python/dist-packages
(native doesn't exist in the pip repo). Then I realized that the
mpiexec-mesos.in file already does all of that. I can update the README to
be a little clearer if you want; I am guessing someone else might also get
confused by this.
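For reference, the manual workaround above amounts to roughly the following setup fragment. The dist-packages source path is an assumption based on my Ubuntu 12.04 layout; adjust it for your installation:

```shell
# Install the pure-Python Mesos bindings from pip, then copy the compiled
# native module out of the system Mesos installation, since mesos.native
# is not published on PyPI. Source path is an assumption (Ubuntu 12.04).
pip install mesos.interface
sudo cp -r /usr/lib/python2.7/dist-packages/mesos/native \
     "$(python -c 'import mesos, os; print(os.path.dirname(mesos.__file__))')"
```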
thanks,
Stratos