List: mesos-user
Subject: MPI on Mesos
From: Stratos Dimopoulos <stratos.dimopoulos@gmail.com>
Date: 2014-10-29 5:50:42
Message-ID: CAGWuDaFTgo9A_xpDiQSW4h_JjiXUVziTmYpBEKVkbq4jRuhR9w@mail.gmail.com
Hi,
I am having a couple of issues trying to run MPI over Mesos. I am using
Mesos 0.20.0 on Ubuntu 12.04 and MPICH2.
- I was able to successfully (?) run a helloworld MPI program, but the task
still appears as lost in the GUI. Here is the output from the MPI execution:
>> We've launched all our MPDs; waiting for them to come up
Got 1 mpd(s), running mpiexec
Running mpiexec
*** Hello world from processor euca-10-2-235-206, rank 0 out of 1
processors ***
mpiexec completed, calling mpdallexit euca-10-2-248-74_57995
Task 0 in state 5
A task finished unexpectedly, calling mpdexit on euca-10-2-248-74_57995
mpdroot: perror msg: No such file or directory
mpdroot: cannot connect to local mpd at: /tmp/mpd2.console_root
probable cause: no mpd daemon on this machine
possible cause: unix socket /tmp/mpd2.console_root has been removed
mpdexit (__init__ 1208): forked process failed; status=255
I1028 22:15:04.774554 4859 sched.cpp:747] Stopping framework
'20141028-203440-1257767434-5050-3638-0006'
2014-10-28 22:15:04,795:4819(0x7fd7b1422700):ZOO_INFO@zookeeper_close@2505:
Closing zookeeper sessionId=0x14959388d4e0020
And also in *executor stdout* I get:
sh -c 'mpd --noconsole --ncpus=1 --host=euca-10-2-248-74 --port=39237'
Command exited with status 127 (i.e. command not found)
and on *stderr*:
sh: 1 mpd: not found
I assume these messages appear in the executor's log files because, once
mpiexec completes, the task is finished and the mpd ring is no longer
running, so the executor complains about not finding the mpd command; under
normal operation this works fine.
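To rule out the other possible cause of the "mpd: not found" error, it is worth confirming that mpd is actually on the PATH the executor's shell sees on every slave. A minimal sketch of such a check (the helper name is mine, not from any Mesos or MPICH tool):

```shell
# check_cmd: print whether a command is visible on the current PATH.
# Running "check_cmd mpd" via ssh on each slave tells you whether the
# executor's "mpd: not found" reflects a real missing MPICH2 install.
check_cmd() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "OK: $1"
  else
    echo "MISSING: $1"
  fi
}

check_cmd sh    # sh is always present on a POSIX system
check_cmd mpd   # MISSING here means MPICH2 is not installed/visible locally
```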
- Another thing I would like to ask about is the procedure for running MPI
on Mesos. So far, with Spark and Hadoop on Mesos, I was used to having an
executor shared on HDFS, with no need to distribute the code to the slaves.
With MPI I had to distribute the helloworld executable to the slaves myself,
because having it on HDFS didn't work. Moreover, I was expecting the mpd
ring to be started by Mesos (in the same way that the Hadoop JobTracker is
started by Mesos in the Hadoop-on-Mesos implementations). Instead, I have to
run mpdboot first before I can run MPI on Mesos. Is this the expected
procedure, or am I missing something?
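The manual distribution I ended up doing amounts to something like the sketch below. The hostnames and target path are placeholders, and the copy command is parameterized only so it can be dry-run:

```shell
# distribute: copy an executable to /tmp on each listed slave.
# COPY defaults to scp; override it (e.g. COPY=echo) for a dry run.
COPY=${COPY:-scp}

distribute() {
  src=$1; shift
  for host in "$@"; do
    $COPY "$src" "$host:/tmp/$(basename "$src")"
  done
}

# Example (hostnames from my cluster):
#   distribute ./helloworld euca-10-2-248-74 euca-10-2-235-206
```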
- Finally, in order to make MPI work I had to install mesos.interface with
pip and manually copy the native directory from python/dist-packages
(native doesn't exist in the pip repo). Then I realized that the
mpiexec-mesos.in file already does all of that. I can update the README to
be a little clearer if you want; I am guessing someone else might also get
confused by this.
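For reference, the manual workaround above amounts to roughly the following setup fragment. The dist-packages source path is an assumption based on my Ubuntu 12.04 layout; adjust it for your installation:

```shell
# Install the pure-Python Mesos bindings from pip, then copy the compiled
# native module out of the system Mesos installation, since mesos.native
# is not published on PyPI. Source path is an assumption (Ubuntu 12.04).
pip install mesos.interface
sudo cp -r /usr/lib/python2.7/dist-packages/mesos/native \
     "$(python -c 'import mesos, os; print(os.path.dirname(mesos.__file__))')"
```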
thanks,
Stratos