List: npaci-rocks-discussion
Subject: Re: [Rocks-Discuss] SGE problem
From: Alex Hempy <Alex.Hempy () raytheon ! com>
Date: 2011-01-25 17:34:48
Message-ID: OFDBD745AA.16F41034-ON88257823.005C8D7A-88257823.006092B3 () mck ! us ! ray ! com
Hi Greg,
Thank you for the reply. Here is the output of #qstat -f:
queuename                      qtype resv/used/tot. load_avg arch       states
---------------------------------------------------------------------------------
all.q@compute-0-0.local        BIP   0/8/8          0.00     lx26-amd64
---------------------------------------------------------------------------------
all.q@compute-0-1.local        BIP   0/8/8          0.00     lx26-amd64
---------------------------------------------------------------------------------
all.q@compute-0-2.local        BIP   0/8/8          0.00     lx26-amd64
---------------------------------------------------------------------------------
all.q@compute-0-3.local        BIP   0/8/8          0.00     lx26-amd64
---------------------------------------------------------------------------------
all.q@compute-0-4.local        BIP   0/8/8          0.00     lx26-amd64
---------------------------------------------------------------------------------
all.q@compute-0-5.local        BIP   0/8/8          0.00     lx26-amd64
---------------------------------------------------------------------------------
all.q@compute-0-6.local        BIP   0/8/8          0.00     lx26-amd64
---------------------------------------------------------------------------------
all.q@compute-0-7.local        BIP   0/8/8          0.00     lx26-amd64
---------------------------------------------------------------------------------
all.q@compute-0-8.local        BIP   0/8/8          0.00     lx26-amd64
---------------------------------------------------------------------------------
all.q@compute-0-9.local        BIP   0/8/8          0.00     lx26-amd64
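One thing worth noting in the output above: every queue instance reports 0/8/8 (resv/used/tot.) with a load of 0.00, i.e. all slots are marked in use while the nodes sit idle. A small sketch (not part of the original thread) for flagging that pattern when scanning `qstat -f` output:

```shell
# flag_full_idle: read `qstat -f` queue lines on stdin and report
# instances where every slot is marked used but the load average
# is near zero (often a sign of stuck or orphaned jobs).
flag_full_idle() {
  awk '$3 ~ /\// {
    split($3, s, "/")
    if (s[2] == s[3] && $4 + 0 < 0.1)
      print $1 ": all " s[3] " slots used, load " $4
  }'
}

# Sample lines taken from the qstat output above; on a live
# cluster you would pipe `qstat -f` straight into the function.
sample='all.q@compute-0-0.local BIP 0/8/8 0.00 lx26-amd64
all.q@compute-0-9.local BIP 0/8/8 0.00 lx26-amd64'

flag_full_idle <<<"$sample"
```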
Here is a little more background: the cluster originally had a public
connection on eth1 and was assigned an FQDN. We eventually had to
disconnect it from the public network, so eth1 is now unconnected. I'm not
an expert in networking or in how Ethernet is configured within Rocks, but
could this be why SGE is having trouble communicating with the compute
nodes? I can ssh into each node, and exports are working on all of the
nodes.
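Since ssh works everywhere, a sketch of how one could check whether each node still resolves and whether sge_execd still answers from the head node (a hedged example, not from the original thread; `qping` and port 6445 are the stock SGE defaults and may differ on a given install):

```shell
# check_nodes CMD NODE... : run CMD against each node name and
# report the nodes where it fails. Usable with `getent hosts`
# (name resolution) or any other single-argument probe command.
check_nodes() {
  local cmd="$1"; shift
  for node in "$@"; do
    $cmd "$node" >/dev/null 2>&1 || echo "$node: FAILED"
  done
}

# On the head node of this cluster it might be invoked as:
#   check_nodes "getent hosts" compute-0-{0..9}.local
# and the execd side probed with the SGE default execd port:
#   for n in compute-0-{0..9}.local; do qping -info "$n" 6445 execd 1; done
```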
We are using IPython and HFSS on our cluster. Both worked successfully
before we took it off the network, but now both are having trouble -
again, I'm not sure whether taking it off the network caused the problem.
HFSS will start on one compute node and run correctly until it partitions
the problem and tries to distribute it to the other nodes.
Here is the relevant log information for both programs:
HFSS error:
[error] Project:Test1, Design:HFSSDesign1 (DrivenModal), Unable to locate
or start COM engine on 'compute-0-3.local' : no response from COM engine,
timed out
[error] Project:Test1, Design:HFSSDesign1 (DrivenModal), Distributed Solve
Error: Required task failed on machine 'compute-0-3.local'. Aborting
distributed solve
Python error:
startin...
compute-0-9.local
40
starting ipengines...
error: error: ending connection before all data received
error:
error reading job context from "qlogin_starter"
--------------------------------------------------------------------------
A daemon (pid 7774) died unexpectedly with status 1 while attempting
to launch so we are aborting.
There may be more information reported by the environment (see above).
This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpiexec noticed that the job aborted, but has no info as to the process
that caused that situation.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpiexec was unable to cleanly terminate the daemons on the nodes shown
below. Additional manual cleanup may be required - please refer to
the "orte-clean" tool for assistance.
--------------------------------------------------------------------------
compute-0-5.local - daemon did not report back when launched
compute-0-0.local - daemon did not report back when launched
compute-0-4.local - daemon did not report back when launched
compute-0-1.local - daemon did not report back when launched
compute-0-3.local - daemon did not report back when launched
compute-0-7.local - daemon did not report back when launched
compute-0-2.local - daemon did not report back when launched
compute-0-8.local - daemon did not report back when launched
compute-0-6.local - daemon did not report back when launched
error: error: ending connection before all data received
error:
error reading job context from "qlogin_starter"
error: error: ending connection before all data received
error:
error reading job context from "qlogin_starter"
error: error: ending connection before all data received
error:
error reading job context from "qlogin_starter"
error: error: ending connection before all data received
error:
error reading job context from "qlogin_starter"
error: error: ending connection before all data received
error:
error reading job context from "qlogin_starter"
error: error: ending connection before all data received
error:
error reading job context from "qlogin_starter"
error: error: ending connection before all data received
error:
error reading job context from "qlogin_starter"
error: error: ending connection before all data received
error:
error reading job context from "qlogin_starter"
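The repeated "error reading job context from qlogin_starter" lines point at SGE's interactive/rsh-style job launch path, which is what tightly coupled jobs like these ride on. A sketch (not from the original thread) of pulling the relevant settings out of the global configuration; the "builtin" values shown in the sample are only illustrative of what a stock SGE 6.2 config looks like:

```shell
# show_interactive_conf: filter qconf output down to the settings
# that control how qlogin/qrsh-style jobs are launched on the
# execution hosts -- the machinery behind "qlogin_starter".
show_interactive_conf() {
  grep -iE 'qlogin|rlogin|rsh'
}

# On the head node:
#   qconf -sconf | show_interactive_conf
# Illustrative sample of the lines a stock config would yield:
sample_conf='qlogin_command builtin
qlogin_daemon builtin
rsh_command builtin
rsh_daemon builtin'

show_interactive_conf <<<"$sample_conf"
```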
Again, I greatly appreciate the help.
Thanks,
Alex
From: Greg Bruno <greg.bruno@gmail.com>
To: Discussion of Rocks Clusters <npaci-rocks-discussion@sdsc.edu>
Date: 01/24/2011 03:52 PM
Subject: Re: [Rocks-Discuss] SGE problem
Sent by: npaci-rocks-discussion-bounces@sdsc.edu
On Mon, Jan 24, 2011 at 7:42 AM, Alex Hempy <Alex.Hempy@raytheon.com>
wrote:
> Hi list,
>
> I am having trouble getting SGE (configured with $round_robin) to
> distribute a job across more than one compute node on my 5.3 Rocks
> cluster. SGE will only start the processing engines on one compute node,
> but it can't seem to talk to/utilize any other compute nodes. The problem
> occurs with two different programs (both of which I have had working
> before on another 5.3 cluster) so I can't imagine it is the software I'm
> using. The error in the script log file reads "error reading job context
> from "qlogin_starter." I'm not sure where else to start looking for
> problems... any help or direction is much appreciated.
what is the output of:
# qstat -f
- gb