List: npaci-rocks-discussion
Subject: Re: [Rocks-Discuss] SGE problem
From: Alex Hempy <Alex.Hempy () raytheon ! com>
Date: 2011-01-25 17:34:48
Message-ID: OFDBD745AA.16F41034-ON88257823.005C8D7A-88257823.006092B3 () mck ! us ! ray ! com
Hi Greg,
Thank you for the reply. Here is the output of #qstat -f:
queuename                      qtype resv/used/tot. load_avg arch       states
---------------------------------------------------------------------------------
all.q@compute-0-0.local        BIP   0/8/8          0.00     lx26-amd64
---------------------------------------------------------------------------------
all.q@compute-0-1.local        BIP   0/8/8          0.00     lx26-amd64
---------------------------------------------------------------------------------
all.q@compute-0-2.local        BIP   0/8/8          0.00     lx26-amd64
---------------------------------------------------------------------------------
all.q@compute-0-3.local        BIP   0/8/8          0.00     lx26-amd64
---------------------------------------------------------------------------------
all.q@compute-0-4.local        BIP   0/8/8          0.00     lx26-amd64
---------------------------------------------------------------------------------
all.q@compute-0-5.local        BIP   0/8/8          0.00     lx26-amd64
---------------------------------------------------------------------------------
all.q@compute-0-6.local        BIP   0/8/8          0.00     lx26-amd64
---------------------------------------------------------------------------------
all.q@compute-0-7.local        BIP   0/8/8          0.00     lx26-amd64
---------------------------------------------------------------------------------
all.q@compute-0-8.local        BIP   0/8/8          0.00     lx26-amd64
---------------------------------------------------------------------------------
all.q@compute-0-9.local        BIP   0/8/8          0.00     lx26-amd64
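One thing worth noting in the output above: every queue instance reports 0/8/8 (resv/used/tot.) with a load of 0.00, i.e. all slots are marked in use while the nodes sit idle. A small sketch (not part of the original thread) for flagging that pattern when scanning `qstat -f` output:

```shell
# flag_full_idle: read `qstat -f` queue lines on stdin and report
# instances where every slot is marked used but the load average
# is near zero (often a sign of stuck or orphaned jobs).
flag_full_idle() {
  awk '$3 ~ /\// {
    split($3, s, "/")
    if (s[2] == s[3] && $4 + 0 < 0.1)
      print $1 ": all " s[3] " slots used, load " $4
  }'
}

# Sample lines taken from the qstat output above; on a live
# cluster you would pipe `qstat -f` straight into the function.
sample='all.q@compute-0-0.local BIP 0/8/8 0.00 lx26-amd64
all.q@compute-0-9.local BIP 0/8/8 0.00 lx26-amd64'

flag_full_idle <<<"$sample"
```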
Here is a little more background: the cluster originally had a public
connection on eth1 and was assigned an FQDN. We eventually had to
disconnect it from the public network, so eth1 is now unconnected. I'm not
an expert in networking or in how Ethernet is configured within Rocks, but
could this be why SGE is having trouble communicating with the compute
nodes? I can ssh into each node, and exports are working on all of the
nodes.
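Since ssh works everywhere, a sketch of how one could check whether each node still resolves and whether sge_execd still answers from the head node (a hedged example, not from the original thread; `qping` and port 6445 are the stock SGE defaults and may differ on a given install):

```shell
# check_nodes CMD NODE... : run CMD against each node name and
# report the nodes where it fails. Usable with `getent hosts`
# (name resolution) or any other single-argument probe command.
check_nodes() {
  local cmd="$1"; shift
  for node in "$@"; do
    $cmd "$node" >/dev/null 2>&1 || echo "$node: FAILED"
  done
}

# On the head node of this cluster it might be invoked as:
#   check_nodes "getent hosts" compute-0-{0..9}.local
# and the execd side probed with the SGE default execd port:
#   for n in compute-0-{0..9}.local; do qping -info "$n" 6445 execd 1; done
```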
We are using IPython and HFSS on our cluster. Both worked successfully
before we took it off the network, but now both are having trouble -
again, I'm not sure whether taking it off the network caused the problem.
HFSS will start on one compute node and run correctly until it partitions
the problem and tries to distribute it to the other nodes.
Here is the relevant log information for both programs:
HFSS error:
[error] Project:Test1, Design:HFSSDesign1 (DrivenModal), Unable to locate
or start COM engine on 'compute-0-3.local' : no response from COM engine,
timed out
[error] Project:Test1, Design:HFSSDesign1 (DrivenModal), Distributed Solve
Error: Required task failed on machine 'compute-0-3.local'. Aborting
distributed solve
Python error:
startin...
compute-0-9.local
40
starting ipengines...
error: error: ending connection before all data received
error:
error reading job context from "qlogin_starter"
--------------------------------------------------------------------------
A daemon (pid 7774) died unexpectedly with status 1 while attempting
to launch so we are aborting.
There may be more information reported by the environment (see above).
This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpiexec noticed that the job aborted, but has no info as to the process
that caused that situation.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpiexec was unable to cleanly terminate the daemons on the nodes shown
below. Additional manual cleanup may be required - please refer to
the "orte-clean" tool for assistance.
--------------------------------------------------------------------------
compute-0-5.local - daemon did not report back when launched
compute-0-0.local - daemon did not report back when launched
compute-0-4.local - daemon did not report back when launched
compute-0-1.local - daemon did not report back when launched
compute-0-3.local - daemon did not report back when launched
compute-0-7.local - daemon did not report back when launched
compute-0-2.local - daemon did not report back when launched
compute-0-8.local - daemon did not report back when launched
compute-0-6.local - daemon did not report back when launched
error: error: ending connection before all data received
error:
error reading job context from "qlogin_starter"
error: error: ending connection before all data received
error:
error reading job context from "qlogin_starter"
error: error: ending connection before all data received
error:
error reading job context from "qlogin_starter"
error: error: ending connection before all data received
error:
error reading job context from "qlogin_starter"
error: error: ending connection before all data received
error:
error reading job context from "qlogin_starter"
error: error: ending connection before all data received
error:
error reading job context from "qlogin_starter"
error: error: ending connection before all data received
error:
error reading job context from "qlogin_starter"
error: error: ending connection before all data received
error:
error reading job context from "qlogin_starter"
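The repeated "error reading job context from qlogin_starter" lines point at SGE's interactive/rsh-style job launch path, which is what tightly coupled jobs like these ride on. A sketch (not from the original thread) of pulling the relevant settings out of the global configuration; the "builtin" values shown in the sample are only illustrative of what a stock SGE 6.2 config looks like:

```shell
# show_interactive_conf: filter qconf output down to the settings
# that control how qlogin/qrsh-style jobs are launched on the
# execution hosts -- the machinery behind "qlogin_starter".
show_interactive_conf() {
  grep -iE 'qlogin|rlogin|rsh'
}

# On the head node:
#   qconf -sconf | show_interactive_conf
# Illustrative sample of the lines a stock config would yield:
sample_conf='qlogin_command builtin
qlogin_daemon builtin
rsh_command builtin
rsh_daemon builtin'

show_interactive_conf <<<"$sample_conf"
```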
Again, I greatly appreciate the help.
Thanks,
Alex
From: Greg Bruno <greg.bruno@gmail.com>
To: Discussion of Rocks Clusters <npaci-rocks-discussion@sdsc.edu>
Date: 01/24/2011 03:52 PM
Subject: Re: [Rocks-Discuss] SGE problem
Sent by: npaci-rocks-discussion-bounces@sdsc.edu
On Mon, Jan 24, 2011 at 7:42 AM, Alex Hempy <Alex.Hempy@raytheon.com>
wrote:
> Hi list,
>
> I am having trouble getting SGE (configured with $round_robin) to
> distribute a job across more than one compute node on my 5.3 Rocks
> cluster. SGE will only start the processing engines on one compute node,
> but it can't seem to talk to/utilize any other compute nodes. The problem
> occurs with two different programs (both of which I have had working
> before on another 5.3 cluster) so I can't imagine it is the software I'm
> using. The error in the script log file reads "error reading job context
> from "qlogin_starter." I'm not sure where else to start looking for
> problems... any help or direction is much appreciated.
what is the output of:
# qstat -f
- gb