List:       beowulf
Subject:    Re: [Beowulf] Questions about a large job
From:       Bogdan Costescu <Bogdan.Costescu () iwr ! uni-heidelberg ! de>
Date:       2006-04-18 19:35:03
Message-ID: Pine.LNX.4.44.0604182123460.14562-100000 () kenzo ! iwr ! uni-heidelberg ! de

On Tue, 18 Apr 2006, Leandro Tavares Carneiro wrote:

> The MPI used was LAM-MPI. I have run some tests with 10 nodes and it
> runs well. But, when I tried to run with 2296 CPUs, the job won't start.

Are you able to run a simple "hello world" test? If not, you might be
hitting the per-process descriptor limit, as each process will try to
open a TCP connection to every other process. In that case you should
still be able to run a job on something like 500 nodes (= 1000
processes, slightly fewer than the maximum of 1024 descriptors per
process).
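
The arithmetic behind this can be sketched in a few lines of Python. This
is only a rough estimate, assuming a fully connected TCP mesh as described
above; the function names and the "overhead" allowance of 16 descriptors
(stdin/stdout/stderr, log files, libraries) are hypothetical and not
taken from LAM-MPI itself.

```python
import resource


def descriptors_needed(num_processes, overhead=16):
    """Rough per-process estimate: one socket per peer plus some overhead.

    The overhead value is an assumed allowance, not a measured figure.
    """
    return (num_processes - 1) + overhead


def fits_in_limit(num_processes, limit=None, overhead=16):
    """Return True if the estimate fits under the descriptor limit.

    With limit=None, the current soft RLIMIT_NOFILE of this process is used.
    """
    if limit is None:
        limit, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if limit == resource.RLIM_INFINITY:
        return True
    return descriptors_needed(num_processes, overhead) <= limit


if __name__ == "__main__":
    # Compare the job size that works with the one that fails to start.
    for n in (1000, 2296):
        print(n, descriptors_needed(n), fits_in_limit(n, limit=1024))
```

Under these assumptions a 1000-process job stays just below 1024
descriptors per process, while a 2296-process job needs more than twice
that.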

> Various errors happened, one for each try. The Torque version installed
> is 2.0.0p8 and is working fine with other larger jobs, with 1000 CPUs.

This just confirms the suspicion I expressed above: 1000 processes stay
under the 1024 descriptor limit, while 2296 do not.

To raise the limit on a Red Hat-like system, add a line like:

*	-	nofile	4096

to /etc/security/limits.conf. The "-" in the second column sets both the
soft and the hard limit.
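
To check whether the new limit is actually in effect, something like the
following Python sketch could be run on a compute node. It only reads the
limits of the process it runs in; the helper names and the 4096 target are
illustrative. limits.conf is normally applied at login, so the check is
best run from a fresh session, and daemons that were started earlier may
still carry the old limit.

```python
import resource


def nofile_limits():
    """Return the (soft, hard) open-file limits of the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def limit_is_at_least(target):
    """Return True if the soft open-file limit is at least `target`."""
    soft, _hard = nofile_limits()
    return soft == resource.RLIM_INFINITY or soft >= target


if __name__ == "__main__":
    soft, hard = nofile_limits()
    print("soft limit:", soft, "hard limit:", hard)
    # 4096 matches the value in the example limits.conf line.
    print("at least 4096:", limit_is_at_least(4096))
```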

-- 
Bogdan Costescu

IWR - Interdisziplinaeres Zentrum fuer Wissenschaftliches Rechnen
Universitaet Heidelberg, INF 368, D-69120 Heidelberg, GERMANY
Telephone: +49 6221 54 8869, Telefax: +49 6221 54 8868
E-mail: Bogdan.Costescu@IWR.Uni-Heidelberg.De
