List: beowulf
Subject: Re: [Beowulf] Questions about a large job
From: Bogdan Costescu <Bogdan.Costescu () iwr ! uni-heidelberg ! de>
Date: 2006-04-18 19:35:03
Message-ID: Pine.LNX.4.44.0604182123460.14562-100000 () kenzo ! iwr ! uni-heidelberg ! de
On Tue, 18 Apr 2006, Leandro Tavares Carneiro wrote:
> The MPI used was LAM-MPI. I have run some tests with 10 nodes and it
> runs well. But, when I tried to run with 2296 CPUs, the job won't start.
Are you able to run a simple "hello world" test? If not, you might be
hitting the per-process file descriptor limit, since each process tries
to open a TCP connection to every other process. In that case you should
still be able to run a job on something like 500 nodes (= 1000
processes, slightly fewer than the 1024 descriptors allowed per
process by default).
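
The arithmetic behind this can be sketched in a few lines of Python. This
is only an illustration, not something from LAM-MPI itself: the function
names are made up, and the overhead of 8 descriptors (stdio, daemon
connections and so on) is a guess rather than a measured value.

```python
import resource

def descriptors_needed(nprocs, overhead=8):
    """Rough per-process estimate: one TCP socket per peer process,
    plus an assumed overhead for stdio and daemon connections."""
    return (nprocs - 1) + overhead

def fits(nprocs, limit, overhead=8):
    """True if the estimated descriptor count stays within the limit."""
    return descriptors_needed(nprocs, overhead) <= limit

if __name__ == "__main__":
    # Soft limit of the current process (what "ulimit -n" reports).
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    limit = float("inf") if soft == resource.RLIM_INFINITY else soft
    for n in (1000, 2296):
        print(n, "processes need about", descriptors_needed(n),
              "descriptors each; fits within", limit, ":", fits(n, limit))
```

With a limit of 1024, the estimate for 1000 processes still fits, while
the one for 2296 processes does not.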
> Various errors happened, one for each try. The Torque version installed
> is 2.0.0p8 and is working fine with other larger jobs, with 1000 CPUs.
This just confirms my suspicion above: 1000 processes still fit under
the 1024-descriptor limit, while 2296 do not.
To change the limits on a Red Hat-like system, add a line like:

    * - nofile 4096

to /etc/security/limits.conf.
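
To check from inside a process that the new limit is actually in effect,
something like the following Python sketch could be used. It reads the
descriptor limits with the standard resource module and, if the soft
limit is below the requested value, raises it as far as the hard limit
allows. The function name and the 4096 target are only illustrative.

```python
import resource

def raise_soft_nofile(target):
    """Raise the soft descriptor limit of this process towards target,
    never beyond the hard limit. Returns the resulting soft limit."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft == resource.RLIM_INFINITY:
        return soft  # already unlimited, nothing to do
    if hard != resource.RLIM_INFINITY:
        target = min(target, hard)  # an unprivileged process cannot exceed hard
    if target > soft:
        resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)[0]

if __name__ == "__main__":
    print("limits before:", resource.getrlimit(resource.RLIMIT_NOFILE))
    print("soft limit now:", raise_soft_nofile(4096))
```

If the reported hard limit is still 1024, the limits.conf change has not
reached that process.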
--
Bogdan Costescu
IWR - Interdisziplinaeres Zentrum fuer Wissenschaftliches Rechnen
Universitaet Heidelberg, INF 368, D-69120 Heidelberg, GERMANY
Telephone: +49 6221 54 8869, Telefax: +49 6221 54 8868
E-mail: Bogdan.Costescu@IWR.Uni-Heidelberg.De
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit \
http://www.beowulf.org/mailman/listinfo/beowulf