List:       beowulf
Subject:    Re: [Beowulf] OOM errors when running HPL
From:       Prentice Bisbal <prentice () ias ! edu>
Date:       2008-12-22 22:12:39
Message-ID: 495010D7.4090802 () ias ! edu

Skylar Thompson wrote:
> Prentice Bisbal wrote:
> > I've got a new problem with my cluster. Some of this problem may be with
> > my queuing system (SGE), but I figured I'd post here first.
> > 
> > I've been using HPL to test my new cluster. I generally run a small
> > problem size (Ns=60000) so the job only runs 15-20 minutes. Last night,
> > I upped the problem size by a factor of 10 to Ns=600000. Shortly after
> > submitting the job, half the nodes were shown as down in Ganglia.
> > 
> > I killed the job with qdel, and the majority of the nodes came back, but
> > about 1/3 did not. When I came in this morning, there were kernel
> > panic/OOM type messages on the consoles of the systems that never came
> > back.
> > 
> > I used to run HPL jobs much bigger than this on my cluster w/o a
> > problem. There's nothing I actively changed, but there might have been
> > some updates to the OS (kernel, libs, etc.) since the last time I ran a
> > job this big. Any ideas where I should begin looking?
> 
> I've run into similar problems, and traced it to the way Linux
> overcommits RAM. What are your vm.overcommit_memory and
> vm.overcommit_ratio sysctls set to, and how much swap and RAM do the
> nodes have?
> 

I found the problem - it was me. I never ran HPL problems with Ns=600k.
The largest job I ran was ~320k. I figured this out after checking my
notes. Sorry for the trouble.
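
For the archives, a quick back-of-the-envelope check makes the difference
obvious (assuming the usual double-precision HPL coefficient matrix of
N x N doubles, i.e. roughly 8*N^2 bytes across the whole cluster):

    Ns=320000:  8 * 320000^2 bytes  ~= 0.8 TB
    Ns=600000:  8 * 600000^2 bytes  ~= 2.9 TB

More than tripling the memory footprint was more than enough to push the
nodes into swap and then into the OOM killer.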

However, I did want to configure my systems to handle requests for more
memory more gracefully, so I added this to my sysctl.conf file (thanks
for the reminder, Skylar!):

vm.overcommit_memory=2
vm.overcommit_ratio=100
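
To pick the new values up without a reboot, something like this should
do it on each node (assuming the stock procps sysctl; the /proc entries
are the same values read directly):

    sysctl -p                            # reload /etc/sysctl.conf
    sysctl vm.overcommit_memory          # should now report 2
    cat /proc/sys/vm/overcommit_ratio    # should now report 100

With overcommit_memory=2 and overcommit_ratio=100, the kernel caps total
committed memory at swap plus 100% of RAM, so an allocation that would
blow past that fails immediately instead of triggering the OOM killer
later on.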

I already use these settings on many of my other computational servers
to prevent OOM crashes, but I forgot to add them to my cluster nodes.

Thanks to everyone for the replies.

-- 
Prentice
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf

