'Re: Taroon + numa question ?'


List:       taroon-list
Subject:    Re: Taroon + numa question ?
From:       "Paul Krizak" <paul.krizak () amd ! com>
Date:       2006-11-28 17:41:27
Message-ID: 456C74C7.2040401 () amd ! com

Unfortunately, RHEL3's NUMA implementation is extraordinarily broken.

We've been working with Red Hat for many months now trying to fix it, 
and have finally given up and are waiting for RHEL5 to implement correct 
NUMA support for our CPUs.

As far as we can tell, here's what happens for various OS/NUMA combinations:

RHEL3, U<7 + NUMA:
* Processes that grow larger than one "NUMA node", i.e. the memory 
attached to one node, will dip into swap instead of using memory from 
other nodes.
* The kernel has no clue where the memory is mapped in relation to the 
CPU cores, and so makes dumb decisions on where to put processes.  Using 
cpuset helps, but is impractical for a batch compute node.

RHEL3, U>=7 + NUMA:
* Processes that grow larger than one "NUMA node" will use memory from 
other nodes, but at a SEVERE performance penalty (though not as great as 
using swap).
* The kernel still has no clue where the memory is mapped to CPU cores.

RHEL4, U>=2 + NUMA:
* Memory allocation works fine.  No significant performance penalty as a 
process grows beyond one node.
* The kernel is still clueless about where to put processes.
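
For anyone who wants to place processes by hand, here is a rough sketch of 
what "knowing which cores belong to which node" looks like. It assumes a 
newer kernel that exposes /sys/devices/system/node/nodeN/cpulist (RHEL3 does 
not have this) and a Python with os.sched_setaffinity. The function names are 
made up for illustration, and this only pins CPUs; it does not set any 
memory policy.

```python
import os

def parse_cpulist(text):
    """Expand a kernel cpulist string such as "0-3,8" into sorted CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue  # empty file or stray comma
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)

def node_cpus(node, sysfs_root="/sys/devices/system/node"):
    """Read the CPUs attached to one NUMA node (assumed sysfs layout)."""
    path = os.path.join(sysfs_root, "node%d" % node, "cpulist")
    with open(path) as f:
        return parse_cpulist(f.read())

def pin_to_node(node, pid=0):
    """Restrict pid (0 = the calling process) to the CPUs of one node."""
    os.sched_setaffinity(pid, node_cpus(node))
```

Calling pin_to_node(0) at the start of a job would keep it on node 0's 
cores. Like cpuset, this is manual placement, so it has the same 
practicality problem on a batch compute node.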

Given the poor luck we've had with NUMA, our compute nodes run with the 
following configuration options, and perform (on average) better than 
with NUMA enabled:

* ACPI SRAT table disabled
* Node Interleaving enabled
* Bank Interleaving enabled

* "numa=off" on the kernel command line
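
To confirm that a node actually booted with that option, something like the 
following should do. It is a minimal sketch that assumes the running 
kernel's command line can be read from /proc/cmdline; the function names are 
hypothetical. It cannot check the BIOS settings above, only the kernel 
parameter.

```python
def numa_disabled(cmdline):
    """Return True if the command line carries the exact token numa=off."""
    # Compare whole tokens so that e.g. "numa=offline" does not match.
    return "numa=off" in cmdline.split()

def read_cmdline(path="/proc/cmdline"):
    """Read the kernel command line of the running system."""
    with open(path) as f:
        return f.read()

if __name__ == "__main__":
    state = "off" if numa_disabled(read_cmdline()) else "not disabled"
    print("NUMA is %s on the kernel command line" % state)
```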

Various hardware and software vendors balk at the disabling of NUMA, but 
our internal benchmarks don't lie -- NUMA is way slower (in general) in 
RHEL3 and RHEL4 than non-NUMA in RHEL3 and RHEL4 on the Opteron platform.

Paul Krizak                         5900 E. Ben White Blvd. MS 625
Advanced Micro Devices              Austin, TX  78741
Linux/Unix Systems Engineering      Phone: (512) 602-8775
Silicon Design Division             Cell:  (512) 791-0686

Stanley, Jon wrote:
>> Hmm, somehow I think NUMA doesn't really come in to play with a single 
>> socket whatevercore cpu's.
>>
> 
> This is correct - a single physical socket is one NUMA node.  However,
> in the event of a process needing 6GB of RAM, it will get the memory
> from another node prior to going to swap for it.  There is a performance
> penalty with inter-node memory allocation, but it's not really something
> that can be avoided, I don't think.
> 
> Again, I could be totally wrong.
> 
> --
> Taroon-list mailing list
> Taroon-list@redhat.com
> https://www.redhat.com/mailman/listinfo/taroon-list
> 
> 
