List:       linux-kernel
Subject:    Re: NUMA & classzones (was Whining about 2.5)
From:       Rob Landley <landley@trommello.org>
Date:       2001-10-04 23:55:25

On Thursday 04 October 2001 19:39, Martin J. Bligh wrote:
> I'll preface this by saying I know a little about the IBM NUMA-Q
> (aka Sequent) hardware, but not very much about VM (or anyone
> else's NUMA hardware).

I saw the IBM guys in Austin give a talk on it last year, which A) had more 
handwaving than Star Wars episode zero, B) had FAR more info about politics 
in the AIX division than about NUMA, and C) involved the main presenter 
letting us know he was leaving IBM at the end of the week...

Kind of like getting details about CORBA out of IBM.  And I worked there when 
I was trying to do that.  (I was once in charge of implementing CORBA 
compliance for a project, and all they could find to define it for me at the 
time was a marketing brochure.  Sigh...)

> >> This gives obvious problems for NUMA, suppose you have 4
> >> nodes with zones 1A, 1B, 1C, 2A, 2B, 2C, 3A, 3B, 3C, 4A,
> >> 4B and 4C.
> >
> > Is there really a NUMA machine out there where you can DMA out of another
> > node's 16 bit ISA space?  So far the differences in the zones seem to be
>
> If I understand your question (and my hardware) correctly, then yes. I
> think we (IBM NUMA-Q) can DMA from anywhere to anywhere (using PCI cards,
> not ISA, but we could still use the ISA DMA zone).

Somebody made a NUMA machine with an ISA bus?  Wow.  That's perverse.  I'm 
impressed.  (It was more a "when do we care" question...)

Two points:

1) If you can DMA from anywhere to anywhere, it's one big zone, isn't it?  
Where does the NUMA come in?  (I guess it's more expensive to DMA between 
certain devices/memory pages?  Or are we talking sheer processor access 
latency here, nothing to do with devices at all...?)

2) A processor-centric view of memory zones is not the whole story.  Look at 
the zones we have now.  The difference between the ISA zone, the PCI zone, 
and high memory has nothing to do with the processor*.  It's a question of 
which devices (which bus/bridge really) can talk to which pages.  In current 
UP/SMP systems, the processor can talk to all of them pretty much equally.

* modulo the Intel 36-bit extension stuff (PAE), which I must admit I haven't 
looked closely at.  Don't have the hardware.  Then again, that's sort of the 
traditional NUMA problem of "some memory is a bit funky for the processor to 
access".  Obviously I'm not saying I/O is the ONLY potential difference 
between memory zones...

So we need zones defined relative not just to processors (or groups of 
processors that have identical access profiles), but also relative to 
I/O devices and busses.  Meaning zones may become a driver issue.

This gets us back to the concept of "nodes".  Groups of processors and 
devices that collectively have a similar view of the world, memory-wise.  Is 
this a view of the problem that current NUMA thinking is using, or not?
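
Something like this is how I picture a "node" as a data structure.  (A 
userspace C sketch, every name in it invented, definitely not real kernel 
code.)

#include <stdio.h>

#define MAX_NODES 4

struct mem_range { unsigned long start, end; }; /* physical addresses */

struct node {
        int id;
        unsigned long cpu_mask;         /* which processors live here */
        unsigned long dev_mask;         /* which busses/bridges live here */
        struct mem_range local_mem;     /* memory this node sees as cheap */
};

/* A driver allocating a DMA buffer would want memory local to the
   DEVICE's node, not to whichever processor happens to be running. */
static struct node *node_of_device(struct node *nodes, int dev)
{
        int i;

        for (i = 0; i < MAX_NODES; i++)
                if (nodes[i].dev_mask & (1UL << dev))
                        return &nodes[i];
        return NULL;
}

int main(void)
{
        struct node nodes[MAX_NODES] = {
                { 0, 0x0f, 0x1, { 0x00000000, 0x40000000 } },
                { 1, 0xf0, 0x2, { 0x40000000, 0x80000000 } },
        };
        struct node *n = node_of_device(nodes, 1);

        if (n)
                printf("device 1 is local to node %d\n", n->id);
        return 0;
}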

> But it probably doesn't make sense to define A,B, and C for each node. For
> a  start, we don't use ISA DMA (and probably no other NUMA box does
> either). If HIGHMEM is the stuff above 900Mb or so (and assuming for a
> moment that we have 1Gb per node), then we probably don't need
> NORMAL+HIGHMEM for each node either.
>
> 0-900Mb = NORMAL (1A)
> 900-1Gb = HIGHMEM_NODE1 (1B)
> 1G-2Gb = HIGHMEM_NODE2 (2)
> 2G-3Gb = HIGHMEM_NODE3 (3)
> 3Gb-4Gb = HIGHMEM_NODE4 (4)

By highmem you mean memory our I/O devices can't DMA out of?

Will all the I/O devices in the system share a single pool of buffer memory, 
or will devices be attached to nodes?

(My thinking still turns to making shared memory beowulf clusters act like 
one big system.  The hardware for that will continue to be cheap: rackmount a 
few Tyan Thunder dual Athlon boards.  You can distribute drives for storage 
and swap space (even RAID them if you like), and who says such a cluster has 
to put all external access through a single node?)

> If we have less than 1Gb per node, then one of the other nodes will have 2
> zones - whichever contains the transition point from NORMAL-> HIGHMEM.

So "normal" belongs to a specific node, so all devices basically belong to 
that node?

> Thus number of zones = number of nodes + 1.
> (to my mind, if we're frigging with the zone patterns for NUMA, getting rid
> of DMA zone probably isn't too hard).

You still have the problem of doing DMA.  Now this is a separable problem 
boiling down to either allocation and locking of DMAable buffers the 
processor can directly access, or setting up bounce buffers when the actual 
I/O is kicked off.  (Or doing memory mapped I/O, or PIO.  But all that is 
still a bit like one big black box, I'd think.  And to do it right, you need 
to know which device you're doing I/O to, because I really wouldn't assume 
every I/O device on the system shares the same pool of DMAable memory.  Or 
that we haven't got stuff like graphics cards that have their own RAM we map 
into our address space.  Or, for that matter, that physical memory mapping in 
one node makes anything directly accessible from another node.)
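
Here's a sketch of that allocate-or-bounce decision, assuming each device 
advertises a physical window it can DMA to.  (Invented names, userspace 
stand-in, nothing like the real block layer.)

#include <stdlib.h>
#include <string.h>

struct device_caps {
        unsigned long dma_start, dma_end; /* physical window it can reach */
};

static int device_can_reach(struct device_caps *dev, unsigned long phys,
                            size_t len)
{
        return phys >= dev->dma_start && phys + len <= dev->dma_end;
}

/* Stand-in allocator: pretend it returns memory inside the window. */
static void *alloc_low(size_t n) { return malloc(n); }

/* Return the buffer to hand to the hardware: the page itself if the
   device can reach it, otherwise a bounce buffer allocated from memory
   the device CAN reach, with the data copied in. */
static void *dma_buffer_for(struct device_caps *dev, void *data,
                            unsigned long phys, size_t len)
{
        void *bounce;

        if (device_can_reach(dev, phys, len))
                return data;            /* direct DMA, no copy */
        bounce = alloc_low(len);        /* must come from the window */
        if (bounce)
                memcpy(bounce, data, len);
        return bounce;
}

int main(void)
{
        struct device_caps isa = { 0, 16UL << 20 }; /* old ISA: first 16MB */
        char data[64] = "some payload";
        /* Pretend the page lives at 2GB physical: outside the window. */
        void *buf = dma_buffer_for(&isa, data, 0x80000000UL, sizeof(data));

        if (buf != data)
                free(buf);              /* it was a bounce buffer */
        return 0;
}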

> If I were allowed to define classzones as a per-processor concept (and I
> don't know enough about VM to know if that's possible), it would seem to
> fit nicely. Taking the map above, the classzones for a processor on node 3
> would be:
>
> {3} , {1A + 1B + 2+ 3 + 4}


Not just per processor.  Think about a rackmount shared memory beowulf 
system, page faulting through the network.  With quad-processor boards in 
each 1U, and BLAZINGLY FAST interconnects in the cluster.  Now teach that to 
act like NUMA.

Each 1U has four processors with identical performance (and probably one set 
of page tables if they share a northbridge).  Assembling NUMA systems out of 
closely interconnected SMP systems.
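
If I'm reading the classzone idea right, each node would just carry an 
ordered fallback list of zones, local first.  A toy version of the node 3 
example above (made-up names, nothing like the actual allocator):

#include <stdio.h>

enum zone_id { Z_1A, Z_1B, Z_2, Z_3, Z_4, Z_END };

/* Node 3's allocation order: its own memory, then everything else. */
static const enum zone_id node3_zonelist[] = {
        Z_3, Z_1A, Z_1B, Z_2, Z_4, Z_END
};

/* Pretend only zone 1B has free pages at the moment. */
static int zone_has_free(enum zone_id z) { return z == Z_1B; }

static enum zone_id alloc_from(const enum zone_id *list)
{
        for (; *list != Z_END; list++)
                if (zone_has_free(*list))
                        return *list;
        return Z_END;                   /* out of memory everywhere */
}

int main(void)
{
        printf("allocated from zone %d\n", alloc_from(node3_zonelist));
        return 0;
}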

> If possible, I'd like to avoid making every single driver NUMA aware.

It may not be a driver issue.  It may be a bus issue.  If there are two PCI 
busses in the system, do they HAVE to share one set of physical memory 
mappings?  (NUMA sort of implies we have more than one northbridge.  Dontcha 
think we might have more than one southbridge, too?)

> Partly because I'm lazy, but also because I think it can be simpler than
> this. The mem subsystem should just be able to allocate something that's as
> good as possible for that card, without the driver worrying explicitly
> about zones (though it may have to specify if it can do 32/64 bit DMA).

It's not just 32/64 bit DMA.  You're assuming every I/O device in the system 
is talking to exactly the same pool of memory.  The core assumption of NUMA 
is that the processors aren't doing that, so I don't know why the I/O devices 
necessarily should.  (Maybe they do, what do I know.  It would be nice to 
hear from somebody with actual information...)

And if they ARE all talking to one pool of memory, then the whole NUMA 
question becomes a bit easier, actually...  The flood of zones we were so 
worried about (Node 18's processor sending packets through a network card 
living on node 13) can't really happen, can it?

> see http://lse.sourceforge.net/numa - there should be some NUMA API
> proposals there for explicit stuff.

Thanks for the link. :)

> > I always think of numa as the middle of a continuum.  Zillion-way SMP
> > with enormous L1 caches on each processor starts acting a bit like NUMA
> > (you don't wanna go out of cache and fight the big evil memory bus if you
> > can at all avoid it, and we're already worrying about process locality
> > (processor affinity) to preserve cache state...).
>
> Kind of, except you can explicitly specify which bits of memory you want to
> use, rather than the hardware working it out for you.

Ummm...

Is the memory bus somehow physically reconfiguring itself to make some chunk 
of memory lower or higher latency when talking to a given processor?  I'm 
confused...

> > Shared memory beowulf clusters that
> > page fault through the network with a relatively low-latency interconnect
> > like myrinet would act a bit like NUMA too.
>
> Yes.

But that's the bit that CLEARLY works in terms of nodes, and also which has 
devices attached to different nodes, requiring things like remote tasklets to 
access remote devices, and page migration between nodes to do repeated access 
on remote pages.  (Not that this is much different than sending a page back 
and forth between processor caches in SMP.  Hence the continuum I was talking 
about...)

The multiplicative complexity I've heard fears about on this list seems to 
stem from an interaction between "I/O zones" and "processor access zones" 
creating an exponential number of gradations when the two qualities apply to 
the same page.  But in a node setup, you don't have to worry about it.  A 
node has its local memory and its local I/O, and it inflicts work on remote 
zones when it needs to deal with their resources.  There may be one big 
shared pool of I/O memory or some such (IBM's NUMA-Q), but in that case it's 
the same for all processors.  Each node has one local pool, one remote pool, 
and can just talk to a remote node when it needs to (about like SMP).

I THOUGHT NUMA had a gradient, of "local, not as local, not very local at 
all, darn expensive" pages that differed from node to node, which would be a 
major pain to optimize for, yes.  (I was thinking motherboard trace length and 
relaying stuff several hops down a bus...)  But I haven't seen it yet.  And 
even so, "not local=remote" seems to cover the majority of the cases without 
exponential complexity...
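
Here's the same question as a table, with made-up cost units.  If the 
off-diagonal entries are all roughly equal, "local vs. remote" is all the 
allocator needs to know; a real gradient is what would hurt:

#include <stdio.h>

#define N 4

/* node_distance[i][j]: cost for node i to touch node j's memory. */
static const int node_distance[N][N] = {
        { 10, 40, 40, 40 },
        { 40, 10, 40, 40 },
        { 40, 40, 10, 40 },
        { 40, 40, 40, 10 },
};

int main(void)
{
        int i, j, flat = 1;

        for (i = 0; i < N; i++)
                for (j = 0; j < N; j++)
                        if (i != j && node_distance[i][j] != node_distance[0][1])
                                flat = 0;
        printf("topology is %s\n",
               flat ? "flat: local/remote is all you need"
                    : "a gradient: the allocator has to rank nodes");
        return 0;
}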

I am still highly confused.

> > (Obviously, I haven't played
> > with the monster SGI hardware or the high-end stuff IBM's so proud of.)
>
> There's a 16-way NUMA (4x4) at OSDL (www.osdlab.org) that's running
> linux and available for anyone to play with, if you're so inclined. It
> doesn't understand very much of its NUMA-ness, but it works. This is the
> IBM NUMA-Q hardware ... (I presume that's what you're referring to).

That's what I've heard the most about.  I'm also under the impression that 
SGI was working on NUMA stuff around the Origin line, and that Sun had 
some monsters in the works as well...

It still seems to me that either clustering or zillion-way SMP is the most 
interesting area of future supercomputing, though.  Sheer price to 
performance.  For stuff that's not very easily separable into chunks, they've 
got 64 way SMP working in the lab.  For stuff that IS chunkable, thousand box 
clusters are getting common.  If the interconnects between boxes are a 
bottleneck, 10gigE is supposed to be out in late 2003, last I heard, meaning 
gigE will get cheap...  And for just about everything else, there's Moore's 
Law...

Think about big fast-interconnect shared memory clusters.  Resources are 
either local or remote through the network, you don't care too much about 
gradients.  So the "symmetrical" part of SMP applies to decisions between 
nodes.  There's another layer of decisions in that a node may be an SMP box 
in and of itself (probably will), but there's only really two layers to worry 
about, not an exponential amount of complexity where each node has a 
potentially unique relationship with every other node...

People wanting to run straightforward multithreaded programs using shared 
memory and semaphores on big clusters strikes me as an understandable goal, 
and the drive for fast (low latency) interconnects to make that feasible is 
something I can see a good bang for the buck coming out of.  Here's the 
hardware that's widely/easily/cheaply available, here's what programmers want 
to do with it.  I can see that.

The drive to support monster mainframes which are not only 1% of the market 
but which get totally redesigned every three or four years to stay ahead of 
Moore's law...  I'm not quite sure what's up there.  How much of the market 
can throw that kind of money to constantly offset massive depreciation?

Is the commodity hardware world going to inherit NUMA (via department level 
shared memory beowulf clusters, or just plain the hardware to do it getting 
cheap enough), or will it remain a niche application?

As I said: master of stupid questions.  The answers are taking a bit more 
time...

> > In a way, swap space on the drives could be considered a
> > performance-delimited physical memory zone.  One the processor can't
> > access directly, which involves the allocation of DRAM bounce buffers. 
> > Between that and actual bounce buffers we ALREADY handle problems a lot
> > like page migration between zones (albeit not in a generic, unified
> > way)...
>
> I don't think it's quite that simple. For swap, you always want to page
> stuff back in before using it. For NUMA memory on remote nodes, it may or
> may not be worth migrating the page.

Bounce buffers.  This is new?  Seems like the same locking issues, even...

> If we chose to migrate a process
> between nodes, we could indeed set up a system where we'd page fault pages
> in from the remote node as we used them, or we could just migrate the
> working set with the process.

Yup.  This is a problem I've heard discussed a lot: deciding when to migrate 
resources (pages, processes, etc.).  It also seems to be a separate layer of 
the problem, one that isn't too closely tied to the initial allocation 
strategy.  (It may feed back into allocation, but really that just seems to 
be free/alloc plus maybe adjusting the weighting/ageing.  Am I wrong?)

I.E. migration strategy and allocation strategy aren't necessarily the same 
thing...
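
The migration decision itself could be as dumb as a cost/benefit test, kept 
completely separate from the allocator.  A sketch, with all the numbers 
invented:

#include <stdio.h>

/* Migrate a page when the remote faults we keep taking cost more than
   one copy of the page across the interconnect would. */
static int should_migrate(unsigned long remote_faults,
                          unsigned long cost_per_fault,
                          unsigned long copy_cost)
{
        return remote_faults * cost_per_fault > copy_cost;
}

int main(void)
{
        /* 50 remote faults at 20 cost units each vs. one 400-unit copy. */
        printf("migrate: %s\n", should_migrate(50, 20, 400) ? "yes" : "no");
        return 0;
}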

> Incidentally, swapping on NUMA will need per-zone swapping even more,
> so I don't see how we could do anything sensible for this without a
> physical to virtual mem map. But maybe someone knows how.

There you got me.  I DO know that you can have multiple virtual mappings for 
each physical page, so it's not as easy as the other way around, but this 
could be why the linked list was invented...

(I believe Rik is working on patches that cover this bit.  Haven't looked at 
them yet.)
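
The linked list version I have in mind: hang a chain of back-pointers off 
each physical page, one entry per virtual mapping.  (I'm ASSUMING that's 
roughly the shape of the reverse-mapping patches; I haven't read them.  
Invented names, userspace sketch.)

#include <stdio.h>
#include <stdlib.h>

struct pte_chain {
        unsigned long virt;        /* one virtual address mapping this page */
        struct pte_chain *next;
};

struct page {
        struct pte_chain *mappings; /* all virtual mappings of this page */
};

static void add_mapping(struct page *p, unsigned long virt)
{
        struct pte_chain *c = malloc(sizeof(*c));

        if (!c)
                return;                 /* sketch; real code would handle this */
        c->virt = virt;
        c->next = p->mappings;
        p->mappings = c;
}

int main(void)
{
        struct page p = { NULL };
        struct pte_chain *c;

        add_mapping(&p, 0x08048000UL);  /* two processes sharing one page */
        add_mapping(&p, 0x40001000UL);
        for (c = p.mappings; c; c = c->next)
                printf("mapped at %#lx\n", c->virt);
        return 0;
}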

> > So I thought the cheap and easy way out is to have each node know what
> > resources it considers "local", what resources are a pain to access
> > (possibly involving a tasklet on another node), and a way to determine
> > when tasks requiring a lot of access to them might better be migrated
> > directly to a node where they're significantly cheaper to the point where
> > the cost of migration gets paid back.  This struck me as the 90% "duct
> > tape" solution to NUMA.
>
> Pretty much. I don't know of any situation when we need a tasklet on
> another node - that's a pretty horrible thing to have to do.

Think shared memory beowulf.

My node has a hard drive.  Some other node wants to read and write to my hard 
drive, because it's part of a larger global file system or storage area 
network or some such.

My node has a network card.  There are three different connections to the 
internet, and they're on separate nodes to avoid single point of failure 
syndrome.

My node has a video capture card.  The cluster as a whole is doing realtime 
video acquisition and streaming for a cable company that saw the light and 
switched over to MP4 with a big storage cluster.  Incoming signals from cable 
(or movies fed into the system for pay per view) get converted to MP4 
(processor intensive, cluster needed to keep up with HDTV, especially 
multiple channels) and saved in the storage area network part, and subscriber 
channels get fetched and fed back out.  (Probably not as video, probably as a 
TCP/IP stream to a set top box.  The REAL beauty of digital video isn't 
trying to do "moves on demand", it's having a cluster stuffed with old 
episodes of Mash, ER, The West Wing, Star Trek, The Incredible Hulk, Dark 
Shadows, and Dr. Who which you can call up and play at will.  Syndicated 
content on demand.  EASY task for a cluster to do.  Doesn't NEED to think 
NUMA, that could be programmed as beowulf.  But we could also be using the 
Mach microkernel on SMP boxes, it makes about as much sense.  Beowulf is 
message passing, microkernels are message passing, CORBA is message 
passing...  Get fast interconnects, message passing becomes less and less of 
a good idea...)

> > So what hardware inherently requires a multi-tier NUMA approach beyond
> > "local stuff" and "everything else"?  (I suppose there's bound to be some
> > linearly arranged system with a long gradual increase in memory access
> > latency as you go down the row, and of course a node in the middle which
> > has a unique resource everybody's fighting for.  Is this a common setup
> > in NUMA systems?)
>
> The next generation of hardware/chips will have more hierarchical stuff.
> The shorter / smaller a bus is, the faster it can go, so we can tightly
> couple small sets faster than big sets.

Sure.  This is electronics 101, the speed of light is not your friend.  
(Intel fought and lost this battle with the Pentium 4's pipeline, more haste 
less speed...)

But the question of how much of a gradient we care about remains.  It's 
either local, or it's not local.

The question is latency, not throughput.  (Rambus got this wrong too: more 
throughput, worse latency...)  Lots of things use loops in an attempt to get 
fixed latency: stuff wanders by at known intervals so it's easy to fill up 
slots on the bus because you know when your slot will be coming by...

NUMA is also a question of latency.  Gimme high end fiber stuff and I could 
have a multi-gigabit pipe between two machines in different buildings.  
Latency will still make it less fun to try to page access DRAM through than 
your local memory bus, regardless of relative throughput.
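
Back of the envelope, assuming typical numbers: light in fiber covers about a 
meter every 5 nanoseconds, so 200 meters between buildings is roughly a 
microsecond each way before you count switches and protocol overhead, versus 
something like 100-150 nanoseconds for a local DRAM access.  The pipe can be 
as fat as you like; every round trip still costs you an order of magnitude.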

> > And then, of course, there's the whole question of 3D accelerated video
> > card texture memory, and trying to stick THAT into a zone. :)  (Eew! 
> > Eew!  Eew!) Yeah, it IS a can of worms, isn't it?
>
> Your big powerful NUMA server is going to be used to play Quake on? ;-)
> Same issue for net cards, etc though I guess.

Not Quake, video capture and streaming.  Big market there, which beowulf 
clusters can address today, but in a fairly clumsy way.  (The sane way to 
program that is to have one node dispatching/accepting frames to other nodes, 
so beowulf isn't so bad.  But message passing is not a way to control 
latency, and latency is your real problem when you want to avoid dropping 
frames.  Buffering helps this, though.  Five seconds of buffer space covers a 
multitude of sins...)

> > But class/zone lists still seem fine for processors.  It's just a
> > question of doing the detective work for memory allocation up front, as
> > it were.  If you can't figure it out up front, how the heck are you
> > supposed to do it efficiently at allocation time?
>
> If I understand what you mean correctly, we should be able to lay out
> the topology at boot time, and work out which phys mem locations will
> be faster / slower from any given resource (proc, PCI, etc).

Ask Andrea.  I THINK so, but I'm not the expert.  (And Linus seems to 
disagree, and he tends to have good reasons. :)

> > This
> > chunk of physical memory can be used as DMA buffers for this PCI bridge,
> > which can only be addressed directly by this group of processors anyway
> > because they share the IO-APIC it's wired to...
>
> Hmmm ... at least in the hardware I'm familiar with, we can access any PCI
> bridge or any IO-APIC from any processor. Slower, but functional.

Is the speed difference along a noticeably long gradient, or more "this group 
is fast, the rest is not so fast"?

And do the bridges and IO-APICS cluster with processors into something that 
looks like nodes, or do they overlap in a less well defined way?

> > Um, can bounce buffers do permanent page migration to another zone?  (Since
> > we have to allocate the page ANYWAY, might as well leave it there till
> > it's evicted, unless of course we're very likely to evict it again pronto
> > in which case we want to avoid bouncing it back...)
>
> As I understand zones, they're physical, therefore pages don't migrate
> between them.

And processors are physical, so tasks don't migrate between them?

> The data might be copied from the bounce buffer to a
> page in another zone, but  ...

Virtual page, physical page...

> Not sure if we're using quite the same terminology. Feel free to correct
> me.

I'm more likely to receive correction.  I'm trying to learn and understand 
the problem...

> > Hmmm...  Then under NUMA there
> > would be the "processor X can't access page in new location easily to
> > fill it with new data to DMA out..."  Fun fun fun...)
>
> On the machines I'm used to, there's no problem with "can't access", just
> slower or faster.

Well, with shared memory beowulf clusters you could have a tasklet on the 
other machine lock the page and spit you a copy of the data, so "can't" 
doesn't work there either.  That's where the word "easily" came in...

But an attempt to DMA into or out of that page from another node would 
involve bounce buffers on the other node...

> > Since 2.4 isn't supposed to handle NUMA anyway, I don't see what
> > difference it makes.  Just use ANYTHING that stops the swap storms,
> > lockups, zone starvation, zero order allocation failures, bounce buffer
> > shortages, and other such fun we were having a few versions back.  (Once
> > again, this part now seems to be in the "it works for me"(tm) stage.)
> >
> > Then rip it out and start over in 2.5 if there's stuff it can't do.
>
> I'm not convinced that changing directions all the time is the most
> efficient way to operate

No comment on the 2.4.0-2.4.10 VM development process will be made by me at 
this time.

> - it would be nice to keep building on work
> already done in 2.4 (on whatever subsystem that is) rather than rework
> it all, but maybe that'll happen anyway, so ....

At one point I thought the purpose of a stable series was to stabilize, 
debug, and tweak what you'd already done, and architectural changes went in 
development series.  (Except for the occasional new driver.)  As I said, I 
tend to be wrong about stuff...

> Martin.

Rob
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
