'Re: SSI-NUMA-MP; cluster or not?; interprocess communication'


List:       linux-cluster
Subject:    Re: SSI-NUMA-MP; cluster or not?; interprocess communication
From:       "Bill Todd" <billtodd () foo ! mv ! com>
Date:       2001-07-19 5:54:10


----- Original Message -----
From: "Stan Hoeppner" <Stan@hardwarefreak.devastation.cc>
To: <linux-cluster@nl.linux.org>
Sent: Wednesday, July 18, 2001 9:53 PM
Subject: RE: SSI-NUMA-MP; cluster or not?; interprocess communication

I was going to respond to JA's message, but Greg beat me to it and covered
some of what I would have said.  And responding here may actually focus
better on some points.


...

> For the purposes of this list, I believe the definition of "cluster", or
> "machine cluster" is more analogous to creating a single large machine of a
> number of smaller ones, and getting as close as possible to the single large
> machine paradigm.

That seems to be JA's stance as well.  I'd offer a pretty much opposite
suggestion:

There are already good approaches to the single large machine paradigm, and
Moore's law works fast enough that I question just how general a need there
is to create yet another one that will almost by definition not be as good
(at acting like a single large machine) as SMP and NUMA (at least not-too-NU
NUMA) already are.

To look at it another way, if you want to act like a single large machine
(for the things that a single large machine is best at - single large,
relatively unpartitionable applications), the first thing your cluster
really ought to support is spreading the individual threads of a single
process (as well as the process's virtual memory) across multiple cluster
nodes - since that's how a process designed to run optimally on a single
large machine will be structured.

If you can't do that, you haven't really provided an environment that's a
good replacement for the single large machine.  And doing it at all
transparently requires two things.  The first is *very* low-latency
inter-node communication (and in this context one should remember that,
while Infiniband will certainly lower communication latency, by the time
it's available - especially at low cost - processor, cache, and memory
latencies will have decreased a good deal further as well).  The second is
what I suspect might be viewed as a completely unacceptable level of kernel
intrusion (let alone rather a lot of just plain design/implementation effort
you may not have been planning on).

Once you back away (if I'm not assuming too much in this regard) from
supporting distributed processes at the thread level, you enter the realm
where the large applications you wish to run are already partitioned into
separate processes that likely pay at least some heed to minimizing the
amount of IPC they require (if only because intra-process
communication/synchronization usually has noticeably lower overheads than
IPC even on a single node, unless non-message-passing IPC such as local
shared memory is used).  The fact that the application is already aware of
this partitioning at least potentially makes it moderately easy to adapt
such an application to almost *any* cluster approach that provides
reasonably (not even very) low-latency inter-node communication, including
existing MPI-style (if I've got that right, not being familiar with them)
approaches.
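
To put rough numbers on the contrast between thread-level distribution and
already-partitioned processes, here is a minimal cost-model sketch.  Every
figure in it (compute time, latency, operation counts) is a made-up
assumption chosen only to illustrate the shape of the argument, not a
measurement of any real interconnect or application, and `run_time` is a
hypothetical helper.

```python
# Hypothetical cost model: every number here is an illustrative assumption.

def run_time(compute_s, remote_ops, latency_s):
    """Total time = local compute + one latency charge per inter-node operation."""
    return compute_s + remote_ops * latency_s

COMPUTE_S = 10.0     # assumed pure compute time for the whole job, in seconds
LATENCY_S = 50e-6    # assumed inter-node round trip (50 microseconds)

# Threads of one process spread across nodes: assume 10 million of the
# job's shared-memory references end up crossing a node boundary.
fine_grained = run_time(COMPUTE_S, remote_ops=10_000_000, latency_s=LATENCY_S)

# Already-partitioned processes: assume 10,000 explicit messages in total.
partitioned = run_time(COMPUTE_S, remote_ops=10_000, latency_s=LATENCY_S)

print(f"fine-grained sharing: {fine_grained:.1f} s")
print(f"partitioned messages: {partitioned:.1f} s")
```

Under these assumptions the partitioned job barely notices the interconnect,
while the transparently distributed one is dominated by it, which is why
merely "reasonably" low latency can be enough for the former.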

So instead of focusing on a cluster as a lower-cost substitute for a large
SMP/NUMA system, consider thinking about clustering as *orthogonal* to the
dimension that SMP/NUMA scales over - though without compromising a
cluster's ability to act as a large-system alternative for
reasonably-partitionable applications (since such partitioning isn't
transparent anyway).

Viewed that way, clustering offers increased availability (via mechanisms
ranging from simple fail-over through concurrently-accessed-shared-storage
N+1 redundancy configurations to geographically-separated replica sites) and
increased scalability (in that each cluster node can potentially be the
largest single SMP/NUMA system you can buy):  there's still some functional
overlap with SMP/NUMA (the 'scale up and/or scale out' approaches to
expansion), but this kind of clustering seems to add a good deal more unique
value to one's configuration options than a form that simply attempts to act
as a low-cost substitute for SMP/NUMA.

Note that when the cluster is used to provide a service you still want the
features that allow the cluster to be viewed *from the outside* as a single
machine (even if some of this support is built into clients, which can make
life quite a bit easier in many cases):  it's *internally* that one doesn't
need (and in some cases may actually not want) that level of transparency.
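
As a rough sketch of what "support built into clients" could look like, the
client below hides several nodes behind one call and simply moves on to the
next node when one is down.  The names (`ClusterClient`, `NodeDown`,
`dead_node`, `live_node`) are hypothetical and stand in for whatever
transport a real service would use; this is one simple possibility, not a
description of any particular product.

```python
class NodeDown(Exception):
    """Raised by a (hypothetical) node handle that cannot serve a request."""


class ClusterClient:
    """Presents several nodes as one service by retrying on the next node."""

    def __init__(self, nodes):
        self.nodes = list(nodes)  # each node is a callable: request -> response

    def call(self, request):
        last_error = None
        for node in self.nodes:
            try:
                return node(request)
            except NodeDown as err:
                last_error = err  # remember the failure and try the next node
        raise RuntimeError("no cluster node available") from last_error


def dead_node(request):
    # Stand-in for a node that has failed.
    raise NodeDown("node A is down")


def live_node(request):
    # Stand-in for a healthy node.
    return f"node B handled {request}"


client = ClusterClient([dead_node, live_node])
result = client.call("query-1")
print(result)
```

The caller sees a single service; only the client library knows that there
is more than one machine behind it.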

I think Oracle's experiences provide at least some guidance in this area.
They started out being limited by the size of SMP machines, then created OPS
to take advantage of VMS (and more recently other) clusters for more
head-room, and recently have (IIRC) been merging partitioning approaches
with OPS-style configurations to increase their head-room even more - such
that they can run the areas of a database that are too intertwined to
partition well on sub-clusters composed of SMP/NUMA systems (i.e.,
sub-clusters which themselves scale to pretty respectable levels) and
aggregate these relatively-independent sub-clusters to handle the total data
set.  This also sounds as if it might be at least a bit like the
'hierarchical clustering' I've heard mentioned a couple of times here (as an
aside, it would surprise me if GFS - or any other extant file system code
out there - supported something like that in a really scalable manner).
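
The two-level arrangement described above can be sketched as a routing
function: first pick the sub-cluster that owns a partition, then pick any
node inside it.  The partition names, node names, and the `route` helper are
all invented for illustration, and this is not a description of how Oracle
(or anyone else) actually implements it.

```python
import zlib

# Hypothetical layout: each sub-cluster owns one partition of the data set,
# and any node inside a sub-cluster can serve that partition.
SUB_CLUSTERS = {
    "orders": ["smp-1", "smp-2"],
    "inventory": ["smp-3", "smp-4", "smp-5"],
}


def route(partition, key):
    """Return the node that should handle `key` within `partition`."""
    nodes = SUB_CLUSTERS[partition]                 # level 1: owning sub-cluster
    index = zlib.crc32(key.encode()) % len(nodes)   # level 2: stable node choice
    return nodes[index]


print(route("orders", "customer-42"))
print(route("inventory", "part-7"))
```

Requests for different partitions never touch the same sub-cluster, so the
tightly intertwined data stays inside one group of SMP/NUMA nodes while the
total data set scales by adding groups.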

At least that's the kind of clustering I'm familiar with.  But I acknowledge
that I'm not at all familiar with HPC/scientific clustering, so there may be
aspects of it that this kind of clustering would need to be extended to
cover well.

...

> What are everyone's ideas on this?

Well, you did ask...

- bill

>
> Stan Hoeppner
> TheHardwareFreak
> www.hardwarefreak.devastation.cc
> stan@hardwarefreak.devastation.cc
>
> Linux-cluster: generic cluster infrastructure for Linux
> Archive:       http://mail.nl.linux.org/linux-cluster/
>

