'Re: ETCP Project & ha/hp overlap'

List:       linux-ha
Subject:    Re: ETCP Project & ha/hp overlap
From:       David Brower <dbrower () us ! oracle ! com>
Date:       2001-03-01 23:10:23

Alan Robertson wrote:

> > ETCP helps within migration clusters because by using it,
> > it becomes possible to have a network connection follow
> > a process, instead of tcp IO being relayed through the
> > original node.
> 
> The hard problem isn't at the transport layer, but at the application layer
> - in synchronizing application state.  For migration clusters, this is
> *comparatively* easy, but for failover clusters this is typically very hard.
> 
> I've added a link to this page on the High-Availability Linux web site:
> http://linux-ha.org/

I went and looked at etcp, and it is an example of the sort of
painful things that OS people need to do when apps don't have 
their own checkpoint restart story straight.  While it appears 
to be interesting, and may hold hope in the long term, I don't
think ETCP is interesting on the failover server side -- it is 
interesting for its original domain (mobile clients) and for 
migration servers, which are the same thing in reverse.  In both those
cases, the application state remains constant on both sides
of the connection that got moved.  In failover-aware h/a,
that app state is usually lost, and this makes
the connection transparency moot for the most part.

This is a good example of the sort of thing I sent to some
people privately earlier today, below.

-dB

To: Lars Marowsky-Bree <lmb@suse.de>
CC: Chris Wright <chris@wirex.com>
Subject: Re: [riel@conectiva.com.br: [ANNOUNCE] linux-cluster list]

I think there are piles of overlap, but I also suspect there are
enough differences that there will be significantly different flavors.
AlanR pointed out some of the complexities of process migration,
as done in Mosix.  It seems to me that there are different levels
of checkpointing and restartability that will distinguish HP from
what I'd rather call "commercial" workloads than HA.  Both
workloads need HP, and both need HA, but the tradeoffs on
migratability differ greatly.  In the non-scientific space, it is
easy to imagine application platforms (eg: apache mods, database
engines, java/ejb environments) that manage application state in
a way that supports failover w/o particular OS support.  This won't
keep the OS guys from trying to checkpoint processes and migrate
them on failure, but it won't be as efficient.  An app can know that
the state needing recovery is only 48k of the 80M virtual space, and
that there are things around to recover the communication state.  The
OS won't know this, and will need to stash the full 80M, and figure
out how to recover all the tcp connections.  That is what makes
true transparency at the OS level hard.
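
To make the 48k-vs-80M point concrete, here is a minimal sketch, assuming a
hypothetical application that knows which small piece of its state must
survive a failover.  The function names, the JSON file format and the state
fields are illustrative assumptions, not any particular platform's API.  The
application saves only that state, writing a temp file and renaming it so a
takeover node never reads a half-written checkpoint.

```python
import json
import os
import tempfile

# Sizes taken from the example above: what a whole-process OS snapshot must
# save versus what the application knows it actually needs.
PROCESS_VIRTUAL_BYTES = 80 * 1024 * 1024
APP_STATE_BYTES = 48 * 1024


def save_checkpoint(state: dict, path: str) -> None:
    """Write only the application's recovery state (hypothetical helper).

    The temp file is created in the destination directory so os.replace()
    is a same-filesystem rename, which POSIX makes atomic.  A fully durable
    version would also fsync the containing directory after the rename.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())  # data is on disk before the rename
        os.replace(tmp_path, path)
    except BaseException:
        os.unlink(tmp_path)  # don't leave a partial temp file behind
        raise


def load_checkpoint(path: str) -> dict:
    """Read the recovery state back on the takeover node (hypothetical)."""
    with open(path) as f:
        return json.load(f)


if __name__ == "__main__":
    ratio = PROCESS_VIRTUAL_BYTES / APP_STATE_BYTES
    print(f"whole-process snapshot is about {ratio:.0f}x larger")
```

The OS-level alternative has no way to learn which 48k matter, so it must
save the whole address space and also reconstruct the TCP connections.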

The OS people are right, though, for naive applications that aren't
written to be effectively restartable.  If the OS has to snapshot whole
processes frequently, it may, in the words of a colleague, "only
be transparent to people who don't have a watch".

OTOH, if we ever find ourselves with far too much CPU and i/o
capacity, then OS checkpoint may be a good idea :)

cheers,
-dB

Lars Marowsky-Bree wrote:

> On 2001-02-28T16:45:13,
>    Chris Wright <chris@wirex.com> said:
>
> > Computational and high availability clusters' problem domains are not
> > 100% divergent.  Recall the roots of GFS for example.
>
> My personal prediction is that in the future, High Availability and High
> Performance clustering will merge completely, because anything else doesn't
> make sense at all.
>
> Sincerely,
>     Lars Marowsky-Brée <lmb@suse.de>

------------------------------------------------------------------------------
Linux HA Web Site:
  http://linux-ha.org/
Linux HA HOWTO:
  http://metalab.unc.edu/pub/Linux/ALPHA/linux-ha/High-Availability-HOWTO.html
------------------------------------------------------------------------------