
List:       beowulf
Subject:    transparent process migration
From:       Greg Lindahl lindahl () cs ! virginia ! edu
Date:       1998-04-16 12:12:00

> 	o  Maddog suggested we create utilities that would allow one to
> 	   force task migration from one machine, so that one could upgrade
> 	   hardware w/o taking down the whole cluster
> 	o  If a node fails, reinitiate its chunks of the computation on
> 	   other workstations.  Perhaps we can do fault tolerance after
> 	   all. :)  of course this does incur a little more overhead, but
> 	   that's to be expected.

Legion has a pretty open architecture for doing these sorts of things.
Processes need to be able to save their state to disk. One good, clean
way is for the program to do it explicitly; most hydrocodes can already
save their state. Another way is a process snapshot, like the SGI cpr
thing or what Condor does. But it's better to allow either approach,
and other mechanisms as well: some processes have much smaller actual
state than their entire process image, and a process image might run
into trouble with files or other weird stuff.
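
To make the "explicit" option concrete, here is a minimal sketch in
Python of a program that saves its own (small) state and resumes from
it. Everything in it is hypothetical and for illustration only: the
names save_state, load_state and run, the JSON file format, and the toy
update rule are not from Legion, cpr or Condor. The stop_after
parameter just simulates a crash or a forced migration.

```python
import json
import os
import tempfile


def save_state(path, state):
    """Write the state atomically: temp file in the same directory, then rename."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)  # a reader sees either the old or the new checkpoint


def load_state(path):
    """Return the last saved state, or None if no checkpoint exists yet."""
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return json.load(f)


def run(path, total_steps, checkpoint_every=10, stop_after=None):
    """Toy computation that resumes from its checkpoint if one is present."""
    state = load_state(path) or {"step": 0, "value": 0.0}
    while state["step"] < total_steps:
        if stop_after is not None and state["step"] >= stop_after:
            return state  # simulated failure: work since the last checkpoint is lost
        state["value"] += 0.5 * state["step"]  # stand-in for the real update
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_state(path, state)
    save_state(path, state)
    return state
```

Calling run() again with the same path, on the same or another machine
that can see the file, picks up at the last checkpoint rather than at
step 0. The saved state here is two numbers, far smaller than the
process image.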

If a node fails, you may have to roll back the entire computation, or
you may be able to just re-do parts of it. Some Legion objects are
stateless, and they can just be recomputed. A typical MPI application
marches in lockstep, so you either have to roll everyone back, or use
message logging to restart just the failed process and have it "catch
up" with the rest of the computation.
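
A rough sketch of the message-logging idea, in Python. It assumes the
process is deterministic, so that replaying the same messages in the
same order rebuilds the same state. The Worker class, the recover
function and the in-memory list used as a log are all hypothetical;
this is not MPI or Legion code, and a real log would have to live
somewhere that survives the node failure.

```python
class Worker:
    """Deterministic process: its state depends only on the messages delivered."""

    def __init__(self, total=0, received=0):
        self.total = total
        self.received = received  # how many messages have been consumed

    def deliver(self, msg):
        self.total += msg
        self.received += 1

    def checkpoint(self):
        return {"total": self.total, "received": self.received}


def recover(checkpoint, log):
    """Rebuild a failed worker from its last checkpoint plus the message log."""
    worker = Worker(**checkpoint)
    # Replay only the messages delivered after the checkpoint was taken.
    for msg in log[worker.received:]:
        worker.deliver(msg)
    return worker


log = []          # stand-in for a log kept outside the failing node
worker = Worker()
saved = None
for i, msg in enumerate([3, 1, 4, 1, 5, 9]):
    log.append(msg)       # log first, then deliver
    worker.deliver(msg)
    if i == 2:
        saved = worker.checkpoint()

# Pretend the node died here; only `saved` and `log` survive.
recovered = recover(saved, log)
```

Only the failed process is rewound to its checkpoint; the senders do
not roll back, because their messages are replayed from the log.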

Once your processes can save their state, you can migrate them to
different machines.

This stuff has certainly been researched a lot; producing usable,
general tools is harder.

-- greg
