
List:       intermezzo-devel
Subject:    InterMezzo recovery
From:       "Peter J. Braam" <braam () cs ! cmu ! edu>
Date:       2000-01-17 4:33:54


Hi Susan,

Here is the document I promised.

- Peter -




InterMezzo File System Recovery

Peter J. Braam
Jan 15, 2000


Overview of the problems
------------------------

InterMezzo is a replicating file system that manages:

1. a cache of files, ultimately to be kept in sync with the server
copy of these files.  The cache is an Ext2/3 file system.

2. a log of modifications to an InterMezzo cache; we refer to this as
the ML.

3. a file system database, aka the FSDB, maintaining extra information
for objects in the cache.


Modifications to the cache can be made by users of the InterMezzo file
system, in which case their updates cause corresponding records to be
generated in the modification log.   Modifications can also be made by
the cache manager in case it receives updates from the file server.
In this case, updates will be made to the FSDB reflecting the new
objects or versions that were introduced in the cache.
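The relationship between the three structures can be sketched as follows. This is a toy model only: the field names and record layout are our illustration, not InterMezzo's actual on-disk format.

```python
from dataclasses import dataclass

@dataclass
class MLRecord:
    recno: int        # monotonically increasing record number
    opcode: str       # e.g. "CREATE", "STORE", "UNLINK"
    path: str
    version: int      # object version after the operation

@dataclass
class FSDBEntry:
    path: str
    version: int      # last version known to match the server copy

class Cache:
    """Toy cache manager: user updates append ML records, while
    updates received from the server are recorded in the FSDB."""
    def __init__(self):
        self.ml = []          # modification log (ML)
        self.fsdb = {}        # path -> FSDBEntry

    def local_update(self, opcode, path, version):
        # a user of the file system modified the cache
        self.ml.append(MLRecord(len(self.ml) + 1, opcode, path, version))

    def server_update(self, path, version):
        # the cache manager received an update from the file server
        self.fsdb[path] = FSDBEntry(path, version)

c = Cache()
c.local_update("CREATE", "/spool/msg1", 1)
c.server_update("/spool/msg2", 3)
```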

The cache has internal consistency, but this is the file system
consistency of an Ext2/3 file system.  Similarly the records in the
FSDB may have a compound structure which would require atomic updates
to remain internally consistent.

The recovery problem for InterMezzo can take two forms.  One form will
exploit the journaling infrastructure of Ext3 and the task is to
leverage Ext3 recovery to also keep the ML and FSDB in sync, in
conjunction with the consistency of the cache.

The second form is to extend e2fsck to address the consistency of the
ML and FSDB. The second form was discussed in "The InterMezzo File
System", [Braam, Callahan, Schwan; O'Reilly Perl Conference, 1999],
see: http://www.inter-mezzo.org/docs.   It is very complicated and we
will not discuss it here.


Journal recovery
----------------

Ext3 journaling introduces a transaction-like property for metadata
updates of Ext2 file systems.  The transactions are not fully "ACID"
(atomic, consistent, isolated and durable); they are merely "ACI".
This avoids synchronous disk writes and preserves a consistent image
of the file system, but sacrifices durability: a bounded amount of
recently written data may be lost in a crash.

Stephen Tweedie's journal implementation for Ext3 has the desirable
property that other users of the journaling system can join the Ext3
transactions.   Since InterMezzo wraps around Ext3 file system
operations it is an ideal candidate to exploit this property.  The
following can be achieved:

0. An InterMezzo VFS method is called.  InterMezzo starts a journal
transaction, and perhaps makes updates to the FSDB or Journal.

1. The corresponding Ext3 method is invoked.  Ext3 joins this
transaction; when Ext3 commits, the transaction is not yet committed
on disk but is handed back to InterMezzo for further transactional
updates.

2. InterMezzo commits "ACI" the changes it and Ext3 made.

It is now clear that if Ext3 is used in the mode that both data and
metadata are handled transactionally, one merely needs to implement
the FSDB and ML as Ext3 files to gain the ACI properties for updates.
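The nested-transaction flow of steps 0-2 can be modeled in a few lines. This is a simulation of the join/commit handshake only; the class and function names are invented for illustration and do not correspond to Stephen Tweedie's actual journaling API.

```python
class Journal:
    """Toy journal: updates buffer in an open transaction and become
    durable together, or not at all (the "ACI" property)."""
    def __init__(self):
        self.pending = []     # updates in the currently open transaction
        self.committed = []   # updates that survived the commit

    def start(self):
        self.pending = []

    def update(self, who, what):
        self.pending.append((who, what))

    def commit(self):
        # all-or-nothing: the whole batch commits as one unit
        self.committed.extend(self.pending)
        self.pending = []

def intermezzo_operation(journal, path):
    journal.start()                                       # step 0
    journal.update("intermezzo", "ML append for " + path) # step 0
    journal.update("ext3", "inode update for " + path)    # step 1: Ext3 joins
    journal.update("intermezzo", "FSDB update for " + path)
    journal.commit()                                      # step 2: one ACI commit

j = Journal()
intermezzo_operation(j, "/cache/f")
```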

One issue remains beyond this.  When a server or client
recovers after a crash it may find that it lost some entries it made
in the FSDB or ML.  Replicators of the affected volumes would find
upon contacting or being contacted by the recovered system that the
replicator state was off by a few records.

The server should enquire among all replicators as to who has seen the
latest ML records, to find out if any replicator still has the records
missing on the recovered system.  Most likely the server itself should
become the owner of such records and then reintegrate them to the
recovered client.
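The enquiry can be sketched as follows: the recovered system knows the number of the last ML record it still has, and any replicator holding later records can supply the missing ones. The data shapes here are hypothetical.

```python
def find_missing_records(recovered_last, replicators):
    """Collect ML records with numbers beyond what the recovered
    system retained.  replicators maps a name to the list of
    (recno, record) pairs that replicator has seen."""
    best = {}
    for name, records in replicators.items():
        for recno, record in records:
            if recno > recovered_last:
                best[recno] = record   # candidate for reintegration
    # return the missing records in log order
    return [best[n] for n in sorted(best)]

replicators = {
    "serverA": [(7, "STORE /f"), (8, "UNLINK /g")],
    "clientB": [(7, "STORE /f")],
}
# the recovered system only reached record 6 before the crash
missing = find_missing_records(6, replicators)
```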

It is therefore desirable that systems retain a few already handled ML
records to enhance the chances that records lost might be restored
automatically.  The time during which records settle is a known
constant S and the system could exploit knowledge of that constant to
retain those records spanning a certain window.

If replicator state cannot be recovered automatically, a conflict
should be declared.  The conflict can only affect files whose ML or
FSDB records were lost, and since the buffers on Linux systems are
flushed within X seconds, the affected files are those with
modification or change times later than X seconds before the crash.
This appears to lead to a rather bounded set of candidates for
conflict resolution.
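The bounded scan for candidates might look like this. The value of X would come from the kernel's buffer flush interval; the 30-second default here is our guess, and the file list would in practice come from walking the cache.

```python
def conflict_candidates(files, crash_time, x_seconds=30):
    """Return paths whose modification or change time falls within
    X seconds before the crash.  files: (path, mtime, ctime) triples."""
    cutoff = crash_time - x_seconds
    return [path for path, mtime, ctime in files
            if mtime >= cutoff or ctime >= cutoff]

crash = 1000.0
files = [("/a", 900.0, 900.0),   # settled well before the crash
         ("/b", 990.0, 985.0),   # modified inside the window
         ("/c", 950.0, 995.0)]   # attribute change inside the window
cands = conflict_candidates(files, crash)
```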


File Data Recovery
------------------

For efficiency reasons, Ext3 does not normally journal file data, but
merely metadata.  This means that file data is subject to
unpredictable behaviour.

The roles of servers and clients switch rapidly in InterMezzo and we
will refer to the client as the system receiving a reintegration
request; the system sending the reintegration request will be called
the server.  (This system may be the InterMezzo client or server.)

For example, a client may crash shortly after having fetched a new
version of a file.  If there were no replicators one would have to
accept that the data on the disk was the best possible copy, but in
the presence of replicators it is possible that the server has a more
recent version than the recovered client.

This example covers only half of the problem: the client may also
itself have modified the data or added new data just before a crash.

Ext3 has the property that if file data is overwritten, the metadata
will be written to disk before data updates are written to disk.  This
implies that if files are merely overwritten, the versions in the
metadata cannot be trusted to reflect the currency of the file.

For new data, the converse holds: here the file data is written before
the updates are made to the metadata.  This is how Ext2/3 avoids
exposing, after a crash, random data in a file that previously
belonged to other files.  [This ignores the issue that SCSI disk
drives may reorder writes; when they do, our arguments here are not
valid.]

So the write ordering introduced by Ext3 is:

1. new data
2. new metadata
3. overwritten data

This means that for newly created objects, if the file metadata is up
to date, the data that was newly written is up to date too.  Hence
newly created data, whether generated locally or sent over from a
replicator, can be trusted when its metadata is current.

If a file was partly overwritten by a client modification, the
metadata will have been updated before that happened, and a version
comparison will reveal that the file is newer than copies on
replicators.  One has the option of declaring this data to be the
authoritative copy of the data, which is what Ext2/3 does on a single
system, but one may also revert to the server copy.  The files at risk
of such discrepancies are those that were modified in the period
during which the buffer cache may not have settled on disk.  Any file
newer than X before crash time must be regarded as possibly among
those.

These files may suffer from two problems:

1. there is no "STORE" record in the ML, but the client version is
newer than the server version

2. the client version is equal to the server version, but the data on
the client is not that of the server.

An automatic conflict resolution here is possible, but risky since
again, the single system answer of accepting the client data as the
reference copy may not be acceptable.

Nevertheless, recovery is now reduced to a rather elementary scan of
new files and comparing their versions and content with those on the
server.
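The two problem cases above can be distinguished mechanically during that scan. In this sketch, versions are plain integers and data is compared by checksum; both simplifications are ours, not InterMezzo's.

```python
import hashlib

def classify(client_ver, server_ver, client_data, server_data,
             has_store_record):
    """Classify a scanned file against its server copy into the two
    problem cases described above, or declare it consistent."""
    same_data = (hashlib.sha1(client_data).digest()
                 == hashlib.sha1(server_data).digest())
    if client_ver > server_ver and not has_store_record:
        return "case 1: newer client version without a STORE record"
    if client_ver == server_ver and not same_data:
        return "case 2: equal versions, diverged data"
    return "consistent"

r1 = classify(5, 4, b"new", b"old", has_store_record=False)
r2 = classify(4, 4, b"xxx", b"old", has_store_record=True)
```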

RPC failures
------------

In addition to data recovery, RPCs may have failed, and this can have
side effects in conjunction with recovery.  For example, a
reintegration from server to client may have sent a STORE record,
which was executed on the client, but the reply never reached the
server.  In that case, the server will send that record again, but the
client must take care not to blindly redo the operation, since it
might have overwritten the data.  The conflict detection through
versions will detect these problems.
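A minimal duplicate-suppression sketch for this replay case follows. The text's actual mechanism is version comparison; here we approximate it by tracking the last applied record number, which is our simplification.

```python
class Client:
    """Toy reintegration target: remembers the last record it
    applied, so a resent record is acknowledged without redoing
    the (possibly destructive) write."""
    def __init__(self):
        self.last_applied = 0
        self.writes = 0

    def apply_store(self, recno):
        if recno <= self.last_applied:
            # already executed; the earlier reply was simply lost
            return "ack (duplicate, not re-executed)"
        self.writes += 1          # perform the write once
        self.last_applied = recno
        return "ack"

c = Client()
first = c.apply_store(1)   # executed, but imagine the reply is lost
dup = c.apply_store(1)     # server resends the same record
```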

The fsync/flock/lock system calls and recovery
----------------------------------------------

For InterMezzo to be really useful it should support the fsync method
well.

A typical use is that sendmail writes mail into an incoming spool
directory, then contacts an MDA such as procmail, which writes the
email to disk.  Our task is to extend this to the case where the
spool directory is replicated.

The requirement is that if the host fiddling with the spool file
crashes, another host can take over preserving a consistent view of
the mail spool.

The precise mechanisms involved here will dictate the path to follow.
For example, if file syncing is used, we will require that replicators
have the synced copy of the spool file before the fsync returns
success.  This means that for fsync, contrary to all other updates,
InterMezzo will engage in write-ahead update propagation to servers or
relevant replicators.  Locking might mean that a close on a file that
was locked is pushed out synchronously.
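The write-ahead behaviour of fsync can be sketched as below. The replicator interface is invented for illustration; the point is only the ordering: propagation completes before fsync reports success.

```python
class Replicator:
    """Stand-in for a remote replicator of the spool volume."""
    def __init__(self):
        self.synced = {}

    def push(self, path, data):
        self.synced[path] = data
        return True

def intermezzo_fsync(path, data, replicators):
    """Write-ahead propagation: the data must reach all relevant
    replicators before fsync may return success."""
    for r in replicators:
        if not r.push(path, data):
            return False   # durability elsewhere not guaranteed
    return True            # only now does fsync succeed

reps = [Replicator(), Replicator()]
ok = intermezzo_fsync("/var/spool/mail/u", b"From: a@b\n", reps)
```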


These principles will form the basis for InterMezzo recovery.
