'Re: [darcs-devel] Improving pull performance'

List:       darcs-devel
Subject:    Re: [darcs-devel] Improving pull performance
From:       David Roundy <droundy () darcs ! net>
Date:       2005-07-30 12:26:57
Message-ID: 20050730122650.GF32188 () abridgegame ! org

On Fri, Jul 29, 2005 at 04:21:08PM +0200, Florian Weimer wrote:
> Has anybody thought about improving pull performance?
> 
> I think it might be useful to add a cache for the remote
> _darcs/inventories/* and _darcs/inventory files, and use zsync to make
> downloads of _darcs/inventory incremental.

Are you thinking about optimizing the "no changes" case?

It seems like there are a few issues here, and I'd rather not address an
optimization at the transport level if there are logical-level
optimizations that could make those redundant.  I.e. rather than caching to
avoid transport, I'd like to avoid downloading any data we don't need.  I
don't see any reason why we should need zsyncish optimizations for fetching
the inventory, unless perhaps the inventory is very large because there
aren't any tags.  And as long as the inventory is small, latency will
dominate, and a simple download should beat zsync in speed.
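
As a rough illustration of the latency argument, here is a back-of-envelope
model in Python. Everything in it is an assumption made for illustration: the
function names, the idea that a zsync-style fetch costs two round trips (one
for the checksum file, one for the changed blocks), and any sizes or link
speeds you plug in. It is not a measurement of darcs or zsync.

```python
def simple_download_time(size_bytes, rtt_s, bandwidth_bps):
    """One request: one round trip plus the time to transfer the whole file."""
    return rtt_s + size_bytes * 8 / bandwidth_bps


def delta_download_time(size_bytes, changed_fraction, checksum_bytes,
                        rtt_s, bandwidth_bps):
    """Assumed zsync-like fetch: get the checksum file, then only changed blocks.

    Modelled as two round trips; a real client may need more.
    """
    transferred = checksum_bytes + size_bytes * changed_fraction
    return 2 * rtt_s + transferred * 8 / bandwidth_bps


if __name__ == "__main__":
    # Hypothetical link: 200 ms round trip, 1 Mbit/s.
    rtt, bw = 0.2, 1_000_000
    # Small inventory (4 KB): the extra round trip outweighs the saved bytes.
    print(simple_download_time(4096, rtt, bw),
          delta_download_time(4096, 0.10, 400, rtt, bw))
    # Large inventory (2 MB): the saved bytes outweigh the extra round trip.
    print(simple_download_time(2_000_000, rtt, bw),
          delta_download_time(2_000_000, 0.01, 40_000, rtt, bw))
```

Under these assumed numbers the plain download wins for the small file and
the delta fetch wins only once the inventory is large.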

Tagging regularly and optimizing can reduce the size of _darcs/inventory,
which helps as long as you don't need to delve into _darcs/inventories/.
When darcs tag is run, the inventory is automatically split, but when one
pushes a tag this doesn't happen.  Perhaps a flag to make apply
automatically optimize when it applies a tag would be helpful.

In many cases, optimize --reorder can help prevent the need to delve into
_darcs/inventories/.  Perhaps we could consider an option to pull which
would reorder to match the remote repository--this helps not just with
transport, but also with the amount of commutation needed to perform the
pull.

We also may be able to create improved versions of the get_common_and_uncommon
and related algorithms which would be more asymmetric, in that they'd try
to access less of the remote repository and more of the local one.  This
could be tricky (since those algorithms are tricky), but would be an
improvement that could relatively easily benefit several commands, and
could downright eliminate the need to look at any _darcs/inventories/ files
that correspond to tags that we have in our local repository.
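
A minimal sketch of the asymmetric idea, in Python rather than darcs' own
Haskell, and not the actual get_common_and_uncommon algorithm. It assumes the
remote inventory can be read as sections, newest first, each section holding
the tag that starts it plus the patches recorded after that tag, and that
everything older than a tag is covered by that tag. Patches are reduced to
plain names, commutation is ignored, and remote_only_patches and
fetch_remote_section are hypothetical names.

```python
def remote_only_patches(local_patches, fetch_remote_section):
    """Return (missing, fetched): remote patch names we lack, and how many
    remote sections had to be downloaded to find that out.

    fetch_remote_section(i) returns (tag, patches) for the i-th newest remote
    section, or None when there are no more. tag is None for the oldest,
    untagged section.
    """
    missing = []
    fetched = 0
    i = 0
    while True:
        section = fetch_remote_section(i)
        if section is None:
            break
        fetched += 1
        tag, patches = section
        missing.extend(p for p in patches if p not in local_patches)
        if tag is not None:
            if tag in local_patches:
                # We already have this tag, hence everything it covers:
                # no need to look at any older remote section.
                break
            missing.append(tag)
        i += 1
    return missing, fetched


if __name__ == "__main__":
    remote = [("tag-2", ["p5", "p6"]), ("tag-1", ["p3", "p4"]), (None, ["p1", "p2"])]
    fetch = lambda i: remote[i] if i < len(remote) else None
    local = {"p1", "p2", "tag-1", "p3", "p4", "tag-2", "p5"}
    print(remote_only_patches(local, fetch))
```

The point of the sketch is only that the local repository is consulted
freely (set lookups) while the remote one is read one section at a time and
abandoned at the first tag we already hold.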

> zsync is available here: <http://zsync.moria.org.uk/>
> It doesn't need any special server support, it's all client-side.

Well, it does require that .zsync files be built (or rather stored) on the
server, containing checksum information.  Darcs would have to be in charge
of updating the .zsync files... but there's no reason that should be a
problem.

A related idea would be to leverage the proposed hashed inventories (which
would of course have to be implemented) to (optionally) store a cache of
all _darcs/inventories/* and patches in a centralized location.  With
hashed inventories (also useful for signed repositories) we'd then be able
to avoid ever downloading the same patch or inventory file twice.  This
wouldn't handle the large _darcs/inventory issue, but would essentially
eliminate the cost of downloading _darcs/inventories/ (since those files
very rarely change).
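
To make the caching idea concrete, here is a small Python sketch of a
hash-keyed cache. It is only an illustration under assumptions: hashed
inventories do not exist yet, so the choice of SHA-256, the flat cache
directory layout, and the names fetch_cached and download are all
hypothetical.

```python
import hashlib
import os


def fetch_cached(file_hash, cache_dir, download):
    """Return the content named by file_hash, downloading only on a cache miss.

    download(file_hash) is a caller-supplied function that fetches the bytes
    from the remote repository.
    """
    path = os.path.join(cache_dir, file_hash)
    if os.path.exists(path):
        with open(path, "rb") as f:
            return f.read()

    data = download(file_hash)
    # The name is the hash of the content, so a bad download is detectable.
    if hashlib.sha256(data).hexdigest() != file_hash:
        raise ValueError("downloaded content does not match its hash")

    os.makedirs(cache_dir, exist_ok=True)
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        f.write(data)
    os.replace(tmp, path)  # atomic rename, so readers never see a partial file
    return data
```

Because a file's name is determined by its content, a cached entry never
goes stale, which is why one central cache could safely be shared between
several local repositories.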

(And I keep hoping that someone will implement the hashed inventories
idea... I'd help with design, critiquing and especially with the RepoFormat
code to enable forwards and backwards compatibility.)

Back to the subject of pull, if you have a particular "common case"
scenario that you think could use improvement, I think it'd be helpful to
discuss that case in detail before deciding what would be the best way to
improve darcs' performance.
-- 
David Roundy
http://www.darcs.net

_______________________________________________
darcs-devel mailing list
darcs-devel@darcs.net
http://www.abridgegame.org/cgi-bin/mailman/listinfo/darcs-devel