[prev in list] [next in list] [prev in thread] [next in thread] 

List:       dcms-dev
Subject:    merge triple identification
From:       Jonathan S. Shapiro shap () eros-os ! org
Date:       2000-06-28 13:43:40
[Download RAW message or body]

Okay. I've been thinking about how PRCS uses the entity identifier for
merge, and I've come to two conclusions:

1. It's a good idea.
2. I'm going to have to use a variation.

Josh: since you've been down this rathole already, could you check my
assumptions below? In particular, are there other places where family ID's
are essential that I may not yet have noticed?

Background:

The entity identifier, which I think of as the "family identifier" is common
to all versions of a particular object. The general idea is to have a name
for the set of all versions of "foo.c".  I originally avoided a family
identifier because I wanted to be able to do join/split, but I subsequently
concluded that join/split was a bad idea.

Onwards:

The main problem with family identifiers is that it is not clear how to
generate them in a globally unique way. In the current PRCS design (as I
understand it) these numbers are locally unique on a given server, but two
servers may assign family identifiers to the same content in different ways.
I haven't thought this out fully, but I think that given the way DCMS does
replicate on demand you can get into trouble. That is, given an object with
five versions

    o.1 o.2 o.3 o.4 o.5

It's possible that replication will have caused o.1 and o.2 to have been
replicated generating one family identifier, o.4 o.5 replicated later to
generate a different one, and since o.4 isn't present the connection won't
get made between the two chains.

One strategy I thought about for generating universally unique family
identifiers was to simply use the SHA-1 of the initial checkin, but somebody
will inevitably check in a bunch of files that contain identical copyright
messages (and nothing else) and thereby generate the same family identifier
for independent objects.

Assumptions:

In practice, however, it appears to me that the family identifier doesn't
need to be universally unique. When you traverse an object's history you use
the predecessor chain, not the family identifier, so it's not used there.
The main purpose of the family identifier appears to be narrowing the search
space during the merge algorithm. In fact, I don't really see any other
purpose for them.

Proposal:

If this is true, then it is sufficient to have a good enough random number.
I think I'ld probably concatenate this with the SHA-1 hash of the first
object that is a member of the family, and use the resulting number as the
family identifier. This is NOT a globally unique identifier. In fact, I
hereby christen this number the "NOID" (Not an Object IDentifier).

When performing a merge, you proceed as follows:

First sort the objects in the current and merge sets according to their
noids. Some will be singletons (i.e. they appear in one set but not the
other). Move the singletons to a separate singleton list for later
processing.

Now begin the real work by considering the objects that are left. Partition
these by noid. There should be a large number of partitions (roughly equal
to the number of distinct object pairs in the merge), each consisting of a
small number of objects (usually two). For each partition, apply the
expensive exhaustive search described in my original mail until either (1)
no further objects remain or (2) the algorithm has exhausted the search
within the partition. In the latter case, stick all remaining objects in the
singleton list.

Using the noid has the effect of greatly reducing the complexity order of
the search. The algorithm itself doesn't get any better, but instead of
needing to do an all-pairs comparison, you can almost always use a per-pair
comparison. Further, in the bad cases you still have reduced the "all" to
"all items that collide on a noid".

If no common ancestor is found, we presume that this was an accidental
collision and move the two objects into the respective "singleton" lists --
i.e. into the lists of objects that appeared in one configuration but not
the other.

Unless I missed something obvious, this ought to reduce the order of the
algorithm about as far as is possible.

Oh yes. The real reason to call it a noid is that when two many objects need
to be considered at once I get annoid.

Sorry. Bad nomenclature puns are required by Bell Labs, EROS Group, and
Xanadu traditions.


shap

[prev in list] [next in list] [prev in thread] [next in thread] 

Configure | About | News | Add a list | Sponsored by KoreLogic