
List:       git
Subject:    Re: hosting git on a nfs
From:       Linus Torvalds <torvalds () linux-foundation ! org>
Date:       2008-11-14 5:01:20
Message-ID: alpine.LFD.2.00.0811132044460.3468 () nehalem ! linux-foundation ! org



On Thu, 13 Nov 2008, James Pickens wrote:
> 
> I wonder if there are other completely different parts of git that could
> benefit from multi threading when the work tree is on nfs?

I'm sure there are. That said, threading things is usually really quite 
painful. The only reason this preloading was easy to do was that we really 
had all the data structures laid out beautifully for this, and I had spent 
a lot of effort earlier on a whole series of "avoid duplicate lstat()" 
changes, which gave us that whole ce_uptodate() thing, and all normal 
cases already taking advantage of it, and the "uptodate" bit being 
percolated along all the paths.

If it hadn't been for that, it would have been much nastier to do.

As it was, there was literally just a simple little extra phase to fill 
in, in parallel, all the data structures that we already had set up.
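
To make that extra phase concrete, here is a minimal sketch of the idea, not 
git's actual preload code. The names demo_entry, demo_preload and 
DEMO_UPTODATE are hypothetical, and comparing only the file size is a 
simplification (git compares more stat fields than this). Each thread 
lstat()s its own slice of the entries and sets the "uptodate" bit on a match, 
so later phases can skip the lstat().

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <stddef.h>
#include <sys/stat.h>

#define DEMO_UPTODATE 0x1u  /* hypothetical; stands in for CE_UPTODATE */
#define DEMO_THREADS 4

struct demo_entry {
	const char *path;
	off_t size;             /* size recorded in the (pretend) index */
	unsigned int flags;
};

struct demo_slice {
	struct demo_entry *entries;
	size_t count;
};

/* Thread body: lstat() every entry in one slice. No locking is needed
 * because no two slices share an entry. */
static void *demo_preload_slice(void *arg)
{
	struct demo_slice *s = arg;
	size_t i;

	for (i = 0; i < s->count; i++) {
		struct demo_entry *e = &s->entries[i];
		struct stat st;

		if (lstat(e->path, &st) == 0 && st.st_size == e->size)
			e->flags |= DEMO_UPTODATE;
	}
	return NULL;
}

/* Split the entries into contiguous slices, one per thread. If a thread
 * cannot be created, that slice is handled in the calling thread instead.
 * Returns the number of slices processed. */
static int demo_preload(struct demo_entry *entries, size_t n)
{
	pthread_t tid[DEMO_THREADS];
	struct demo_slice slice[DEMO_THREADS];
	int created[DEMO_THREADS];
	size_t per = (n + DEMO_THREADS - 1) / DEMO_THREADS;
	int nslices = 0, t;

	for (t = 0; t < DEMO_THREADS; t++) {
		size_t start = (size_t)t * per;

		if (start >= n)
			break;
		slice[t].entries = entries + start;
		slice[t].count = (n - start < per) ? n - start : per;
		created[t] = !pthread_create(&tid[t], NULL,
					     demo_preload_slice, &slice[t]);
		if (!created[t])
			demo_preload_slice(&slice[t]);
		nslices++;
	}
	for (t = 0; t < nslices; t++)
		if (created[t])
			pthread_join(tid[t], NULL);
	return nslices;
}
```

The sketch only ever sets a bit, so a failed lstat() costs nothing: the entry 
simply stays unmarked and gets checked the slow way later.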

> I'm thinking specifically of 'git checkout', since while testing this 
> patch I happened to do a 'git pull' that resulted in several thousand 
> new files being created, and the "Checking out files" part took 
> *forever* to run.

Now, the good news is that the actual work-tree part of checking things 
out is probably pretty amenable to the same kind of parallelization, for 
largely the same reasons: the whole checking out thing is already done in 
multiple phases with all error handling done before-hand. So we will have 
built up all our data structures earlier, and set the CE_UPDATE bit, and 
then there's just a final "push it all out" phase.

So CE_UPTODATE and CE_UPDATE are really very similar in that sense - 
except at opposite ends of the pipeline. The CE_UPTODATE bit marks a name 
entry as matching the filesystem data (and allows all later phases to 
avoid doing the expensive lstat()s), while the CE_UPDATE (and CE_REMOVE) 
bits allow us to do all our complex work in-memory without committing it 
to disk, and then we push it out in one go.

So if you want to multi-thread checkout, you literally need to just thread 
the last for-loop in unpack-trees.c:check_updates() (the CE_UPDATE loop 
that does "checkout_entry()" over the whole index). 
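
As a rough illustration of what threading that final loop could look like, 
here is a self-contained sketch. It is not the code in unpack-trees.c: 
demo_co_entry, demo_checkout_parallel and DEMO_UPDATE are hypothetical names, 
the file contents are assumed to be already prepared in memory, and every 
path is assumed to be distinct with its leading directory already present. 
Each worker walks the entries with a fixed stride and writes out the ones 
flagged for update.

```c
#include <pthread.h>
#include <stddef.h>
#include <stdio.h>

#define DEMO_UPDATE 0x2u    /* hypothetical; stands in for CE_UPDATE */
#define DEMO_WORKERS 4

struct demo_co_entry {
	const char *path;       /* file to write */
	const char *data;       /* contents, already prepared in memory */
	unsigned int flags;
	int error;              /* set by the worker if the write fails */
};

struct demo_co_work {
	struct demo_co_entry *entries;
	size_t count;
	size_t first;           /* first index handled by this worker */
	size_t stride;          /* distance to this worker's next index */
};

/* Worker body: the "push it all out" loop, restricted to one stride.
 * Each entry is visited by exactly one worker, so no locking is needed. */
static void *demo_co_worker(void *arg)
{
	struct demo_co_work *w = arg;
	size_t i;

	for (i = w->first; i < w->count; i += w->stride) {
		struct demo_co_entry *e = &w->entries[i];
		FILE *f;

		if (!(e->flags & DEMO_UPDATE))
			continue;
		f = fopen(e->path, "w");
		if (!f) {
			e->error = 1;
			continue;
		}
		if (fputs(e->data, f) == EOF)
			e->error = 1;
		if (fclose(f) == EOF)
			e->error = 1;
		if (!e->error)
			e->flags &= ~DEMO_UPDATE;
	}
	return NULL;
}

/* Write out all flagged entries; returns how many of them failed. */
static size_t demo_checkout_parallel(struct demo_co_entry *entries, size_t n)
{
	pthread_t tid[DEMO_WORKERS];
	struct demo_co_work work[DEMO_WORKERS];
	size_t failed = 0, i;
	int started = 0, t;

	for (t = 0; t < DEMO_WORKERS; t++)
		work[t] = (struct demo_co_work){ entries, n, (size_t)t,
						 DEMO_WORKERS };
	for (t = 0; t < DEMO_WORKERS; t++) {
		if (pthread_create(&tid[t], NULL, demo_co_worker, &work[t]))
			break;
		started++;
	}
	/* Any stride without a thread is done serially right here. */
	for (t = started; t < DEMO_WORKERS; t++)
		demo_co_worker(&work[t]);
	for (t = 0; t < started; t++)
		pthread_join(tid[t], NULL);
	for (i = 0; i < n; i++)
		failed += entries[i].error != 0;
	return failed;
}
```

This only works as simply as it does because all the decisions were made 
before the loop starts; the workers never need to look at each other's 
entries. A real version would also have to deal with creating directories 
from several threads at once, which the sketch sidesteps.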

> And FWIW, I timed 50 iterations of 'git diff', and the average runtime
> dropped from 11.7s to 2.8s after this patch.  A nice improvement.

Very impressive. That said, I suspect you get a "superlinear" improvement 
because once it gets faster, the kernel cache also works better, since you 
can do more loops without having the NFS attributes time out.

Whether that kind of effect happens much in actual practice is debatable, 
although it's quite possible that it will work the same way in some 
scripting scenarios.

			Linus