'Patch fuzz factor design ruminations.'


List:       busybox
Subject:    Patch fuzz factor design ruminations.
From:       Rob Landley <rob () landley ! net>
Date:       2010-09-26 23:38:43
Message-ID: 201009261838.43659.rob () landley ! net

So once again, a patch didn't apply because of fuzz factor, and I'm getting 
tired of it.  Implementing fuzz support isn't technically hard, but getting 
the design right _is_, so I thought I'd publicly mull it over while working 
out what to do.

Fuzz factor is different from a patch offset.  Applying patches at offsets is 
normal, and in fact the patch algorithm I implemented ignores the suggested 
location entirely.  It actually operates in a streaming manner, like a really 
weird form of sed.  It just finds the first place to apply the patch, and if a 
hunk doesn't apply it hits the end of the file and bails out there.  This is a 
reasonably simple and low-memory way of doing it, which works because the 
pattern to apply has (generally three) leading context lines which have to 
match, (generally three) trailing context lines which have to match, and the 
lines removed by the patch also have to match.  This is a fairly reliable 
identifier of what needs to change: there's enough information to find the 
correct location even without the offset.
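
As a rough illustration of that matching rule, here's a Python sketch.  It is 
not the busybox code (which is C and streams the file); it holds the file in a 
list for brevity.  The hunk representation, a list of (tag, text) pairs with 
tag ' ', '-' or '+', and both function names are assumptions made up for the 
example.

```python
def find_first_match(file_lines, hunk):
    """Return the 0-based index of the first place where the hunk's context
    and removed lines all match, or None.  The line number from the hunk
    header is never consulted."""
    # Lines that must already be present: context (' ') and removed ('-').
    want = [text for tag, text in hunk if tag in " -"]
    for start in range(len(file_lines) - len(want) + 1):
        if file_lines[start:start + len(want)] == want:
            return start
    return None


def apply_hunk(file_lines, hunk):
    """Apply the hunk at the first matching location, or raise ValueError."""
    start = find_first_match(file_lines, hunk)
    if start is None:
        raise ValueError("hunk does not apply")
    old_len = sum(1 for tag, _ in hunk if tag in " -")
    # The replacement is the context plus the added ('+') lines.
    new = [text for tag, text in hunk if tag in " +"]
    return file_lines[:start] + new + file_lines[start + old_len:]
```

With three context lines on each side plus any removed lines, the first match 
is usually the only match, which is why ignoring the offset works.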

Fuzz factor, on the other hand, says "not all of those lines are going to 
match".  My understanding is that a fuzz factor of 1 says strip two context  
lines (one from the beginning, one from the end).  A fuzz factor of 2 says 
strip four context lines (the first two and the last two).
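
Under that reading of fuzz (which is only my understanding, not a statement 
about what GNU patch does internally), the trimming could be sketched like 
this in Python.  The (tag, text) hunk representation and the name 
trim_context are made up for the example.

```python
def trim_context(hunk, fuzz):
    """Drop up to `fuzz` context lines from each end of a hunk.

    `hunk` is a list of (tag, text) pairs, tag being ' ', '-' or '+'.
    Only context lines are ever dropped; changed lines stay put."""
    lines = list(hunk)
    for _ in range(fuzz):
        if lines and lines[0][0] == " ":
            lines.pop(0)
        if lines and lines[-1][0] == " ":
            lines.pop()
    return lines


hunk = [(" ", "a"), (" ", "b"), (" ", "c"), ("+", "new"),
        (" ", "d"), (" ", "e"), (" ", "f")]
print(trim_context(hunk, 1))  # five lines left: b, c, +new, d, e
print(trim_context(hunk, 2))  # three lines left: c, +new, d
```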

This leads to the problem of hunks like this:

@@ blah blah blah
context
context
 	{
+ insert
+ insert
+ insert
 
context
context

With a fuzz factor of 2, and no deleted lines to match, that can insert 
anywhere that has a curly bracket at the right indentation level followed by a 
blank line.  This causes SUBTLE BUGS which the linux-kernel guys complain 
about from time to time.  Guys like Andrew Morton, Al Viro, and Dave Jones 
have all complained about GNU patch's default fuzz factor fallbacks on a 
conceptual level: it makes a heroic attempt to apply stale patches and winds 
up mis-applying them rather than breaking and forcing people to fix up a 
version-skewed patch.
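
To show how little is left to match, here's a small Python sketch.  The 
sample file contents and the helper name match_positions are invented for the 
example; only the shape of the hunk comes from the one above.

```python
def match_positions(file_lines, want):
    """Every 0-based index where the list `want` appears in file_lines."""
    n = len(want)
    return [i for i in range(len(file_lines) - n + 1)
            if file_lines[i:i + n] == want]


# Hypothetical file: a brace followed by a blank line appears twice.
source = [
    "int first(void)",
    "\t{",
    "",
    "\treturn 1;",
    "\t}",
    "context",
    "context",
    "\t{",
    "",
    "context",
    "context",
]

# The lines the insertion-only hunk needs to find in the file.
full = ["context", "context", "\t{", "", "context", "context"]
fuzzed = full[2:-2]  # fuzz 2: two lines dropped from each end

print(match_positions(source, full))    # [5]
print(match_positions(source, fuzzed))  # [1, 7]
```

With full context there is one candidate.  With fuzz 2 there are two, and a 
first-match strategy would pick the wrong one (index 1) without complaint.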

However, I'm playing around with automating the Linux From Scratch 6.6 build 
on top of my Aboriginal Linux project, and the patches they're applying to the 
packages in the lfs-6.6-source tarball have a fuzz factor of 2, ala:

Applying /home/landley/aboriginal/aboriginal/build/host-temp/lfs-bootstrap/lfs/coreutils-8.4-uname-1.patch
patching file src/uname.c
Hunk #2 succeeded at 314 with fuzz 2.
Hunk #3 succeeded at 441 with fuzz 1.
Hunk #4 succeeded at 449 with fuzz 2.

(Nope, not the only example: diffutils-2.8.1-i18n-1.patch patches src/diff.h 
with fuzz 2, and so on.  Apparently, if patch doesn't refuse to apply it, they 
see no need to upgrade it.)

Fuzz factor is just trimming context lines.  I can do that.  But under what 
circumstances should I?  (Especially since I'm _ignoring_ the offset 
information, which can only make the mis-applied fuzz thing worse.)

Right now, the pathological case for applying a patch is 6 lines of context: 3 
leading, 3 trailing, and all insertions with no deletions.  That's fairly 
reliable.  If I count lines deleted in the body of the patch as additional 
information, then I can auto-set a fuzz factor based on still needing to match 
at least 6 lines to be happy that I'm applying the hunk at a good place.
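
A rough Python sketch of that rule, assuming fuzz always strips one line from 
each end per level.  The (tag, text) hunk representation, the name 
max_safe_fuzz and the MIN_MATCH constant are made up for the example.

```python
MIN_MATCH = 6  # assumed threshold: 3 leading + 3 trailing context lines


def max_safe_fuzz(hunk, min_match=MIN_MATCH):
    """Largest fuzz that still leaves at least `min_match` lines to match.

    `hunk` is a list of (tag, text) pairs, tag being ' ', '-' or '+'.
    Context and removed lines both count as lines that must match."""
    leading = 0
    for tag, _ in hunk:
        if tag != " ":
            break
        leading += 1
    trailing = 0
    for tag, _ in reversed(hunk):
        if tag != " ":
            break
        trailing += 1
    must_match = sum(1 for tag, _ in hunk if tag in " -")

    fuzz = 0
    # Each fuzz level costs two matchable lines (one from each end).
    while (fuzz < min(leading, trailing)
           and must_match - 2 * (fuzz + 1) >= min_match):
        fuzz += 1
    return fuzz
```

So a hunk with the usual 3+3 context and two deleted lines could tolerate 
fuzz 1, one with four deleted lines fuzz 2, and an insertion-only hunk gets no 
fuzz at all.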

However, that won't help this hunk, which _is_ the "Hunk #2 succeeded at 314 
with fuzz 2" example above, and in fact the first one busybox complained about 
not being able to apply:

@@ -308,6 +314,96 @@
 	if (0 <= sysinfo (SI_ARCHITECTURE, processor, sizeof processor))
 	  element = processor;
       }
+#else
+      {
BLAH BLAH BLAH nothing but insertion for many lines
+#endif
+      }
 #endif
 #ifdef UNAME_PROCESSOR
       if (element == unknown)

The entire hunk is one big insertion, with no deletion to add additional 
context.  With fuzz 2 (stripping 2 context lines from the beginning, and 2 
from the end), the remaining context is a curly bracket and an #endif.  Yeah, 
not likely to find _those_ together at some random place in a C file.  As far as 
I can tell, this patch is only still applying correctly by sheer coincidence.

And it's EXACTLY the kind of thing that may produce code that still compiles, 
but doesn't do what the author intended, and no human's looked at it since it 
changed so they won't notice until they hit that bug, and will then be stumped 
because _they_ didn't move it so won't _see_ that something else did...

So once again, I know how to implement fuzz factor, I can probably even come 
up with sane rules for when fuzz factor can be applied safely... and it won't 
fix the problem in front of me.

Anybody have an opinion?  Because I'm stumped.  The code is _aware_ of current 
offset (it's calculating the line count in case it needs to display it), maybe 
I can work current offset into the fuzz factor calculations.  But it's still 
not going to be _reliable_...
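
One way the offset might be worked in, sketched in Python: collect every place 
the (possibly trimmed) lines match and take the one nearest the line number in 
the hunk header.  This is an assumption about a possible design, not what the 
current code does, and the name nearest_match is made up.  Note that it needs 
to see all candidates, so it gives up the simple streaming behavior.

```python
def nearest_match(file_lines, want, expected_line):
    """Among all places the list `want` matches, return the 1-based line
    number closest to `expected_line` (the 1-based line from the hunk
    header), or None if nothing matches.  Ties go to the earlier match."""
    n = len(want)
    candidates = [i + 1 for i in range(len(file_lines) - n + 1)
                  if file_lines[i:i + n] == want]
    if not candidates:
        return None
    return min(candidates, key=lambda line: abs(line - expected_line))
```

Even then, "nearest" is only a tie-breaker: if the file has drifted far 
enough, the nearest brace-and-#endif pair can still be the wrong one.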

Sigh.

Rob
-- 
GPLv3: as worthy a successor as The Phantom Menace, as timely as Duke Nukem 
Forever, and as welcome as New Coke.