'Re: bunzip2 fails to decompress pbzip2-compressed files'

[prev in list] [next in list] [prev in thread] [next in thread] 

List:       busybox
Subject:    Re: bunzip2 fails to decompress pbzip2-compressed files
From:       Rob Landley <rob () landley ! net>
Date:       2010-10-31 22:17:34
Message-ID: 201010311717.34640.rob () landley ! net
[Download RAW message or body]

On Thursday 28 October 2010 16:10:35 Denys Vlasenko wrote:
> On Friday 22 October 2010 20:07, Dan Fandrich wrote:
> > pbzip2 is a parallel bzip2 compressor that uses multiple threads while
> > compressing to linearly speed bzip2 compression by the number of cores
> > available (see http://compression.ca/pbzip2/). The files it produces
> > are compatible with traditional bzip2, but have slightly different output
> > because of the way the independently-compressed blocks are concatenated.
> >
> > Unfortunately, this seems to prevent Busybox's bunzip2 from decompressing
> > them fully. Busybox decompresses the first block, then silently stops
> > without indicating any sort of error.  It can be reproduced like this:
> >
> > $ cp /lib/libc-2.9.so bigfile   #need a file larger than 900K
> > $ pbzip2 -9 bigfile
> > $ bzcat bigfile.bz2 | wc -c
> > 1331404
> > $ busybox bzcat bigfile.bz2 | wc -c
> > 900000
> > $ busybox bunzip2 bigfile.bz2
> > $ echo $?
> > 0
> > $ stat -c %s bigfile
> > 900000

So pbzip2 is producing nonstandard files that work by coincidence on the old 
implementation, and we need to change to accomodate this.  Ok.

> Please try attached patch

>diff -ad -urpN busybox.6/archival/libunarchive/decompress_bunzip2.c
> busybox.7/archival/libunarchive/decompress_bunzip2.c ---
> busybox.6/archival/libunarchive/decompress_bunzip2.c	2010-10-28
> 19:04:00.000000000 +0200 +++
> busybox.7/archival/libunarchive/decompress_bunzip2.c	2010-10-28
> 23:06:53.000000000 +0200 @@ -44,7 +44,7 @@
> #define RETVAL_LAST_BLOCK               (-1)
> #define RETVAL_NOT_BZIP_DATA            (-2)
> #define RETVAL_UNEXPECTED_INPUT_EOF     (-3)
>-#define RETVAL_SHORT_WRITE              (-4)
>+//#define RETVAL_SHORT_WRITE              (-4)

Are you saying that if the disk fills up or similar, not being able to detect 
this is an improvement?  Or are you saying the wrapper functions will die() 
for us, which they didn't used to do?

If you're going to remove this, why not _remove_ it?  Why accumulate commented 
out code?

> #define RETVAL_DATA_ERROR               (-5)
> #define RETVAL_OUT_OF_MEMORY            (-6)
> #define RETVAL_OBSOLETE_INPUT           (-7)
>@@ -584,8 +584,8 @@ int FAST_FUNC read_bunzip(bunzip_data *b
> /* Because bunzip2 is used for help text unpacking, and because
> bb_show_usage() should work for NOFORK applets too, we must be extremely
> careful to not leak any allocations! */
>-int FAST_FUNC start_bunzip(bunzip_data **bdp, int in_fd, const unsigned
> char *inbuf, -						int len)
>+int FAST_FUNC start_bunzip(bunzip_data **bdp, int in_fd,
>+		const void *inbuf, int len)

inbuf is an unsigned char array.  Honest and truly it is.  Specifically, if you 
don't say "unsigned" you introduce bugs on different platforms based on whether 
or not "char" defaults to unsigned, which is why I was careful to get it right 
and be consistent about it.

What's all this FAST_FUNC nonsense anyway?  If the compiler can't optimize, fix 
the compiler.  This is DEEPLY NOT OUR PROBLEM.

> 	/* Init the CRC32 table (big endian) */
> 	crc32_filltable(bd->crc32Table, 1);
>@@ -652,37 +654,59 @@ IF_DESKTOP(long long) int FAST_FUNC
> unpack_bz2_stream(int src_fd, int dst_fd)
> {
> 	IF_DESKTOP(long long total_written = 0;)
>+		bunzip_data *bd;
> 	char *outbuf;

Why does the indentation change here?

>-	bunzip_data *bd;
> 	int i;
>+	unsigned len;
>
> 	outbuf = xmalloc(IOBUF_SIZE);
>-	i = start_bunzip(&bd, src_fd, NULL, 0);
>-	if (!i) {
>-		for (;;) {
>-			i = read_bunzip(bd, outbuf, IOBUF_SIZE);
>-			if (i <= 0) break;
>-			if (i != full_write(dst_fd, outbuf, i)) {
>-				i = RETVAL_SHORT_WRITE;
>-				break;
>+	len = 0;
>+	while (1) { /* "Process one BZ... stream" loop */

I still viscerally cringe when I see while (1) instead of for(;;).  I'm aware 
that modern optimizers take it out, but when there is a way to state exactly 
what you want the code to do and you choose to instead say something you 
_don't_ want the code to actually look like, I don't understand why.  Oh well.

>+		i = start_bunzip(&bd, src_fd, outbuf + 2, len);
>+
>+		if (i == 0) {
>+			while (1) { /* "Produce some output bytes" loop */
>+				i = read_bunzip(bd, outbuf, IOBUF_SIZE);
>+				if (i <= 0)
>+					break;
>+				if (i != full_write(dst_fd, outbuf, i)) {
>+					bb_error_msg("short write");
>+					goto release_mem;
>+				}
>+				IF_DESKTOP(total_written += i;)

If you say:

  if (DESKTOP) total_written += i;

The result will always be syntax checked.  (Local variables that are never 
used get optimized out.)

IF_DESKTOP(xxx) is more concise way of doing an #ifdef.  I originally came up 
with it to chop out varargs in functions without needing multi-line #ifdef 
blocks for each one.  As with all #ifdefs, it introduces the possibility that 
different configurations will build break.  It's also yet another non-obvious 
non-C construct that we have to EXPLAIN to people.  Not explaining why doing 
it that way is smaller or faster or more portable, but explaining what the 
heck it does in the first place and how to touch the code without breaking the 
increasingly brittle build.

Why on earth are there archival/bbunzip_test*.sh files outside of the test 
suite directory?

Why are there half a dozen archival/lzo*.c files in a directory that used to 
just have things like "rpm.c", "tar.c" with all the actual compression code 
living in libunarchive?

When did clean code stop mattering to this project?

I'm sorry, I can't look at this anymore.  It's not fun.  I'm going to go write 
a shell script version of zapchroot rather than trying to make busybox umount 
-a do it, and just stop bothering you all.  Go have fun writing fractal code 
nobody else can understand.  Micro-optimize until it shatters.  Outdo the FSF.

I was in it for the simple, and there is no more simple here.  It's not a 
question of doing cleanup, it's a question of the project's goals clearly no 
longer valuing "simple" or "clean" even slightly.  The code gets uglier every 
release.  As far as I can tell the only reason you're still writing in C is 
because hand-coded assembly isn't portable.

I can barely SEE the code buried under all this extraneous crap, and if I 
wanted to dig for it there are plenty of other projects I could do so in.

Rob

P.S.  I just whipped up a first pass of the shell script, it's attached.
-- 
GPLv3: as worthy a successor as The Phantom Menace, as timely as Duke Nukem 
Forever, and as welcome as New Coke.

["zapchroot" (application/x-shellscript)]

#!/bin/bash

# Copyright 2010 Rob Landley <rob@landley.net> licensed under GPLv2

if [ "$1" == "-d" ]
then
  DELETE=1
  shift
fi

# Clean up a chroot directory

ZAP=$(readlink -f "$1" 2>/dev/null)

if [ ! -d "$ZAP" ]
then
  echo "usage: zapchroot [-d] dirname"
  exit 1
fi

i="$(readlink -f "$(pwd)")"
if [ "$ZAP" == "${i:0:${#ZAP}}" ]
then
  echo "Sanity check failed: cwd is under zapdir" >&2
  exit 1
fi

# Iterate through the second entry of /proc/mounts in reverse order

for i in $(awk '{print $2}' /proc/mounts | tac)
do
  # De-escape octal versions of space, tab, backslash, newline...
  i=$(echo -e "$i")

  # Skip entries that aren't under our chroot
  [ "$ZAP" != "${i:0:${#ZAP}}" ] && continue

  echo "Umounting: $i"
  umount "$i"
done

_______________________________________________
busybox mailing list
busybox@busybox.net
http://lists.busybox.net/mailman/listinfo/busybox

[prev in list] [next in list] [prev in thread] [next in thread]