
List:       tar-bug
Subject:    [Bug-tar] Feature suggestion: reordering files by extension
From:       Jari Aalto <jari.aalto () cante ! net>
Date:       2006-12-13 9:51:01
Message-ID: 87mz5sceey.fsf () w2kpicasso ! cante ! net


Please consider ordering the files by extension inside tar as this
produces better compression ratios according to Paul Sladen.

Jari

    http://www.paul.sladen.org/projects/compression/
    [...]
    Quick and dirty

    A simpler ordering method, clustering based purely on filename
    and extension, can be produced with a command similar to:

    cat filelist.txt | rev | sort | rev > neworder.txt

    This sorting process works by reversing each line in the file;
    hello.text would become txet.olleh allowing files with similar
    file extensions or basenames to be ordered adjacently. The
    filenames are reversed again producing the file order; this method
    appears to work well for language-packs containing translated
    strings, resulting in a 14% improvement using bzip2 compression
    both before and afterwards, or 2% if using gzip (most files are
    larger than the 32kB window size).

    I came across a paper (without source code) which discusses
    pre-ordering for efficient zdelta encoding as well as the tarfile
    ordering: Compressing File Collections with a TSP-Based Approach
    (PDF)[1]. For this paper, a relatively simple, greedy method is
    chosen, yielding compression improvements of ~10-15% on webpages
    of online news services.

[1]
http://cis.poly.edu/tr/tr-cis-2004-02.pdf
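
Until something like this exists inside tar itself, the "quick and dirty"
ordering quoted above can be approximated from the outside. The following is
only a rough sketch: the function name order_by_suffix, the directory
"project" and the archive name are made up for illustration, and the usage
lines assume GNU tar's -T (--files-from) and --no-recursion options. Sorting
in the C locale is my own choice to keep the order reproducible; it is not
part of the quoted command.

```shell
# Order a newline-separated file list read from stdin so that names
# sharing an extension end up next to each other.
order_by_suffix() {
    # Reverse each line, sort bytewise, then reverse each line back,
    # i.e. the "rev | sort | rev" pipeline from the quoted page.
    rev | LC_ALL=C sort | rev
}

# Hypothetical usage (directory and archive names are examples only):
#   find project -type f | order_by_suffix > neworder.txt
#   tar -cf - --no-recursion -T neworder.txt | bzip2 > project.tar.bz2
```

Like the original command, this breaks on filenames containing newlines, and
whether the reordering helps at all will depend on the data and on the
compressor used.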


