List: tar-bug
Subject: [Bug-tar] Feature suggestion: reordering files by extension
From: Jari Aalto <jari.aalto () cante ! net>
Date: 2006-12-13 9:51:01
Message-ID: 87mz5sceey.fsf () w2kpicasso ! cante ! net
Please consider ordering the files by extension inside tar as this
produces better compression ratios according to Paul Sladen.
Jari
http://www.paul.sladen.org/projects/compression/
[...]
Quick and dirty
A simpler ordering method, which clusters files based purely on
filename and extension, can be produced with a command similar to:
cat filelist.txt | rev | sort | rev > neworder.txt
This sorting process works by reversing each line in the file;
hello.text would become txet.olleh, allowing files with similar
file extensions or basenames to be ordered adjacently. The
filenames are then reversed again, producing the new file order. This
method appears to work well for language-packs containing translated
strings, resulting in a 14% improvement using bzip2 compression
both before and afterwards, or 2% if using gzip (most files are
larger than the 32kB window size).
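As a rough sketch of how the reordering could be wired up today without any change to tar, the pipeline can be wrapped in a small shell function and its output passed to GNU tar's -T (--files-from) option. The function name order_by_reversed_name is hypothetical, and LC_ALL=C is an assumption added here only so that the sort order is the same on every system; neither appears in the quoted page.

```shell
# Hypothetical helper (not from the quoted page): reads file names on
# stdin, one per line, and writes them ordered by their reversed
# spelling, so names sharing an extension end up next to each other.
order_by_reversed_name() {
    # LC_ALL=C is an assumption: it makes sort compare raw bytes,
    # giving the same order regardless of the user's locale.
    rev | LC_ALL=C sort | rev
}

# Possible use (left commented out so the sketch has no side effects):
#   find . -type f | order_by_reversed_name > neworder.txt
#   tar -cf archive.tar -T neworder.txt
```

File names containing newlines would break this line-based approach, so it is only suitable for trees with ordinary names.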
I came across a paper (without source code) which discusses
pre-ordering for efficient zdelta encoding as well as the tarfile
ordering: Compressing File Collections with a TSP-Based Approach
(PDF)[1]. The paper chooses a relatively simple, greedy method,
yielding compression improvements of ~10-15% on webpages
of online news services.
[1]
http://cis.poly.edu/tr/tr-cis-2004-02.pdf