List:       zlib-devel
Subject:    [Zlib-devel] deflateSetDictionary(): How to determine "most commonly used strings"?
From:       madler () alumni ! caltech ! edu (Mark Adler)
Date:       2010-04-17 15:23:37
Message-ID: 64E8A8B8-0D48-42F3-A469-C16DC476535B () alumni ! caltech ! edu

On Apr 16, 2010, at 4:07 PM, John Bowler wrote:
> the dictionary just behaves as though it was prefixed on front of the
> data to be compressed.
...
> So far as I can see the dictionary can only have an effect on the first
> window-size bytes of the uncompressed data, because after that the
> dictionary bytes are no longer visible to the compression algorithm.

Correct.

> 1) HTML files all start the same way.  The most obvious thing to put,
> right at the *start* of the dictionary, is that block starting
> "<html..."; that's an instant saving of 20 or more bytes.

Actually, you want to put the strings most likely to be repeated at the
end of the dictionary, not the start.  Strings at the end of the
dictionary are closest to the data being compressed, so matches against
them have shorter distances, which are coded in fewer bits.
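
A rough way to see the effect of placement, sketched with Python's zlib
module (which passes the dictionary to deflateSetDictionary() through the
zdict argument).  The sample strings, the filler, and the helper names are
made up for illustration; the filler uses only high bytes so that it cannot
match the ASCII input.

```python
import zlib

def deflate_with_dict(data: bytes, dictionary: bytes) -> bytes:
    """Compress data as a zlib stream using a preset dictionary."""
    c = zlib.compressobj(level=9, zdict=dictionary)
    return c.compress(data) + c.flush()

def inflate_with_dict(blob: bytes, dictionary: bytes) -> bytes:
    """Decompress a zlib stream that was made with the same dictionary."""
    d = zlib.decompressobj(zdict=dictionary)
    return d.decompress(blob) + d.flush()

# Hypothetical sample: a common opening string and a short document.
common = b"<html><head><title>"
data = common + b"Hello</title></head></html>"

# Filler of high bytes only (20480 bytes), so it never matches the input.
filler = bytes(range(128, 256)) * 160

dict_common_first = common + filler  # common string far from the data
dict_common_last = filler + common   # common string right before the data

plain = zlib.compress(data, 9)
far = deflate_with_dict(data, dict_common_first)
near = deflate_with_dict(data, dict_common_last)

print("no dictionary:          ", len(plain), "bytes")
print("common string at start: ", len(far), "bytes")
print("common string at end:   ", len(near), "bytes")
```

Both dictionaries let the opening string be coded as a single match; the
difference is only in how many bits the match distance takes, so expect the
gain from placement to be small per match.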

On Apr 16, 2010, at 6:37 PM, Greg Roelofs wrote:
> This is not a commonly used option; it adds complexity with relatively
> little benefit except when compressing a whole bunch of similar, small
> files.

In fact, I know of an application that transmitted vending machine data
as text messages and benefited greatly from this.  It used up to the
previous 32K of messages (which was many messages) as the dictionary for
the next message.  There was a return path for retransmission, so if the
receiver lost lock on the dictionary, the message was sent with no
dictionary and the process started over.
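
A minimal sketch of that kind of scheme, again using Python's zlib module.
This is not the application's actual code: the class name, the message
format, and the way the reset is signalled are all assumptions.  Each end
keeps the last 32K of messages and uses it as the dictionary for the next
one; a receiver whose history does not match gets zlib.error, which stands
in here for "lost lock".

```python
import zlib

WINDOW = 32 * 1024  # deflate's maximum window size

class RollingDictionary:
    """One end of the link; sender and receiver each hold an instance."""

    def __init__(self) -> None:
        self.history = b""

    def _remember(self, message: bytes) -> None:
        # Newest messages go last, where match distances are shortest.
        self.history = (self.history + message)[-WINDOW:]

    def reset(self) -> None:
        """Forget all history so the next message uses no dictionary."""
        self.history = b""

    def encode(self, message: bytes) -> bytes:
        if self.history:
            c = zlib.compressobj(level=9, zdict=self.history)
        else:
            c = zlib.compressobj(level=9)
        blob = c.compress(message) + c.flush()
        self._remember(message)
        return blob

    def decode(self, blob: bytes) -> bytes:
        if self.history:
            d = zlib.decompressobj(zdict=self.history)
        else:
            d = zlib.decompressobj()
        # Raises zlib.error if the stream needs a dictionary we lack.
        message = d.decompress(blob) + d.flush()
        self._remember(message)
        return message

# Hypothetical message format, for illustration only.
sender, receiver = RollingDictionary(), RollingDictionary()
for text in (b"machine=0042 slot=A3 item=cola price=1.25 status=OK\n",
             b"machine=0042 slot=B1 item=chips price=0.95 status=OK\n"):
    blob = sender.encode(text)
    assert receiver.decode(blob) == text
    print(len(text), "bytes sent as", len(blob))
```

On a zlib.error the receiver would call reset() and ask for retransmission
over the return path, and the sender would call reset() before sending the
message again, so both ends start over with an empty history.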

On Apr 17, 2010, at 1:10 AM, Peter Elmer wrote:
> I've been meaning since some time to ask this same question regarding a 
> zlib interface for dictionary discovery.

Actually, you don't need a zlib interface.  You can get the same
information from the compressed data itself.  infgen
( http://zlib.net/infgen.c.gz ) will "disassemble" a deflate stream into
readable descriptions of the contents.  The matches could perhaps be used
to aid in dictionary creation.

Mark

