Improving compression with a preset DEFLATE dictionary

polyfemos76 · on March 30, 2015

We use this technique in Just Dance Now, and I wholeheartedly recommend doing it for usecases where you need to compress lots of small packets individually. We save at least 50% data.

For our game, this consists of the normal communication going back and forth between server and client, where individual messages are normally <100b.

By sampling some user sessions we built a dictionary, fed it into deflateSetDictionary/inflateSetDictionary, and use that in both Node modules for the server, as well as throughout the ecosystem of Android and iOS.

It's very simple to do; we base all communication on standard JSON (for ease of portability and debugging) but then have the preset-dictionary fed into zip (a simple large string!) to counter the weight of JSON.

Needless to say, binary packing the protocol would give additional gains; but the productivity gains with using dictionaries to get good compression of pure JSON was a great choice for us. And it's really supersimple to do.

One nice benefit is that the protocol is extendable by any developer without worrying about the wire-transport; the compression won't break is someone decides to add a parameter to a message, it just most likely won't be compressed as well as the rest of the message.

JoeAltmaier · on March 30, 2015

Hey, somebody has a patent on pre-loading the compression tables with frequently-encountered strings. So heads up.

It was patented in the context of headers for text messages, which are sent for every message, are generally larger than the message but appear exactly once in each text so never end up in the compression table. Each side of the conversation runs the zip algorithm on the header strings, discards the output up to that point, then compress the text message from that point. On receive they preload the algorithm with the checkpointed data structures from the preload-compression.

This Google technique is probably different enough to be off their radar. But who knows.

I think its the guys who do DirectPC

makmanalp · on March 30, 2015

Wow, do you have a link to the patent? It'd be surprising to me if there wasn't prior art. Might be worth asking http://patents.stackexchange.com/ !

JoeAltmaier · on March 30, 2015

Came across it 7 years ago when new at my current project. Did research into efficient compression schemes. Had to change our scheme to avoid the patent. It was Hughes I think.

JoeAltmaier · on March 30, 2015

Ok I searched patents using 'compression dictionary' and came up with a boatload. Folks have been using dictionaries since 2005 at least, to replace common words e.g. XML tags with tokens. E.g. patent 20050027731: Daniel Revel

dantillberg · on March 30, 2015

I love the explanation of how LZ77 works; compression algorithms often seem like they must use some sort of arcane magic, so it's great to lift the veil a bit.

I wonder: how many folks have Little Bunny Foo Foo stuck in their heads after reading this?

userbinator · on March 30, 2015

The "arcane magic" impression likely comes from the heavy use of information theory concepts and maths that are encountered in the traditional literature. Probably necessary to analyse in detail, but quite daunting for the beginner. I find LZ to be particularly intuitive since it embodies the natural notion of "abbreviation" by referring to something we've seen before, and writing compressors/decompressors for the algorithm is relatively easy (especially decompressors.)

On the other hand Huffman is somewhat less easy to grasp intuitively, but some aspects of its principle of reducing bits/symbol based on their frequency shows up in natural languages: the more common a word is, the shorter it tends to be. It's also a bit more difficult to implement.

eru · on April 1, 2015

And as words become more common (over time or in specific parts of the population), they often become shorter, too.

Eg see the evolution from phone / mobile phone, to landline / phone.

ck2 · on March 30, 2015

Now go read about Markov chains which will blow your mind.

Back in the late 80s I read an interview with Phil Katz who mentioned them to possibly improve PKZIP so I looked them up in the library. That's when it dawned on me there were people far, far more intelligent and clever in this world than I would ever be.

mnw21cam · on March 30, 2015

Now go and read about the Burrows-Wheeler block-sort algorithm, which is used by the bzip2 compression program. This can only be a product of a sick and twisted (but quite brilliant) mind.

mbrock · on March 30, 2015

Wow, I remember skipping class to sit in the library and implement the Burrows-Wheeler transform. Such a wonderfully clever algorithm, particularly the inverse step!

draugadrotten · on March 30, 2015

...and while we talk about blown minds and algorithms, be sure to watch the beautofully executed Quick-sort with Hungarian (Küküllőmenti legényes) https://www.youtube.com/user/AlgoRythmics/videos

talideon · on March 30, 2015

LZ77 is pretty trivial as compression algorithms go. Go take a look at this video for a explanation accessible to non-developers: https://www.youtube.com/watch?v=goOa3DGezUA

Dylan16807 · on March 30, 2015

The arcane part is what happens after the repetitions have been found and replaced. At that point you have the entropy encoding stage where more common tokens get shorter encodings and more rare tokens get longer encodings. The basic idea isn't too complex, but then optimizing it adds extra layers to efficiently store which tokens are common and which are rare. This results in something quite arcane indeed.

gopalv · on March 30, 2015

Annoying detail for DEFLATE + dictionary is that the dictionary has to be provided within every reader (setDictionary in java).

The output from the DEFLATE stream is not standalone, which makes it really hard to use - a more ideal approach would've been to provide training to the Huffman tree & guess best window sizing from a sample.

But SDCH solves that problem by storing an external named dictionary - clients will read it off a common HTTP location if it's missing in the cache.

Within that, there's a pretty neat standard which SDCH uses - RFC 3284.

I ran into it trying to optimize holding a few billion serialized blogs within an in-memory store with transactional updates - the issue was mostly the in-memory sizes of serialized objects.

I ended up using a versioned external dictionary for compression, which was an extension to VCDIFF with 1 additional instruction in its streams - COPY_DICT.

That concept can be pushed further, to get to something more generic like femtozip - https://github.com/gtoubassi/femtozip/wiki/How-femtozip-work...

ctz · on March 30, 2015

I wonder if compression with just a preset dictionary (i.e. no learning from the plaintext) would produce reasonable compression ratio. That would be interesting because it would be entirely BREACH resistant.

halflings · on March 30, 2015

Never thought I'd ever understand an article about compression :) Great article (loved the "hopping bunny" example) and it almost makes me think that this could replace the need for dynamically loading content in some cases (when network latency, and not DOM rendering and such, is the issue).

jamix · on March 31, 2015

Unfortunately, the efficiency of the proposed approach is only limited to the first 32 KB of the page being compressed. Once we move past the first 32 KB, LZ77's lookbehind window will no longer include the dictionary and the compression ratio of the rest of the page will stay the same.

Pre-seeding the dictionary could have a better effect in an LZ78 scheme where there is an actual dictionary built and stored in memory as opposed to a moving lookbehind window.

This also makes me wonder how the dictionary is delivered to the end client. If it's sent with each page request, then it hardly makes any sense as you are hauling an extra 16 KB or 32 KB of the dictionary just to have a better chance of compressing the page's first 32 KB.

I wonder if the size of the dictionary itself is factored in the compression benchmarks provided in the article.

ck2 · on March 30, 2015

I've wondered why with storage being so cheap there isn't a compression method with a 1TB dictionary that everyone stores locally and just send the "map" or "tree" instead.

mnw21cam · on March 30, 2015

Simply because indexes into a 1TB dictionary would require 40 bits just for the position, rather than the 15 bits for a 32kB sliding search window. You would also have to implement some sort of indexing system to improve the speed of compression.

That's not to say it isn't possible - rsync does it when updating the contents of a file. I wrote (about five years ago) a program that does it too. What you get is a file that can reconstruct the original file assuming an exact copy of a reference file is present. I use it for my incremental backup system.

ck2 · on March 30, 2015

40bits if you aren't thinking in offsets.

Maybe 40bits for the starting point.

dspillett · on March 30, 2015

A single 1Tb dictionary would be nonsensical as would be far too large for many devices, and you wouldn't be able to use it as an "initial state" dictionary for the most commonly supported compression method (as this article is discussing), and if using it for some other compression method (so the 32Kb window limit is not significant) large chunks of it would be useless for most (if not all) content so the large indexing key size could be an issue.

You could use variable length indexing method to minimise that third point though, much like UTF8 for text - the vast majority of references need not require close to the 40 bits.

A large (up to 1Gb - 32Kb of 32Kb) collection pre-made dictionaries might be more practical though, and you wouldn't have to pre-load the entire data-set on every device/client. Content expecting the dictionary would need to be labelled as such anyway, so ad a dictionary number to that label. If the client didn't already have that dictionary then it would be an extra HTTP request to get it that once but it would always have it for future use (or for smaller devices you might only hold a portion of the collection in a round-robin manner).

Of course someone would have to make the dictionaries, and everyone and his dog would prefer a different set so standardising could be an issue (if there were several "standards" the footprint would grow significantly). You would want a selection optimised for HTML, a selection for CSS, a selection for Javascript, some for English text, and some for other common languages (natural or technical). When compressing you would either need to pick one of the options arbitrarily (for HTML always use dictionary 3, for instance) or try with several and pick the best (which would be CPU intensive for dynamic content though the results could be cached for static). Clients would need to state they supported the new format, and servers would need to be able to serve content not requiring it as needed, servers would also need to host the dictionaries as you don't want to be adding an extra external point of failure. Oh, and now you have the global presets on each host there would need to be some way of protecting against pollution due to reasons of accidental damage (simple on-disk or in-transit corruption), malicious intent (a DoS attempt), or stupidity (some one deciding it'll be fine to use an as yet unallocated standard preset ID for something more local).

To get around the "standard set" issue in some cases you could perhaps support local standards (so the server could ask for "prime the dictionary with local block 5" instead of "prime the dictionary with standard block 123" and have these preset dictionaries cached and expired just like any other other object read over HTTP)

So while fraught with potential problems it is not an idea without merit - something practical could be gained, though there is debate to be had on whether the effort is worth while. I suspect it wouldn't be, but if I won a lottery and had free time on my hands I might be interested to play with the idea further!

EDIT:

Having actually read the article, the local dictionaries method is very similar to what Google do with SDCH. I wonder if other browsers will ever bother to support it... For dynamically loaded content you could implement it yourself in javascript but that would kill external indexing so would be a problem for content you care about SEO and similar regarding (it could be useful for page/site 3rd party comment services though, and others that that already load all their content that way).

marcosdumay · on March 30, 2015

> You could use variable length indexing method to minimise that third point though, much like UTF8 for text - the vast majority of references need not require close to the 40 bits.

It's the other way around. If you used variable length indexes, most of your symbols will be much larger than 40 bits.

Take as an example a set of 256 byte long symbols. If you shorten one of them to a "1" bit, you just used half of the symbols you could represent with that byte, and almost half of them will need to grow at least one bit.

dspillett · on March 31, 2015

Many symbols being longer only matters if they are common (and your encoding method doesn't allow you to leave affected sequences plain to avoid increasing their size), and if, moving away from UFT8-a-like encodings which might have been a bad example, you allow context beyond one symbol you can break them into pages making symbols after a page selection smaller (though the next one that requires a new page to be selected will necessarily be larger).

There are various ways you could arrange the symbol encoding and how deal with those parts of the input that can't be encoded effectively (where the encoding would increase size overall instead of decreasing it). I'm pretty sure it would work, though whether it would be truly practical (in terms of the return from making the effort) rather than just being an interesting little thought experiment is definitely a valid discussion point! The idea of a collection of preset initial dictionaries for LZ77 and similar feels much more practical than the single large reference dictionary for some other scheme.

lmm · on March 30, 2015

Because that doesn't actually help you unless there are particular byte sequences that are more common in "files in general" than others. Imagine storing a "dictionary" of all 100-byte strings. Problem is, there are 2^100 such strings, so any "index" into this dictionary would itself be 100 bytes long.

pestaa · on March 30, 2015

But with computing becoming cheaper, you only need 2 integers to compress any data.

http://penduin.blogspot.hu/2006/10/pi-compression.html

chriswarbo · on March 30, 2015

Two integers and a load of computation? How wasteful.

Just treat your bits as the binary representation of a Natural number and you're done. No computation required for encoding/decoding, and you only need to store one number :)

PepeGomez · on March 30, 2015

It wouldn't work. The pointers would take on average just as many bits as the original file, if not more.

jimktrains2 · on March 30, 2015

I remember a friend and I working that out in High school (2003-2004) and coming to the conclusion that since it requires more of π to find a longer sequence, the start index takes up a significant amount of space. (IIRC, this was a while ago)

FroshKiller · on March 30, 2015

I mean, probably because storage IS cheap. Why do I want the hassle and overhead of this kind of compression if I can just spend a few dollars and double my current storage?

est · on March 30, 2015

Or just make jQuery as builtin.

_kerbal_ · on March 30, 2015

The real solution to that is to allow sites to specify the SHA of their resources. If any file from _any site_ is already in the cache with that SHA, it won't need to be loaded.

There's already some work being done in that general direction: http://www.w3.org/TR/SRI/

bhouston · on March 30, 2015

I'd just prefer something like LZHAM to be implemented as an alternative to gzip, Deflate for http compression. It would be amazing for serving content over the web to browser-based 3D applications like https://Clara.io

Details: https://github.com/richgel999/lzham_codec

corysama · on March 30, 2015

LZHAM is looking great. But, Rich says that it takes a while to warm up. So, it's not great for separated small files. Separate small files is exactly what shared dictionary compression is great for.

anon4 · on March 30, 2015

Can't you just implement that in javascript, then load the entire webpage that way?

bhouston · on March 30, 2015

You can of course use things like:

https://github.com/nmrugg/LZMA-JS

But nothing beats native code for decompression speed and also if it is native it can happen asynchronously outside of the JavaScript's limited threading model (although it is always getting better) and most importantly, it can also happen outside of the JavaScript memory space.

acqq · on March 30, 2015

How is the new deflate preset dict passed to the browser? If the dict is normally empty and the protocol even supports sending of different preset, how is that shorter than before, if the new message has a new dictionary?

jkot · on March 30, 2015

Some databases are creating compression dictionary during compaction.

ot · on March 30, 2015

Interesting, can you mention them?

mamcx · on March 30, 2015

Wonder if this trick could be used agains large binary files (like a database backup)?