Project Nayuki

DEFLATE library (Java)


This is a follow-up to my simple DEFLATE decompressor in Java, now optimizing for speed but increasing code complexity and decreasing modularity. Both implementations are still designed to correctly handle all possible error conditions that can occur in DEFLATE-format compressed data.

Browse the project’s source code at GitHub:

Or download a ZIP of all the files:


Implementation Run time Input speed Output speed on Oracle JRE, using zlib (C library) 10.92 s 93.7 MiB/s 265.5 MiB/s
Nayuki’s DEFLATE library (this page) (pure Java) 17.52 s 58.4 MiB/s 165.5 MiB/s
Nayuki’s simple DEFLATE implementation (pure Java) 73.78 s 13.9 MiB/s 39.3 MiB/s

From these results, we can see that my (fast) DEFLATE library in Java is 0.62× the speed of (which uses the zlib native library implemented in C). This is a reasonable speed considering that pure-numeric code in Java is about half as fast as the same algorithm compiled in C, due to the JVM’s weaker JIT compiler and the fancier guarantees of memory safety, concurrency, and garbage collection in Java. Secondly, we see that my DEFLATE library is 4.2× the speed of my simple DEFLATE implementation, which is a respectable gain for making the codebase so much harder to read.

The input speed is the rate at which compressed data is consumed, and the output speed is the rate at which uncompressed data is produced. In a typical decompressor, both activities happen at the same time. The two numbers are linked by the compression ratio, which for this test data file is 2.83. The reason both numbers are quoted is because for different compressed files, the input rate is limited by the Huffman decoding speed (which might process one bit at a time in the worst case), and the output rate is limited by the dictionary copy speed (which might copy long runs of bytes using only a few Huffman-coded commands).

Computer system: Oracle Java 1.8.0 Update 45 (64-bit), Intel Core i5-4690 (Haswell) 3.50 GHz, Windows 8.1 Pro (64-bit).

Test data file: enwiki-20140102-page.sql.gz from Wikipedia’s database dumps. Characteristics:

  • Compressed size: 1 073 067 022 bytes (includes ~20 bytes of gzip headers and footers).
  • Decompressed size: 3 040 509 604 bytes of SQL text data.
  • Number of DEFLATE data blocks: 20 476.
  • Block types used: Only type 2 (dynamic Huffman). (In particular, uncompressed blocks are not used.)
  • Median decompressed block size: ~144 000 bytes.
  • Number of Huffman-coded symbols: 305 349 777 literal bytes, 346 661 071 run lengths (and also this many distance symbols), 20 476 end-of-block symbols.
  • Average LZ77 run length: 7.89 bytes.

Improvements and notes

Huffman code tree as an array

The Huffman code tree was a tree of objects with various fields and pointers; now it is a flat array of integers representing the same information. This reduces the needless overhead from reading/writing objects and dereferencing pointers.

The array elements represent internal nodes (which are other array indexes) and leaf symbols. See InflaterInputStream.codeLengthsToCodeTree() for a comment explaining the details of this data structure.

Fast Huffman code table

For even faster decoding of symbols, the Huffman code tree is converted into a look-up table that allows up to n bits (configurable) to be consumed in one shot, yielding what symbol or internal node was decoded and how many bits were actually consumed. This feature interacts favorably with the buffered input bit stream.

Because of these two changes, the decoding of Huffman-coded symbols is vastly sped up and the logic of decodeSymbol() is heavily modified.

Input stream bit buffering

The decompressor keeps a buffer of the next 0 to 63 input bits (variable amount). If a request to read n bits can be fulfilled from the buffer, then it is done in one shot instead of reading each bit individually from the input stream.

If this feature by itself is back-ported to the previous simple DEFLATE implementation (with heavy hacking and patching), it doesn’t yield much speed-up by itself.

Integrated classes

The buffered bit input stream and the circular dictionary have been incorporated into the inflater class’s code. This change reduces the number of object reference indirections when executing the code (thus increasing speed), but it clutters the class’s source code.

Input stream detaching

This decompressor gives control of the underlying input stream back to the caller when the entire DEFLATE data is decompressed. Specifically, it consumes the exact number of bytes needed for decompression and not a single byte more, so the input stream after decompression is at the point immediately following the compressed data.

Because of this feature, the input stream is required to be markable, which is supported by BufferedInputStream, my MarkableFileInputStream, and others.

The simple decompressor implementation also achieves the behavior of not reading more bytes than necessary, but it accomplishes it by reading one byte at a time when requested, which is inefficient because there are no opportunities to read ahead for buffering.

Bounded memory usage

The simple decompressor writes the entire output data to an in-memory array before returning, whereas the fast decompressor writes data to the caller’s array on the fly and uses only a bounded amount of memory for buffers and data structures.

Relatedly, it is possible to modify the code to entirely avoid heap allocation during decompression (except for creating exception objects to throw), so that all the memory allocation is performed when an InflaterInputStream object is created. This can be accomplished by pre-allocating the arrays used by decodeHuffmanCodes(), codeLengthsToCodeTree(), and codeTreeToCodeTable().

Unchanged Huffman code decoding

The logic for decoding Huffman codes (for the dynamic Huffman code block type) is essentially unchanged. This code executes infrequently, and there are few opportunities to eliminate inefficiencies.