Project Nayuki


Unspecified edge cases in the DEFLATE standard

Once upon a time in summer 2011, I wrote a fully functional decompressor for GZIP and DEFLATE. I implemented the DEFLATE decompression algorithm entirely based on the standards document, and not on any pre-existing implementations or test cases or compressed data available in the wild. As I was writing the code, I constantly paid attention to special cases, edge cases, omitted cases, and such, because I wanted to have correct and defensive code.

The DEFLATE format is a compression standard specified in RFC 1951 (plain text or PDF) in year 1996. The format has widespread use, and can be found in protocols and file formats such as ZIP, GZIP, PNG, HTTP, and Git. As with most compression formats, the standards document primarily only specifies how to decompress data, not how to compress it. (Any compressed data that can decompresses to the desired result is considered to be compliant.) The document is definitely one of the better specifications I have read and used. It is relatively careful about specifying things like bit order, some exceptional cases, some weird but allowable behavior, and some invalid codes.

However, in the process of writing my own implementation of a DEFLATE decompressor, I could not help but to scrutinize and question every detail, and I believe I found a number of places where the standards document is unclear about what data is or is not permissible. Almost all the problems that I found are concerning how the Huffman code is specified in the dynamic Huffman code block type (section 3.2.7 in the document).

Range of length symbol #284

The document (section 3.2.5) says that length symbol #284 has 5 extra bits and covers the range [227, 257]. But with 5 bits, it’s in fact possible to cover the range [227, 258]. Furthermore, a length of 258 can already encoded with the length symbol #285. What’s going on here? (This applies to any Huffman coded block, whether using the fixed Huffman code or a dynamic one.)

Code length symbol #16 at beginning of sequence

Code length symbol #16 means to repeat the previous code length for some number of times. But this makes no sense at the beginning of the sequence of code lengths, when no code lengths have been written. The document does not explicitly say whether this is an error or not. (This applies to dynamic Huffman-coded blocks only.)

More code lengths than needed

By using the code length repeat symbol (#16, #17, or #18), it is possible to produce a sequence with more code lengths than needed. (The precise number needed is HLIT + HDIST + 258.) The document does not say if this is an error. (This applies to dynamic Huffman-coded blocks only.)

All zero code lengths for distance code

The document says “One distance code of zero bits means that there are no distance codes used at all”, which covers the case where HDIST = 0. However, if HDIST > 0 and all the symbol still have a code length of 0, is this considered to be the same as the previous case, or is it an error? (This applies to dynamic Huffman-coded blocks only.)

Encodable lengths without distance code

If at least one length symbol has a nonzero code length but the distance code is empty, then using a length symbol makes no sense because the distance symbol that must immediately follow it cannot be decoded. The document does not say if setting up such a code is an error, regardless of whether a length symbol is exercised or not. (This applies to dynamic Huffman-coded blocks only.)

No code for end-of-block symbol

It’s possible make the end-of-block symbol (#256) unencodable by giving it a code length of zero. But this is a bad idea because it forces the assumptions that the stream is infinite and the Huffman codes will never need to change. The real questions is, is this actually legal or not? (This applies to dynamic Huffman-coded blocks only.)

Only one symbol in literal/length code or code length code

The document says that it’s legal for the distance code to have only one symbol. Is it also legal for the literal/length code or code length code to have only one symbol? A legitimate degenerate case for the literal/length code is if the block has no literal output and ends immediately with the end-of-block symbol. (This applies to dynamic Huffman-coded blocks only.)

Number of literal codes

The document says that the number of literal codes is HLIT + 257, and it is in the range [257, 286]. But HLIT is 5 bits, so theoretically the range is [257, 288], which is larger. Is HLIT ≥ 30 actually illegal?

Underfull Huffman code

The Huffman code is specified by the code length for each symbol. It is clearly an error if the code lengths produce an overfull tree (for example, 3 symbols each with code length 1). But it is not clear if a sequence of code lengths that produces an underfull tree is an error or not (for example, 2 symbols each with code length 4). (This applies to dynamic Huffman-coded blocks only, and applies to all three Huffman codes: Code length code, literal/length code, and distance code.)

More info