Project Nayuki


Reference arithmetic coding

This project is a clear implementation of arithmetic coding, suitable as a reference for educational purposes. It is provided separately in Java and Python, and is open source. The code can be used for study, and as a solid basis for modification and extension. Consequently, the codebase optimizes for readability and avoids fancy logic, and does not target the best speed/memory/performance.

Arithmetic coding is an ingenious generalization of Huffman coding that allows each symbol to be coded with a non-whole number of bits (when averaged over the entire message), thus improving compression efficiency. But while Huffman coding is relatively straightforward to understand and implement, arithmetic coding involves many subtle details and requires numbers to be handled very carefully to prevent out-of-range conditions, overflow, etc. I will not attempt to explain arithmetic coding on this page, since there are tutorials readily available from searching the web.

The intent and structure of this arithmetic coding implementation are very similar to my Huffman coding implementation. You might want to study that code first if you encounter difficulty understanding this one.

Source code

Browse the full source code at GitHub: https://github.com/nayuki/Reference-arithmetic-coding

Or download a ZIP of all the files: https://github.com/nayuki/Reference-arithmetic-coding/archive/master.zip

Overview

Arithmetic encoding takes a sequence (stream) of symbols as input and gives a sequence of bits as output. The intent is to produce a short output for the given input. Each input yields a different output, so the process can be reversed, and the output can be decoded to give back the original input.

In this software, a symbol is a non-negative integer. The symbol limit is one plus the highest allowed symbol. For example, a symbol limit of 4 means that the set of allowed symbols is {0, 1, 2, 3}.

The following explains all the submodules in the software package:

Sample applications

Two pairs of command-line programs fully demonstrate how this software package can be used to encode and decode data using arithmetic coding. One pair of programs is the classes ArithmeticCompress and ArithmeticDecompress, which uses a static frequency table. The other pair of programs is the classes AdaptiveArithmeticCompress and AdaptiveArithmeticDecompress, which uses an adaptive/dynamic frequency table.

Encoder/decoder

The classes ArithmeticCoderBase, ArithmeticEncoder, and ArithmeticDecoder implement the basic algorithms for encoding and decoding an arithmetic-coded stream. The frequency table can be changed after encoding or decoding each symbol, as long as the encoder and decoder have the same table at the same position in the symbol stream. At any time, the encoder must not attempt to encode a symbol that has a zero frequency.

Frequency tables

Objects with the interface FrequencyTable keep track of the frequency of each symbol, and provide cumulative frequencies too. The cumulative frequencies are the essential data that drives arithmetic coding.

Bitwise I/O streams

The classes BitInputStream and BitOutputStream are bit-oriented I/O streams, analogous to the standard bytewise I/O streams. However, since they use an underlying bytewise I/O stream, the bit stream’s total length is always a multiple of 8 bits.

Test suite

A JUnit test suite checks that compressing and decompressing an arbitrary byte sequence will give back the same byte sequence. This is done on short, simple sequences and on longer random sequences. The test suite checks the 4 main programs mentioned above (ArithmeticCompress, etc.).

Limitations

Suggestions

Here are some ideas on how to use, modify, or extend this software:

More info