Reference arithmetic coding

This project is a clear implementation of arithmetic coding, suitable as a reference for educational purposes. It is provided separately in Java, Python, and C++, and its code is open source. The code can be used for study, and as a solid basis for modification and extension. Consequently, the codebase optimizes for readability and avoids fancy logic, and does not target the best speed/memory/performance.

Arithmetic coding is an ingenious generalization of Huffman coding that allows each symbol to be coded with a non-whole number of bits (when averaged over the entire message), thus improving compression efficiency. But while Huffman coding is relatively straightforward to understand and implement, arithmetic coding involves many subtle details and requires numbers to be handled very carefully to prevent out-of-range conditions, overflow, etc. I will not attempt to explain arithmetic coding on this page, since there are tutorials readily available from searching the web.

The intent and structure of this arithmetic coding implementation are very similar to my Huffman coding implementation. You might want to study that code first if you encounter difficulty understanding this one.

Source code

Primary: https://github.com/nayuki/Reference-arithmetic-coding

Mirror: https://gitlab.com/Nayuki/Reference-arithmetic-coding

Overview

Arithmetic encoding takes a sequence (stream) of symbols as input and gives a sequence of bits as output. The intent is to produce a short output for the given input. Each input yields a different output, so the process can be reversed, and the output can be decoded to give back the original input.

In this software, a symbol is a non-negative integer. The symbol limit is one plus the highest allowed symbol. For example, a symbol limit of 4 means that the set of allowed symbols is {0, 1, 2, 3}.

The following explains all the submodules in the software package:

Sample applications

Three pairs of command-line programs fully demonstrate how this software package can be used to encode and decode data using arithmetic coding.

The class ArithmeticCompress derives a static frequency table and writes it to the compressed file, and ArithmeticDecompress reads the frequency table and uses it to decode all the symbols.
The classes AdaptiveArithmeticCompress and AdaptiveArithmeticDecompress start with a flat frequency table and update it after each byte is processed, thus making it reflect the statistics of the file being compressed.
The classes PpmCompress and PpmDecompress implement a basic version of prediction by partial matching. In PPM, the frequency predictions for the next symbol is based on the previous n symbols processed.

Encoder/decoder

The classes ArithmeticCoderBase, ArithmeticEncoder, and ArithmeticDecoder implement the basic algorithms for encoding and decoding an arithmetic-coded stream. The frequency table can be changed after encoding or decoding each symbol, as long as the encoder and decoder have the same table at the same position in the symbol stream. At any time, the encoder must not attempt to encode a symbol that has a zero frequency.

Frequency tables

Objects with the interface FrequencyTable keep track of the frequency of each symbol, and provide cumulative frequencies too. The cumulative frequencies are the essential data that drives arithmetic coding.

Bitwise I/O streams

The classes BitInputStream and BitOutputStream are bit-oriented I/O streams, analogous to the standard bytewise I/O streams. However, since they use an underlying bytewise I/O stream, the bit stream’s total length is always a multiple of 8 bits.

Test suite

A JUnit test suite checks that compressing and decompressing an arbitrary byte sequence will give back the same byte sequence. This is done on short, simple sequences and on longer random sequences. The test suite checks the 4 main programs mentioned above (ArithmeticCompress, etc.).

Other languages

The Python and C++ versions of this library were ported from the Java version. They contain all the core functionality provided by the Java version, but not some of the extended features or verbose documentation comments. The API naming and semantics follow the Java version, unless the target language has a more idiomatic way of doing things (e.g. underscore names and native bigint in Python). All language versions encode and decode exactly the same data. For example, it is acceptable to encode a file using ArithmeticCompress.java and then decode it using arithmetic-decompress.py.

Limitations

FrequencyTable works with alphabets of up to Integer.MAX_VALUE-1 (i.e. 2³¹ − 2) symbols in Java, UINT32_MAX-1 (i.e. 2³² − 2) symbols in C++, and unlimited symbols in Python.
FrequencyTable has a maximum total of Integer.MAX_VALUE in Java, UINT32_MAX in C++, and unlimited in Python.

Suggestions

Here are some ideas on how to use, modify, or extend this software:

Speed up the frequency table by using binary indexed trees / Fenwick trees, so that changing a symbol’s frequency and getting cumulative frequencies are both fast in Θ(log n) time.
Improve the prediction by partial matching (PPM) model – statistics, order selection, behavior of escapes, exclusions, etc.
Take an existing compression algorithm that uses Huffman coding, and retrofit it to use arithmetic coding instead. Observe how easy or difficult the adaptation process is, and how much the compression efficiency improves by.

Project Nayuki

Reference arithmetic coding

Source code

Overview

Limitations

Suggestions

More info