# Reference arithmetic coding

This project is a clear implementation of arithmetic coding, suitable as a reference for educational purposes. It is provided separately in Java and Python, and is open source. The code can be used for study, and as a solid basis for modification and extension. Consequently, the codebase optimizes for readability and simplicity rather than speed, memory usage, or other performance measures.

Arithmetic coding is an ingenious generalization of Huffman coding that allows each symbol to be coded with a non-whole number of bits (when averaged over the entire message), thus improving compression efficiency. But while Huffman coding is relatively straightforward to understand and implement, arithmetic coding involves many subtle details and requires numbers to be handled very carefully to prevent out-of-range conditions, overflow, etc. I will not attempt to explain arithmetic coding on this page, since there are tutorials readily available from searching the web.

The intent and structure of this arithmetic coding implementation are very similar to my Huffman coding implementation. You might want to study that code first if you encounter difficulty understanding this one.

## Source code

Browse the full source code at GitHub: https://github.com/nayuki/Reference-arithmetic-coding

Or download a ZIP of all the files: https://github.com/nayuki/Reference-arithmetic-coding/archive/master.zip

## Overview

Arithmetic encoding takes a sequence (stream) of symbols as input and gives a sequence of bits as output. The intent is to produce a short output for the given input. Each input yields a different output, so the process can be reversed, and the output can be decoded to give back the original input.

In this software, a symbol is a non-negative integer. The symbol limit is one plus the highest allowed symbol. For example, a symbol limit of 4 means that the set of allowed symbols is {0, 1, 2, 3}.
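To make the symbol and frequency model concrete, here is a minimal Python sketch of a frequency table with cumulative frequencies. It mirrors the ideas described here, but the class and method names are illustrative assumptions, not the package's actual API.

```python
# Hypothetical sketch of a frequency table (not the package's actual API).
# Symbols are non-negative integers 0 .. symbol_limit - 1.
class SimpleFrequencyTable:
    def __init__(self, frequencies):
        self.frequencies = list(frequencies)  # frequencies[i] = count of symbol i

    def get_symbol_limit(self):
        # One plus the highest allowed symbol
        return len(self.frequencies)

    def get_low(self, symbol):
        # Cumulative frequency of all symbols strictly below `symbol`
        return sum(self.frequencies[:symbol])

    def get_high(self, symbol):
        # Cumulative frequency of all symbols up to and including `symbol`
        return sum(self.frequencies[:symbol + 1])

    def get_total(self):
        # Sum of all symbol frequencies
        return sum(self.frequencies)

freqs = SimpleFrequencyTable([2, 5, 1, 3])  # symbol limit 4: symbols {0, 1, 2, 3}
print(freqs.get_symbol_limit())         # 4
print(freqs.get_low(1), freqs.get_high(1))  # 2 7
print(freqs.get_total())                # 11
```

The half-open range [get_low(s), get_high(s)) out of get_total() is exactly the per-symbol data that arithmetic coding consumes.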

The following explains all the submodules in the software package:

- Sample applications: Two pairs of command-line programs fully demonstrate how this software package can be used to encode and decode data using arithmetic coding. One pair of programs is the classes `ArithmeticCompress` and `ArithmeticDecompress`, which use a static frequency table. The other pair of programs is the classes `AdaptiveArithmeticCompress` and `AdaptiveArithmeticDecompress`, which use an adaptive/dynamic frequency table.
- Encoder/decoder: The classes `ArithmeticCoderBase`, `ArithmeticEncoder`, and `ArithmeticDecoder` implement the basic algorithms for encoding and decoding an arithmetic-coded stream. The frequency table can be changed after encoding or decoding each symbol, as long as the encoder and decoder have the same table at the same position in the symbol stream. At any time, the encoder must not attempt to encode a symbol that has a zero frequency.
- Frequency tables: Objects with the interface `FrequencyTable` keep track of the frequency of each symbol, and provide cumulative frequencies too. The cumulative frequencies are the essential data that drives arithmetic coding.
- Bitwise I/O streams: The classes `BitInputStream` and `BitOutputStream` are bit-oriented I/O streams, analogous to the standard bytewise I/O streams. However, since they use an underlying bytewise I/O stream, the bit stream's total length is always a multiple of 8 bits.
- Test suite: A JUnit test suite checks that compressing and decompressing an arbitrary byte sequence gives back the same byte sequence. This is done on short, simple sequences and on longer random sequences. The test suite checks the 4 main programs mentioned above (`ArithmeticCompress`, etc.).
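The heart of the encoder is the step that narrows the current interval according to a symbol's cumulative frequency range. The following is a conceptual Python sketch of that step using unbounded integers; the real classes do the equivalent with carefully bounded fixed-width arithmetic to prevent overflow, so this is an illustration, not the package's implementation.

```python
# Conceptual sketch of arithmetic coding's range-narrowing step
# (uses unbounded Python ints; real implementations bound the state).
def narrow_range(low, high, sym_low, sym_high, total):
    """Narrow the half-open interval [low, high) to the sub-interval for a
    symbol whose cumulative frequency range is [sym_low, sym_high) out of
    `total`. Requires sym_low < sym_high (i.e. a nonzero frequency)."""
    span = high - low
    new_high = low + span * sym_high // total
    new_low = low + span * sym_low // total
    return new_low, new_high

# With frequencies [2, 5, 1, 3] (total 11), symbol 1 occupies range [2, 7):
low, high = 0, 1 << 32
low, high = narrow_range(low, high, 2, 7, 11)   # encode symbol 1
low, high = narrow_range(low, high, 7, 8, 11)   # encode symbol 2, range [7, 8)
```

Each encoded symbol shrinks the interval in proportion to its frequency; rarer symbols shrink it more, which is why they cost more output bits. This also shows why a zero-frequency symbol is forbidden: its sub-interval would be empty.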

## Limitations

All the arithmetic coding related code only works with alphabets of up to `Integer.MAX_VALUE - 1` (i.e. 2^31 − 2) symbols, and `FrequencyTable` has a maximum total of `Integer.MAX_VALUE`.

## Suggestions

Here are some ideas on how to use, modify, or extend this software:

- Speed up the frequency table by using Fenwick trees, so that changing a symbol's frequency and getting cumulative frequencies are both fast, in O(log n) time.
- Implement the prediction by partial matching (PPM) model as a front-end, and use this arithmetic coding implementation as the entropy coder back-end.
- Take an existing compression algorithm that uses Huffman coding and retrofit it to use arithmetic coding instead. Observe how easy or difficult the adaptation process is, and how much the compression efficiency improves.
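As a starting point for the Fenwick tree suggestion, here is a standalone Python sketch of a binary indexed tree supporting O(log n) frequency updates and cumulative frequency queries. It is not integrated with the package's `FrequencyTable` interface; it just demonstrates the data structure.

```python
# Sketch of a Fenwick tree (binary indexed tree) for symbol frequencies:
# both point updates and prefix-sum (cumulative frequency) queries take
# O(log n) time, versus O(n) for a naive array scan.
class FenwickTree:
    def __init__(self, size):
        self.tree = [0] * (size + 1)  # 1-based internal indexing

    def add(self, symbol, delta):
        """Add `delta` to the frequency of `symbol` (0-based)."""
        i = symbol + 1
        while i < len(self.tree):
            self.tree[i] += delta
            i += i & (-i)  # jump to the next node whose range covers i

    def prefix_sum(self, symbol):
        """Return the cumulative frequency of symbols 0 .. symbol-1."""
        total = 0
        i = symbol
        while i > 0:
            total += self.tree[i]
            i -= i & (-i)  # strip the lowest set bit
        return total

ft = FenwickTree(4)
for sym, freq in enumerate([2, 5, 1, 3]):
    ft.add(sym, freq)
print(ft.prefix_sum(2))  # cumulative frequency below symbol 2: 7
print(ft.prefix_sum(4))  # total of all frequencies: 11
```

A symbol's own cumulative range is then `[prefix_sum(s), prefix_sum(s + 1))`, exactly the low/high pair the coder needs.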