Optimal text segmentation for QR Codes

Introduction

The QR Code standard offers multiple ways to encode text. The text string can be split into segments, and each segment can have a different mode. The primary 4 modes are numeric, alphanumeric, byte, and kanji. Each mode supports a different set of representable characters and has a different bit cost per character.

The demo here shows that we can find the optimal way to split any given string into QR Code segments with the best modes. The technique of dynamic programming is used to minimize the total number of bits to encode.

Live application (JavaScript)

User input

Show example text:

Text string:

Error correction level:

Low Medium Quartile High

Force minimum version:

(between 1 and 40)

Program output

Optimal text split

Segment details

Idx	Mode	Head bits	Data bits	Seg bits	Chars	Text string

Table column explanations:

Various statistics

Property	Optimal	Byte-only	Savings
Text length (code points)
Number of segments			N/A
QR Code version
Total segment bits

Table explanations:

Notes

For each segment, its number of header bits depends on the mode and the QR Code version. The header bits strictly increase when going from version 9 to version 10 and when going from 26 to 27. The only exception is that in byte mode, the header bits don’t increase from version 26 to 27. Because of this behavior, a text segmentation that’s optimal for versions 1 to 9 might not be optimal for other versions.
The algorithm starts at the lowest version number, computes the optimal segmentation, and adds up all the segment bits. If all the data and error correction code codewords fit in a QR Code of that version, then we’re done. Otherwise it increments the version number and repeats the process. This is why it’s necessary to specify the error correction level, because it influences the minimum version needed to contain a given text string. Also, the minimum version user input can be used to artificially force the version higher to see how it affects the segmentation.
This JavaScript demo only computes how a given text string should be split into optimal segments and what the final size is. However, it doesn’t actually encode the bits or generate QR Codes. The source TypeScript code and compiled JavaScript code are available for viewing. In the codebase, the most important function which performs the dynamic programming to compute the optimal segmentation is computeCharacterModes().
My Java QR Code library has a class (QrSegmentAdvanced) which performs the optimal segmentation and encodes the character bits. Then, the list of segments can be fed into the QR Code generator to make a barcode. This functionality is available exclusively in the Java language port, not the other languages.
The QR Code specification appendix section gives an algorithm to minimize the bit stream length, having the same goal as my algorithm. But the wording in the QR spec is difficult to understand, I believe it leaves out some cases, and I believe there are situations where their algorithm doesn’t yield the smallest possible output. Whereas I believe my independently formulated algorithm is complete and correct (minimal) for all possible inputs. Either way, note that optimization is optional for the encoder, and it’s legal to use more bits than strictly necessary to encode a piece of text.
It’s possible to use Extended Channel Interpretation mode segments to switch to national character encodings and save space. For example, text with many accented Latin characters can be encoded in fewer bytes in ISO-8859-1 than in UTF-8. However, the meanings of ECIs are outside the QR Code specification, using any character encoding other than UTF-8 is generally a bad idea, and supporting tens of different character encodings would greatly complicate my optimization algorithm.
The dynamic programming algorithm for optimal segmentation runs in linear time and linear space. It requires scanning the entire string once forward, once backward, and once more forward; hence it is not a streaming algorithm.

The algorithm

Basic info on how QR Code segments work:

Each segment has three fields: Mode (4 bits), character count (variable bits depending on mode and current version), and character data (variable bits depending on mode and character count).
For a given mode, the number of character count bits either stays the same or increases as the QR Code version increases. For example, a numeric mode segment’s character count field is 10 bits for versions 1 to 9, 12 bits for versions 10 to 26, and 14 bits for versions 27 to 40.
Numeric mode can encode the set of 10 characters 0123456789. It uses 3¹/₃ bits per character, rounded up to an integer. For example, a numeric mode segment containing 0 characters uses 0 bits; 1 char is 4 b; 2 char is 7 b; 3 char is 10 b; 4 char is 14 b.
Alphanumeric mode can encode the set of 45 characters 0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ $%*+-./:. It uses 5¹/₂ bits per character, rounded up to an integer.
Byte mode is conventionally interpreted as UTF-8 text. Hence any Unicode code point can be encoded, and each CP uses between 1 and 4 bytes. The mode uses 8 bits per byte (obviously), which means 8 to 32 bits per CP.
Kanji mode can encode most of the characters defined in Shift JIS. The encodable set includes kanji used in Japan, hiragana, katakana, East Asian punctuation, full-width ASCII, Greek, Cyrillic, and miscellaneous, but not ordinary ASCII. The mode uses 13 bits per character.

Let s be a text string we want to segment optimally, and let n be the length of s in code points. Let v be the QR Code version we are using. First, define two mutually recursive functions:

Let f(i, m) be the minimum number of bits to encode the foremost i (0 ≤ i ≤ n) code points of s, such that the encoding ends with a segment of mode m. For example, f(2, kanji) represents the minimum bits to encode the length-2 prefix of s such that the last segment (which may have zero characters) is in kanji mode. Note that f is never infinity.
Let g(i, m) be the minimum number of bits to encode the foremost i+1 (0 ≤ i < n) code points of s, such that the last CP of the prefix (which is s[i]) is encoded in mode m. If s[i] cannot be encoded in mode m, then g(i, m) = ∞.
Note that both functions allow fractional bit counts. To ensure the numerator is always an integer, the smallest choice of denominator is 6. Numeric mode uses ²⁰/₆ bits per character; alphanumeric mode uses ³³/₆; byte mode uses ⁴⁸/₆ to ¹⁹²/₆; kanji mode uses ⁷⁸/₆.

Second, we reason about the base case. For each mode m, let f(0, m) = (bit length of empty segment of mode m) = 4 + (number of character count bits for a segment of mode m at version v).

Third, we reason about the inductive step:

For each index 0 ≤ i < n and each mode m, to compute g(i, m), we look at f(i, m) because it already ends in mode m, so we try to add one character at the end in mode m. If the code point s[i] is encodable in mode m, then let g(i, m) = f(i, m) + (bit length of one character in mode m). Otherwise let g(i, m) = ∞.
For each index 1 ≤ i ≤ n and each mode m, to compute f(i, m), we look at g(i−1, m′) for all possible modes m′. To switch from mode m′ to m, we need to finish the segment of mode m′, which means rounding up a fractional bit to a whole bit. Then we start a new segment of mode m, which incurs a bit cost. Hence, f(i, m) = min [ceiling(g(i−1, m′)) + (bit length of empty segment of mode m)] over all m′.

Furthermore, let h(i, m) be the mode m′ that was selected in the minimization to achieve the value of f(i, m). In other words, f(i, m) bits is achieved when the last character is encoded in mode h(i, m).

Now we have computed a table of bit costs, which is f(i, m) for 0 ≤ i ≤ n and for m in all 4 possible modes. Initialize a blank array c of length n to store the best mode needed to encode each code point.

Fourth, we work backward in the cost table to figure out the best mode for each code point index:

Find the mode m that minimizes f(n, m). This tells us the best mode be in after segmenting the whole text string.
Update m = h(n, m) so that m represents the best mode to encode the code point at index n−1, and store c[n−1] = m.
Update m = h(n−1, m), and store c[n−2] = m.
Repeat until we finish storing to c[0].

Finally, with the array c representing the best mode for each code point in the input string, we scan the array forward, starting a new segment whenever the mode changes, and putting each current code point into the current open segment.