Project Nayuki

Huffman-coding English words


We know that some letters occur more frequently than others in English text; E and T are the two most common letters. A typical homework exercise for computer science students is to write a program that creates an optimal Huffman code for the English alphabet, and then to encode some texts with it to observe the compression achieved.

Let’s take it a step further and consider English words instead of individual letters. In a typical block of text, some words – like the, and, is – occur far more often than others. What would happen if we replaced common words with short letter sequences and uncommon words with longer sequences? What would the result look like? Consider this an experiment in increasing the information entropy of natural language text.

Sample text

For these encoding experiments, the full text of the book Alice’s Adventures in Wonderland by Lewis Carroll was used. The text was preprocessed to remove the hard line-wrapping newlines within paragraphs, because the encoding schemes described here change the word lengths and thus the line lengths. When viewing the full text files, you will need a text editor that supports soft line wrapping.

Full text files:

Paragraph samples:

Original: CHAPTER IV. The Rabbit Sends in a Little Bill
Sort coding: MX aIV. B Ec aSends l e Bk Je
Huffman coding: TzvHR. AhEhBAFKWAHdRCMIPPVF QJ

Original: 'Hand it over here,' said the Dodo.
Sort coding: 'Ie g fa eb,' k b Ma.
Huffman coding: 'CTNKEshEPGI,' cDAND.

Original: Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, 'and what is the use of a book,' thought Alice 'without pictures or conversations?'
Sort coding: M n ke d em bg wr f pv ci q rg v b bkt, c f pg fo d bt: fp bd bdd h z bqz cv b nq q rg n bii, ba g z bs bhx bd bxt l g, 'c bh bi b ga f e nq,' cm M 'fq bhx bd bxt?'
Huffman coding: xlTFOCXAdgaUXkBUAGqbAXDAUT, JUZAGuOGk: HEAZqN YAJAQVBaDAERAGqblLFNWIKi, AchAJBTACFAZAwW dAK, 'JBIBGDGAURji,' Bex'EGACFAZAwW?'

Original: Alice was not a bit hurt, and she jumped up on to her feet in a moment: she looked up, but it was all dark overhead; before her was another long passage, and the White Rabbit was still in sight, hurrying down it. There was not a moment to be lost: away went Alice like the wind, and was just in time to hear it say, as it turned a corner, 'Oh my ears and whiskers, how late it's getting!' She was close behind it when she turned the corner, but the Rabbit was no longer to be seen: she found herself in a long, low hall, which was lit up by a row of lamps hanging from the roof.
Sort coding: M n bc e jo bop, c h xx bo v d q ip l e fy: h eo bo, ba g n w bmb aoverhead; ex q n hq fs bht, c b Gh Ec n mo l pu, ahurrying bq g. Br n bc e fy d bb bpo: gp cf M cb b cmc, c n du l cr d kl g ed, p g ka e bes, 'Ep dh bap c buf, cq xy g't hs!' H n lx lv g ch h ka b bes, ba b Ec n bs bpn d bb ks: h fl ce l e fs, ko qo, dn n alit bo ci e chk f alamps boh ea b yq.
Huffman coding: xlAeRgEpfLP, JYrNBKAXOAGXQdRXv: YCOHr, AchlAMATWSAjFLpFNW; HHAGlHAEVACJ, JDGmEhlTXdaQ, pfLLAAIKi BRAK. EtlAeRGvOAYART: HjBPxBSDAmN, JlCQdCKOQQhHv, shHsRADz, 'jxBpADlJANT, CVARiAK'AHiFPPIKi!' EAlTnTohBVYHsDADz, AcDEhlBTAsOOAYAPn: YELCedRAsQ, gzZp, CJlVIAFBKBURApgUVNtkAHASGCSDLSSAI.

Codebook sample for sort coding:

b the
c and
d to
e a
f of
g it
h she
i i
j you
k said
l in
m alice
n was
o that
fw tell
fx hare
fy moment
fz put
ga use
gb d
cmi wretched
cmj yawned
cmk yawning
cml year
cmm years

Codebook sample for Huffman coding:

AA “y”
ABA “too”
ABB “tongue”
ABC “tm”
ABD “tiny ”
AF “t ”
AG “her ”
AH “s ”
AI “f”
AJ “had ”
AK “it”
AL “Alice”
AM “all ”
Azx “Knave”
Azy “Keep ”
Azz “J”
BA “S”
BB “out ”
BC “he ”
BDA “Its ”
BDD “Hush”
BDE “Hart”
BI “what ”
BJ “you”
BK “up ”
BL “Project ”
BM “were ”
BN “if ”
BO “one ”
BP “went ”
BQ “have ”
BR “down ”
BS “like ”
BT “no ”
I “i”
J “and ”
K “n”
L “r”
M “s”
N “a”
O “to ”
P “t”
uy “tree ”
uz “tree”
v “I”
w “that ”
x “Alice ”
y “at ”
z “with ”


First, let’s clarify how the input text is interpreted. We assume that it is in plain text format (no markup, HTML, etc.). We define the input text to be a sequence of tokens, where each token is either a single symbol (such as a space, punctuation mark, digit, or newline) or a word (a maximal run of consecutive uppercase/lowercase letters). We also disallow two word tokens from being adjacent (e.g. we can’t treat tome as the two consecutive words to me with no symbols in between).
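A minimal tokenizer matching this model fits in a single regular expression. This is an illustrative Python sketch; the actual implementation used for the experiments may differ:

```python
import re

# A word is a maximal run of letters; every other character is its own
# single-symbol token.  Because words are maximal runs, two word tokens
# can never be adjacent.
TOKEN_RE = re.compile(r"[A-Za-z]+|[^A-Za-z]")

def tokenize(text):
    return TOKEN_RE.findall(text)
```

For example, `tokenize("to me")` yields `["to", " ", "me"]`, and concatenating the tokens always reconstructs the original text exactly.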

Sort coding

We start by illustrating a simpler algorithm before considering Huffman coding. If we encode each input word into some other word, then word boundaries are preserved by the symbols and each encoded word is still easily decodable. In particular, we don’t need to make use of the prefix decodability property of Huffman codes.

Therefore, we can devise a simple coding scheme based on sorting:

  1. Count the number of occurrences of each word in the text.

  2. Assign a codeword to each unique word, starting with the most frequent word and working downward in frequency. The most frequent word is given the codeword a, the next word is given b, next c, ..., z, aa, ab, ..., az, ba, bb, ..., zz, baa, etc.

  3. Write the word-codeword mapping to the output. (Otherwise decoding is impossible.)

  4. Iterate through the input tokens, mapping word tokens to their codeword and leaving symbol tokens unchanged, and write the token strings to the output file.
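The four steps above can be sketched in Python as follows. This illustrative version lowercases every word, skips writing the codebook of step 3, and ignores the additional details (such as the escape prefix visible in the samples); it uses plain base-26 enumeration for codewords:

```python
import itertools
import re
import string
from collections import Counter

def codewords():
    # Yield candidate codewords in order of increasing length:
    # a, b, ..., z, aa, ab, ..., zz, aaa, ...
    for length in itertools.count(1):
        for letters in itertools.product(string.ascii_lowercase, repeat=length):
            yield "".join(letters)

def sort_code(text):
    # Tokenize into words and single-character symbols.
    tokens = re.findall(r"[A-Za-z]+|[^A-Za-z]", text)
    # Step 1: count the occurrences of each word.
    freq = Counter(t.lower() for t in tokens if t.isalpha())
    # Step 2: the most frequent word gets the shortest codeword.
    codebook = dict(zip((w for w, _ in freq.most_common()), codewords()))
    # Step 4: map words to codewords; symbol tokens pass through unchanged.
    return "".join(codebook[t.lower()] if t.isalpha() else t for t in tokens)
```

For instance, `sort_code("the cat and the hat")` returns `"a b c a d"`: the word the occurs twice, so it receives the shortest codeword a.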

However, there are a few additional details I chose to implement:

Huffman coding

Now we come to Huffman coding, the main part of the experiment. We first observe that sort coding is essentially optimal; we need to change the operational model before Huffman coding becomes useful. In particular, we want to take advantage of the prefix-free property – in a Huffman-coded text, we don’t need spaces between words because the codewords are self-delimiting!

Therefore, let’s consider a word followed by a space to be a single word for the purposes of encoding. This is the most common word+symbol combination; others like word+comma or word+apostrophe exist, but we’ll disregard them for simplicity. Note that word+space is treated as a different word from the bare word itself. For example, assume the codebook is {“the” = xa, “the ” = k, “quick” = oql}. Then we can encode the text “the quick” as either xa oql (the, space, quick) or koql (the+space, quick).
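This example can be checked mechanically. Here is a tiny sketch using that hypothetical three-entry codebook, with symbol tokens passed through as literals:

```python
# Hypothetical codebook from the example above.  No codeword is a prefix
# of another, so greedy prefix matching decodes unambiguously.
CODEBOOK = {"the ": "k", "the": "xa", "quick": "oql"}

def encode(tokens):
    # Words are replaced by codewords; other tokens (symbols) pass through.
    return "".join(CODEBOOK.get(t, t) for t in tokens)

def decode(text):
    out = []
    while text:
        for word, cw in CODEBOOK.items():
            if text.startswith(cw):
                out.append(word)
                text = text[len(cw):]
                break
        else:  # no codeword matched: a literal symbol
            out.append(text[0])
            text = text[1:]
    return "".join(out)
```

Both encodings decode back to the same string: `decode("koql")` and `decode("xa oql")` each give `"the quick"`, but the word+space form is one character shorter.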

The encoding procedure looks like this:

  1. Count the number of occurrences of each word in the text. We consider a word followed by space to be a single word.

  2. Construct a Huffman code tree for the set of words and frequencies.

  3. Write the word-codeword mapping to the output. (Otherwise decoding is impossible.)

  4. Encode the input text tokens into tokens for the output text. Note that some of the space tokens in the input will collapse into the preceding word.

Note that symbols other than space are not Huffman-coded at all. This is a deliberate design decision so that the output text remains somewhat readable and can be meaningfully compared to the input text.
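Because the codewords here are strings of letters rather than bits, step 2 amounts to building an n-ary Huffman tree whose branches are labeled with code letters. A generic sketch follows; this is illustrative only, and details such as the exact digit alphabet and tie-breaking in the real implementation are not specified here:

```python
import heapq
import itertools

def huffman_code(freqs, digits="01"):
    """Build a prefix-free code over the given digit alphabet.

    freqs maps each symbol (here, a word or word+space) to its count.
    """
    n = len(digits)
    tiebreak = itertools.count()  # avoids comparing dicts on frequency ties
    heap = [(f, next(tiebreak), {sym: ""}) for sym, f in freqs.items()]
    heapq.heapify(heap)
    # Pad with zero-frequency dummies so every merge takes exactly n nodes.
    while len(heap) > 1 and (len(heap) - 1) % (n - 1) != 0:
        heapq.heappush(heap, (0, next(tiebreak), {}))
    while len(heap) > 1:
        total, merged = 0, {}
        for d in digits:
            if not heap:
                break
            f, _, code = heapq.heappop(heap)
            total += f
            # Prepend this branch's digit to every codeword in the subtree.
            for sym, suffix in code.items():
                merged[sym] = d + suffix
        heapq.heappush(heap, (total, next(tiebreak), merged))
    return heap[0][2]
```

With `digits="01"` this is ordinary binary Huffman coding; passing all 52 uppercase and lowercase letters as the digit alphabet yields letter-string codewords like those in the codebook sample above.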

Again, I chose to make some modifications to the basic algorithm to improve the efficiency of the coding, at a cost in complexity:

Source code