Project Nayuki

Fast Whirlpool hash in x86 assembly

Implementing the Whirlpool hash function at all can be challenging. Implementing it efficiently is even more challenging. The design of Whirlpool is very similar to that of the AES cipher, involving byte-wise operations, an S-box, linear algebra, and Galois field arithmetic. So unlike cryptographic functions like MD5 that have a straightforward description, it takes some additional mathematical knowledge to understand how to implement Whirlpool.

Optimizing Whirlpool requires familiarity with the basic algorithm, so in the interest of brevity I will assume that you have read and understood the Whirlpool specification paper. (If you haven’t, you should read about AES first, because the tutorials for AES are much friendlier. Then you’d adapt this knowledge in order to understand Whirlpool.)

The key to optimizing Whirlpool is to operate on not one byte at a time, but an entire 8-byte row. (Similarly, AES benefits from the same optimization as applied to 4-byte columns.) The three operations SubBytes, ShiftColumns, and MixRows can be combined into a single operation that loads a byte from the appropriately shifted location and XORs the appropriate row with a magic constant that fuses the effects of SubBytes and MixRows.

Source code

The code comes in a number of parts:


To use this code, compile it on Linux with one of these commands:

Then run the executable with ./whirlpool-test.

Benchmark results

Code Compilation Speed on x86 Speed on x86-64
C (byte)GCC -O00.76 MiB/s0.76 MiB/s
C (byte)GCC -O12.78 MiB/s2.40 MiB/s
C (byte)GCC -O22.93 MiB/s2.44 MiB/s
C (byte)GCC -O35.58 MiB/s8.10 MiB/s
C (byte)GCC -Os1.49 MiB/s2.30 MiB/s
C (byte)GCC -O3 -fomit-frame-pointer5.87 MiB/s
C (int64)GCC -O06.1 MiB/s11.3 MiB/s
C (int64)GCC -O113.7 MiB/s44.5 MiB/s
C (int64)GCC -O217.0 MiB/s46.6 MiB/s
C (int64)GCC -O318.0 MiB/s47.3 MiB/s
C (int64)GCC -Os16.6 MiB/s46.6 MiB/s
C (int64)GCC -O3 -fomit-frame-pointer17.0 MiB/s
AssemblyGCC -O059.4 MiB/s63.1 MiB/s
AssemblyGCC -O159.6 MiB/s63.1 MiB/s
AssemblyGCC -O259.7 MiB/s62.8 MiB/s
AssemblyGCC -O359.0 MiB/s62.8 MiB/s
AssemblyGCC -Os59.0 MiB/s62.7 MiB/s

On x86, my assembly code is 3.32× as fast as my C code best compiled by GCC. On x86-64, my assembly code is 1.33× as fast as my C code best compiled by GCC.

All the benchmark results above are based on: CPU = Intel Core 2 Quad Q6600 2.40 GHz (single-threaded), OS = Ubuntu 10.04 (32-bit and 64-bit), compiler = GCC 4.4.3.


More info