Fast Whirlpool hash in x86 assembly

Implementing the Whirlpool hash function at all can be challenging. Implementing it efficiently is even more challenging. The design of Whirlpool is very similar to that of the AES cipher, involving byte-wise operations, an S-box, linear algebra, and Galois field arithmetic. So unlike cryptographic functions like MD5 that have a straightforward description, it takes some additional mathematical knowledge to understand how to implement Whirlpool.

Optimizing Whirlpool requires familiarity with the basic algorithm, so in the interest of brevity I will assume that you have read and understood the Whirlpool specification paper. (If you haven’t, you should read about AES first, because the tutorials for AES are much friendlier. Then you’d adapt this knowledge in order to understand Whirlpool.)

The key to optimizing Whirlpool is to operate on not one byte at a time, but an entire 8-byte row. (Similarly, AES benefits from the same optimization as applied to 4-byte columns.) The three operations SubBytes, ShiftColumns, and MixRows can be combined into a single operation that loads a byte from the appropriately shifted location and XORs the appropriate row with a magic constant that fuses the effects of SubBytes and MixRows.

Source code

The code comes in a number of parts:

Three interchangeable versions of the Whirlpool compression function:
- In x86 assembly, optimized using MMX and SSE.
- In C, a naive byte-wise reference implementation.
- In C, a better int64-based implementation.
The full Whirlpool hash function (in C), which handles the block splitting and tail padding. This is not CPU-intensive, because its job is only to set up the appropriate data for the compression function to process.
A runnable main program that checks correctness and performs a speed benchmark.

Files:

To use this code, compile it on Linux with one of these commands:

cc whirlpool-test.c whirlpool-byte.c -o whirlpool-test (x86, x86-64)
cc whirlpool-test.c whirlpool-int64.c -o whirlpool-test (x86, x86-64)
cc whirlpool-test.c whirlpool-x86.S -o whirlpool-test (x86 only)
cc whirlpool-test.c whirlpool-x8664.S -o whirlpool-test (x86-64 only)

Then run the executable with ./whirlpool-test.

Benchmark results

Code	Compilation	Speed on x86	Speed on x86-64
C (byte)	GCC -O0	0.76 MiB/s	0.76 MiB/s
C (byte)	GCC -O1	2.78 MiB/s	2.40 MiB/s
C (byte)	GCC -O2	2.93 MiB/s	2.44 MiB/s
C (byte)	GCC -O3	5.58 MiB/s	8.10 MiB/s
C (byte)	GCC -Os	1.49 MiB/s	2.30 MiB/s
C (byte)	GCC -O3 -fomit-frame-pointer	5.87 MiB/s
C (int64)	GCC -O0	6.1 MiB/s	11.3 MiB/s
C (int64)	GCC -O1	13.7 MiB/s	44.5 MiB/s
C (int64)	GCC -O2	17.0 MiB/s	46.6 MiB/s
C (int64)	GCC -O3	18.0 MiB/s	47.3 MiB/s
C (int64)	GCC -Os	16.6 MiB/s	46.6 MiB/s
C (int64)	GCC -O3 -fomit-frame-pointer	17.0 MiB/s
Assembly	GCC -O0	59.4 MiB/s	63.1 MiB/s
Assembly	GCC -O1	59.6 MiB/s	63.1 MiB/s
Assembly	GCC -O2	59.7 MiB/s	62.8 MiB/s
Assembly	GCC -O3	59.0 MiB/s	62.8 MiB/s
Assembly	GCC -Os	59.0 MiB/s	62.7 MiB/s

On x86, my assembly code is 3.32× as fast as my C code best compiled by GCC. On x86-64, my assembly code is 1.33× as fast as my C code best compiled by GCC.

All the benchmark results above are based on: CPU = Intel Core 2 Quad Q6600 2.40 GHz (single-threaded), OS = Ubuntu 10.04 (32-bit and 64-bit), compiler = GCC 4.4.3.

Notes

The x86 code uses all eight 64-bit MMX and all eight 128-bit XMM registers, but only 5 general-purpose registers (GPRs).
The x86-64 code changes the use of MMX registers to GPRs r8 to r15, thus reducing conceptual complexity.
PEXTRB is a more suitable instruction to use, but it’s fairly new, being introduced in SSE 4.1. So I emulated it using the older PEXTRW instruction with pre- and post-adjustments.
AddRoundKey is performed on 128 bits per operation, which is a nice benefit of using the XMM registers. All the other Whirlpool operations can only be done 64 bits at a time.
While the software implementation of Whirlpool is much slower than (e.g.) SHA-1, Whirlpool will see a much larger parallelization speedup when implemented in logic gates and circuits.

Project Nayuki

Fast Whirlpool hash in x86 assembly

Source code

Benchmark results

Notes

More info