Fast MD5 hash implementation in x86 assembly
For the fun of experimentation, I wanted to see how much I could optimize my x86 MD5 hash implementation for speed. I started with a fairly straightforward naive implementation, then reordered instructions and made equivalent logical transformations. Each successful optimization trick added a few MiB/s of speed, but after trying almost a hundred tweaks (of which about 20 succeeded), the overall result was a staggering 59% increase in speed.
Source code
The code comes in a number of parts:
Three interchangeable versions of the MD5 compression function:
- In x86 assembly, naively (md5-naive-x86.S).
- In x86 assembly, after extensive optimization (md5-fast-x86.S).
- In C, for comparison with the x86 version (md5.c).
The full MD5 hash function (in C), which handles the block splitting and tail padding. This is not CPU-intensive, because its job is only to set up the appropriate data for the compression function to process.
A runnable main program that checks correctness and performs a speed benchmark.
Files:
To use this code, compile it on Linux with one of these commands:
cc md5-test.c md5.c -o md5-test
(x86, x86-64)cc md5-test.c md5-naive-x86.S -o md5-test
(x86 only)cc md5-test.c md5-fast-x86.S -o md5-test
(x86 only)cc md5-test.c md5-fast-x8664.S -o md5-test
(x86-64 only)
Then run the executable with ./md5-test
.
Benchmark results
Code | Compilation | Speed on x86 | Speed on x86-64 |
---|---|---|---|
C | GCC -O0 | 122 MiB/s | 123 MiB/s |
C | GCC -O1 | 379 MiB/s | 390 MiB/s |
C | GCC -O2 | 387 MiB/s | 389 MiB/s |
C | GCC -O3 | 387 MiB/s | 389 MiB/s |
C | GCC -O1 -fomit-frame-pointer | 382 MiB/s | |
C | GCC -O2 -fomit-frame-pointer | 389 MiB/s | |
C | GCC -O3 -fomit-frame-pointer | 390 MiB/s | |
Assembly (naive) | GCC -O0 | 270 MiB/s | |
Assembly (fast) | GCC -O1 | 430 MiB/s | |
Assembly (fast) | GCC -O2 | 427 MiB/s | |
Assembly (OpenSSL[0]) | GCC -O0 | 410 MiB/s |
On both CPU architectures, my assembly code is about 1.10× as fast as my C code best compiled by GCC. Moreover, the C code and assembly code compiled with the various options have the same speed on both architectures.
All the benchmark results above are based on: CPU = Intel Core 2 Quad Q6600 2.40 GHz (single-threaded), OS = Ubuntu 10.04 (32-bit and 64-bit), compiler = GCC 4.4.3.
Notes
My initial naive code was already loop-unrolled. Since my goal was to optimize for speed, it wouldn’t make sense to start with an implementation that still had loop management overhead.
I was surprised to see that my naive x86 code ran much slower than GCC’s generated code, and that my final hand-optimized code was not too much faster than GCC’s best code.
Sometimes adding extra instructions and doing more arithmetic computations results in faster code. I think this worked because I made the data dependency graph shallower and took advantage of the superscalar processor.
When there are multiple instruction streams doing independent computations, trying the different ways of interleaving them is very important in extracting performance from the superscalar processor.
Abusing the LEA instruction to add two registers and a constant is a classic optimization trick with predictable improvement.
I had just enough registers to store all the relevant values (4 for the MD5 state, 2 for intermediate computations per round, 1 for the base of the block’s array, and 1 for the stack), which definitely made the code shorter and faster. However, I didn’t use the EBP register in the conventional way.
Both my x86 and C implementations of MD5 use C preprocessor macros to generate the repetitive code.
x86-64 code: All the C files work correctly without modification on x86-64. I made minimal changes to the assembly code only to adapt to the calling convention and change the 32-bit constants to signed numbers. The usage instructions are exactly the same.
MD5 is an old hash algorithm with severe cryptanalytic attacks, and it should not be used anymore for any serious purpose. Please use SHA-{256,512,512/256,3} instead.
[0]: OpenSSL 1.0.1c, generating md5-586.s on Linux.