Fast MD5 hash implementation in x86 assembly

For the fun of experimentation, I wanted to see how much I could optimize my x86 MD5 hash implementation for speed. I started with a fairly straightforward, naive implementation, then reordered instructions and made equivalent logical transformations. Each successful optimization trick added only a few MiB/s of speed, but after trying almost a hundred tweaks (of which about 20 succeeded), the overall result was a staggering 59% increase in speed.
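To give a flavor of what an equivalent logical transformation looks like, here is a small C sketch using two of the MD5 round functions as defined in RFC 1321. The rewritten forms are widely known identities, and it is an assumption that they resemble any of the tweaks actually applied in this project. The function names (f_textbook, f_rewritten, and so on) are made up for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Round function F as written in RFC 1321: AND, NOT, AND, OR (4 operations). */
uint32_t f_textbook(uint32_t x, uint32_t y, uint32_t z) {
    return (x & y) | (~x & z);
}

/* Same truth table using XOR, AND, XOR (3 operations). */
uint32_t f_rewritten(uint32_t x, uint32_t y, uint32_t z) {
    return z ^ (x & (y ^ z));
}

/* Round function G as written in RFC 1321. */
uint32_t g_textbook(uint32_t x, uint32_t y, uint32_t z) {
    return (x & z) | (y & ~z);
}

/* Same truth table with one operation fewer. */
uint32_t g_rewritten(uint32_t x, uint32_t y, uint32_t z) {
    return y ^ (z & (x ^ y));
}

/* All operations are bitwise, so checking the 8 combinations of
   all-zero and all-one words covers every possible bit pattern. */
void check_equivalence(void) {
    const uint32_t v[2] = {0x00000000u, 0xFFFFFFFFu};
    for (int i = 0; i < 2; i++)
        for (int j = 0; j < 2; j++)
            for (int k = 0; k < 2; k++) {
                assert(f_textbook(v[i], v[j], v[k]) == f_rewritten(v[i], v[j], v[k]));
                assert(g_textbook(v[i], v[j], v[k]) == g_rewritten(v[i], v[j], v[k]));
            }
}
```

Whether a shorter expression is actually faster depends on how the instructions are scheduled on a given CPU, which is why each tweak has to be benchmarked rather than assumed to help.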

Source code

The code comes in a number of parts:


To use this code, compile it on Linux with one of these commands:

Then run the executable with ./md5test.

Benchmark results

Code                  Compilation                     Speed on x86   Speed on x86-64
C                     GCC -O0                         122 MiB/s      123 MiB/s
C                     GCC -O1                         379 MiB/s      390 MiB/s
C                     GCC -O2                         387 MiB/s      389 MiB/s
C                     GCC -O3                         387 MiB/s      389 MiB/s
C                     GCC -O1 -fomit-frame-pointer    382 MiB/s
C                     GCC -O2 -fomit-frame-pointer    389 MiB/s
C                     GCC -O3 -fomit-frame-pointer    390 MiB/s
Assembly (naive)      GCC -O0                         270 MiB/s
Assembly (fast)       GCC -O1                         430 MiB/s
Assembly (fast)       GCC -O2                         427 MiB/s
Assembly (OpenSSL[0]) GCC -O0                         410 MiB/s

On both CPU architectures, my assembly code is about 1.10× as fast as my C code at its best GCC settings. Moreover, for any given set of compiler options, the C code and the assembly code each run at nearly the same speed on both architectures.

All the benchmark results above are based on: CPU = Intel Core 2 Quad Q6600 2.40 GHz (single-threaded), OS = Ubuntu 10.04 (32-bit and 64-bit), compiler = GCC 4.4.3.
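For readers who want to reproduce figures of this kind, the following C sketch shows one way a throughput number in MiB/s could be measured. It is an illustration under stated assumptions, not the benchmark code behind the results above: the compress_fn signature, the stub_compress placeholder, and the helper names are all hypothetical, and the actual md5test program may measure things differently.

```c
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical signature: consume one 64-byte block and update the
   four-word hash state in place. */
typedef void (*compress_fn)(uint32_t state[4], const uint8_t block[64]);

/* Placeholder standing in for a real MD5 compression function.
   It is NOT MD5; it only gives the timing loop something to call. */
void stub_compress(uint32_t state[4], const uint8_t block[64]) {
    for (int i = 0; i < 64; i++)
        state[i % 4] = (state[i % 4] << 1 | state[i % 4] >> 31) ^ block[i];
}

/* Convert a byte count and elapsed time to MiB/s (1 MiB = 1048576 bytes). */
double mib_per_second(double bytes, double seconds) {
    return bytes / seconds / (1024.0 * 1024.0);
}

/* Time `iterations` calls on a single thread using CPU time.
   Returns 0.0 if the run was too short for clock() to resolve. */
double benchmark(compress_fn fn, size_t iterations) {
    /* Standard MD5 initial state from RFC 1321. */
    uint32_t state[4] = {0x67452301u, 0xEFCDAB89u, 0x98BADCFEu, 0x10325476u};
    uint8_t block[64] = {0};

    clock_t start = clock();
    for (size_t i = 0; i < iterations; i++)
        fn(state, block);
    double seconds = (double)(clock() - start) / CLOCKS_PER_SEC;

    if (seconds <= 0.0)
        return 0.0;
    return mib_per_second(64.0 * (double)iterations, seconds);
}
```

A call such as benchmark(stub_compress, 10000000) would be replaced by a call with the real compression function. The iteration count should be large enough that the run lasts on the order of a second, since clock() has coarse resolution.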


