Project Nayuki

Fast MD5 hash implementation in x86 assembly

For the fun of experimentation, I wanted to see how much I could optimize my x86 MD5 hash implementation for speed. I started with a fairly straightforward naive implementation, then reordered instructions and made equivalent logical transformations. Each successful optimization trick added a few MiB/s of speed, but after trying almost a hundred tweaks (of which about 20 succeeded), the overall result was a staggering 59% increase in speed.

Source code

The code comes in a number of parts:

  • Three interchangeable versions of the MD5 compression function:
    • In x86 assembly, naively (md5-naive.S).
    • In x86 assembly, after extensive optimization (md5-fast.S).
    • In C, for comparison with the x86 version (md5.c).
  • The full MD5 hash function (in C), which handles the block splitting and tail padding. This is not CPU-intensive, because its job is only to set up the appropriate data for the compression function to process.
  • A runnable main program that checks correctness and performs a speed benchmark.


To use this code, compile it on Linux with one of these commands:

  • gcc md5test.c md5.c -o md5test (x86, x86-64)
  • gcc md5test.c md5-naive.S -o md5test (x86 only)
  • gcc md5test.c md5-fast.S -o md5test (x86 only)
  • gcc md5test.c md5-fast-64.S -o md5test (x86-64 only)

Then run the executable with ./md5test.

Benchmark results

Code Compilation Speed on x86 Speed on x86-64
CGCC -O0122 MiB/s123 MiB/s
CGCC -O1379 MiB/s390 MiB/s
CGCC -O2387 MiB/s389 MiB/s
CGCC -O3387 MiB/s389 MiB/s
CGCC -O1 -fomit-frame-pointer382 MiB/s
CGCC -O2 -fomit-frame-pointer389 MiB/s
CGCC -O3 -fomit-frame-pointer390 MiB/s
Assembly (naive)GCC -O0270 MiB/s
Assembly (fast)GCC -O1430 MiB/s
Assembly (fast)GCC -O2427 MiB/s
Assembly (OpenSSL[0])GCC -O0410 MiB/s

On both CPU architectures, my assembly code is about 1.10× as fast as my C code best compiled by GCC. Moreover, the C code and assembly code compiled with the various options have the same speed on both architectures.

All the benchmark results above are based on: CPU = Intel Core 2 Quad Q6600 2.40 GHz (single-threaded), OS = Ubuntu 10.04 (32-bit and 64-bit), compiler = GCC 4.4.3.


  • My initial naive code was already loop-unrolled. Since my goal was to optimize for speed, it wouldn’t make sense to start with an implementation that still had loop management overhead.

  • I was surprised to see that my naive x86 code ran much slower than GCC’s generated code, and that my final hand-optimized code was not too much faster than GCC’s best code.

  • Sometimes adding extra instructions and doing more arithmetic computations results in faster code. I think this worked because I made the data dependency graph shallower and took advantage of the superscalar processor.

  • When there are multiple instruction streams doing independent computations, trying the different ways of interleaving them is very important in extracting performance from the superscalar processor.

  • Abusing the LEA instruction to add two registers and a constant is a classic optimization trick with predictable improvement.

  • I had just enough registers to store all the relevant values (4 for the MD5 state, 2 for intermediate computations per round, 1 for the base of the block’s array, and 1 for the stack), which definitely made the code shorter and faster. However, I didn’t use the EBP register in the conventional way.

  • Both my x86 and C implementations of MD5 use C preprocessor macros to generate the repetitive code.

  • x86-64 code: All the C files work correctly without modification on x86-64. I made minimal changes to the assembly code only to adapt to the calling convention and change the 32-bit constants to signed numbers. The usage instructions are exactly the same.

  • MD5 is an old hash algorithm with severe cryptanalytic attacks, and it should not be used anymore for any serious purpose. Please use SHA-1/256/512/3 instead.

  • [0]: OpenSSL 1.0.1c, generating md5-586.s on Linux.

More info