Project Nayuki


Fast SHA-1 hash implementation in x86 assembly

After doing something for the first time, doing it again is much easier. The MD5 and SHA-1 hash algorithms are quite similar in structure, and having written MD5 in x86 assembly recently, applying that knowledge to implement SHA-1 in x86 was a breeze.

Source code

The code comes in a number of parts:

  • Four interchangeable versions of the SHA-1 compression function:
    • In x86 assembly, naively (sha1-naive-x86.S).
    • In x86 assembly, with reordered key schedule computation and instruction-level optimization (sha1-fast-x86.S).
    • In C, naively in the manner described in the official specification.
    • In C, with the key schedule computation interleaved with the hashing rounds (much like sha1-fast-x86.S).
  • The full SHA-1 hash function (in C), which handles the block splitting and tail padding. This is not CPU-intensive, because its job is only to set up the appropriate data for the compression function to process.
  • A runnable main program that checks correctness and performs a speed benchmark.

Files:

To use this code, compile it on Linux with one of these commands:

  • cc sha1-test.c sha1-naive.c -o sha1-test (x86, x86-64)
  • cc sha1-test.c sha1-fast.c -o sha1-test (x86, x86-64)
  • cc sha1-test.c sha1-naive-x86.S -o sha1-test (x86 only)
  • cc sha1-test.c sha1-fast-x86.S -o sha1-test (x86 only)
  • cc sha1-test.c sha1-fast-x8664.S -o sha1-test (x86-64 only)

Then run the executable with ./sha1-test.

Benchmark results

Code Compilation Speed on x86 Speed on x86-64
C (naive)GCC -O098 MiB/s97 MiB/s
C (naive)GCC -O1184 MiB/s212 MiB/s
C (naive)GCC -O2178 MiB/s198 MiB/s
C (naive)GCC -O3177 MiB/s199 MiB/s
C (naive)GCC -O1 -fomit-frame-pointer206 MiB/s
C (naive)GCC -O2 -fomit-frame-pointer191 MiB/s
C (naive)GCC -O3 -fomit-frame-pointer191 MiB/s
C (fast)GCC -O0111 MiB/s111 MiB/s
C (fast)GCC -O1253 MiB/s292 MiB/s
C (fast)GCC -O2137 MiB/s191 MiB/s
C (fast)GCC -O3137 MiB/s191 MiB/s
C (fast)GCC -O1 -fomit-frame-pointer269 MiB/s
C (fast)GCC -O2 -fomit-frame-pointer182 MiB/s
C (fast)GCC -O3 -fomit-frame-pointer182 MiB/s
Assembly (naive)GCC -O1270 MiB/s
Assembly (fast)GCC -O0313 MiB/s
Assembly (fast)GCC -O1327 MiB/s
Assembly (OpenSSL[0])GCC -O0305 MiB/s

On x86, my assembly code is 1.22× as fast as my C code best compiled by GCC. On x86-64, my assembly code is 1.07× as fast as my C code best compiled by GCC.

All the benchmark results above are based on: CPU = Intel Core 2 Quad Q6600 2.40 GHz (single-threaded), OS = Ubuntu 10.04 (32-bit and 64-bit), compiler = GCC 4.4.3.

Notes

  • In both MD5 and SHA-1 (and some other hashes), the round function computes (b & c) | (~b & d). There are different ways to compute this that give the same result, but the most speed-optimized way is different for MD5 compared to SHA-1.

  • Once again by great luck, there are just enough registers to keep all the state values, temporary values, and data pointers in registers without spilling.

  • Because of the key schedule computation, SHA-1 does some more computation and many more load/store operations than MD5, yet it doesn’t run much slower than MD5.

  • Going beyond GCC -O1 may actually hurt performance. Speed optimization involves a surprising amount of voodoo.

  • All the C files work correctly without modification on x86-64. I made minimal changes to the assembly code only to adapt to the calling convention and change the 32-bit constants to signed numbers. The usage instructions are exactly the same.

  • [0]: OpenSSL 1.0.1c, generating sha1-586.s on Linux.

More info