Fast SHA-2 hashes in x86 assembly

Using what I learned from implementing other hash algorithms in x86 assembly, writing the SHA-2 algorithms (specifically SHA-256 and SHA-512) in x86 and x86-64 assembly languages was very straightforward and involved a predictable amount of effort.

Source code

Files:

sha224-test.c and sha256-test.c require a SHA-256 compression function, which is provided by either sha256.c or sha256-x86.S or sha256-x8664.S. Similarly, sha384-test.c and sha512-test.c require a SHA-512 compression function, which is provided by either sha512.c or sha512-x86.S or sha512-x8664.S.

To use this code, compile it on Linux with one of these commands:

cc sha224-test.c sha256.c -o sha224-test (any platform)
cc sha224-test.c sha256-x86.S -o sha224-test (x86)
cc sha224-test.c sha256-x8664.S -o sha224-test (x86-64)
cc sha256-test.c sha256.c -o sha256-test (any platform)
cc sha256-test.c sha256-x86.S -o sha256-test (x86)
cc sha256-test.c sha256-x8664.S -o sha256-test (x86-64)
cc sha384-test.c sha512.c -o sha384-test (any platform)
cc sha384-test.c sha512-x86.S -o sha384-test (x86)
cc sha384-test.c sha512-x8664.S -o sha384-test (x86-64)
cc sha512-test.c sha512.c -o sha512-test (any platform)
cc sha512-test.c sha512-x86.S -o sha512-test (x86)
cc sha512-test.c sha512-x8664.S -o sha512-test (x86-64)

Then run the executable with ./sha224-test, ./sha256-test, ./sha384-test, or ./sha512-test.

Benchmark results

SHA-256 (also SHA-224)

Code	Compilation	Speed on x86	Speed on x86-64
C	GCC -O0	72.4 MiB/s	71.7 MiB/s
C	GCC -O1	97.4 MiB/s	118.4 MiB/s
C	GCC -O2	94.4 MiB/s	115.9 MiB/s
C	GCC -O3	94.4 MiB/s	115.7 MiB/s
C	GCC -Os	97.4 MiB/s	116.4 MiB/s
C	GCC -O1 -march=native	95.4 MiB/s	118.5 MiB/s
C	GCC -O2 -march=native	94.4 MiB/s	116.0 MiB/s
C	GCC -O3 -march=native	94.4 MiB/s	116.1 MiB/s
C	GCC -Os -march=native	97.9 MiB/s	114.7 MiB/s
C	GCC -O1 -fomit-frame-pointer	101.7 MiB/s
C	GCC -O2 -fomit-frame-pointer	99.8 MiB/s
C	GCC -O3 -fomit-frame-pointer	99.8 MiB/s
C	GCC -Os -fomit-frame-pointer	98.4 MiB/s
C	GCC -O1 -fomit-frame-pointer -march=native	99.5 MiB/s
C	GCC -O2 -fomit-frame-pointer -march=native	97.4 MiB/s
C	GCC -O3 -fomit-frame-pointer -march=native	97.4 MiB/s
C	GCC -Os -fomit-frame-pointer -march=native	99.5 MiB/s
Assembly	GCC -O0	112.7 MiB/s	132.9 MiB/s
Assembly	GCC -O1	112.3 MiB/s	133.3 MiB/s
Assembly	GCC -Os	113.0 MiB/s	133.0 MiB/s

On x86, my code is 1.11× as fast as my C code best compiled by GCC. On x86-64, my code is 1.12× as fast as my C code best compiled by GCC.

SHA-512 (also SHA-384)

Code	Compilation	Speed on x86	Speed on x86-64
C	GCC -O0	14.9 MiB/s	107.4 MiB/s
C	GCC -O1	26.2 MiB/s	173.6 MiB/s
C	GCC -O2	26.3 MiB/s	172.5 MiB/s
C	GCC -O3	26.4 MiB/s	172.6 MiB/s
C	GCC -Os	26.7 MiB/s	171.6 MiB/s
C	GCC -O1 -march=native	26.3 MiB/s	172.1 MiB/s
C	GCC -O2 -march=native	26.1 MiB/s	171.1 MiB/s
C	GCC -O3 -march=native	26.1 MiB/s	170.9 MiB/s
C	GCC -Os -march=native	26.6 MiB/s	170.6 MiB/s
C	GCC -O1 -fomit-frame-pointer	24.8 MiB/s
C	GCC -O2 -fomit-frame-pointer	24.2 MiB/s
C	GCC -O3 -fomit-frame-pointer	24.2 MiB/s
C	GCC -Os -fomit-frame-pointer	24.2 MiB/s
C	GCC -O1 -fomit-frame-pointer -march=native	24.8 MiB/s
C	GCC -O2 -fomit-frame-pointer -march=native	24.9 MiB/s
C	GCC -O3 -fomit-frame-pointer -march=native	24.9 MiB/s
C	GCC -Os -fomit-frame-pointer -march=native	24.2 MiB/s
Assembly	GCC -O0	114.8 MiB/s	206.5 MiB/s
Assembly	GCC -O1	114.4 MiB/s	205.5 MiB/s
Assembly	GCC -Os	115.2 MiB/s	207.6 MiB/s

On x86, my code is 4.31× as fast as my C code best compiled by GCC. On x86-64, my code is 1.20× as fast as my C code best compiled by GCC.

Overall, the wins from hand-writing assembly code are small for SHA-256 and for x86-64, but the gain is huge for SHA-512 on x86.

All the benchmark results above are based on: CPU = Intel Core 2 Quad Q6600 2.40 GHz (single-threaded), OS = Ubuntu 10.04 (32-bit and 64-bit), compiler = GCC 4.4.3.

Remarks

Regarding the x86-64 code:

Overall, the task of implementing these hash functions in x86-64 was quite easy for the reasons described below. SHA-256 (x86-64) was written first, and the other 3 assembly language programs were derived by tweaking this. Compare this to the extra effort spent in the x86 version.
Both hash functions have nearly identical structures as per the specification (except the bit width, shift/rotation amounts, and number of rounds). Because x86-64 has instructions for operating on 32-bit values and on 64-bit values, the SHA-256 code was easily adapted to SHA-512 by widening the registers and operations.
There is only one subtle difference between the SHA-256 code and SHA-512 code. To add the round constant, SHA-256 simply uses a 32-bit immediate displacement in an LEA instruction. But for the 64-bit round constants in SHA-512, they can’t be used in a displacement, and can only be used as an operand of MOV. So this introduced an extra step of MOVing the constant into a temporary register followed by performing an ADD.
Register allocation was a breeze and there was no pressure whatsoever. 8 registers were used for the state variables, 4 for temporary arithmetic, 2 for argument array addresses, 1 for the stack, and 1 spare.

Regarding the x86 code:

For SHA-256, there were barely enough registers to hold the temporary results of calculations and to cache some values loaded from memory (to avoid loading more than once), so all 8 state variables had to be stored on the stack. Other than that, the SHA-256 code has no registers spills. The code is essentially the same as the x86-64 version, except with a different calling convention and with the state variables being stored on the stack.
For SHA-512, the code was rewritten to use 64-bit MMX registers and SSE2 64-bit integer operations such as PADDQ and PSLLQ, as well as occasionally using XMM registers and 128-bit operations. It uses very few general-purpose registers, eliminating the need to touch callee-save registers.

The earlier published version of the code used only 32-bit operations on general-purpose integer registers and expanded the entire message schedule ahead of the rounds (instead of generating the schedule on the fly in a circular buffer). As a result, it ran at about 60% of the speed of the current code. The 64-bit variables were loaded and stored as two 32-bit words, and the 64-bit arithmetic had to be emulated using multiple 32-bit operations. The bitwise operations AND, OR, and XOR were easy to implement, and only slightly harder were addition, bit shift, and bit rotation. The ADC, SHLD, and SHRD instructions were extremely convenient for this emulation, and I’m especially thankful for SHLD and SHRD being included in the instruction set.