Project Nayuki


Fast SHA-2 hashes in x86 assembly

Using what I learned from implementing other hash algorithms in x86 assembly, writing the SHA-2 algorithms (specifically SHA-256 and SHA-512) in x86 and x86-64 assembly languages was very straightforward and involved a predictable amount of effort.

Source code

Files:

sha224test.c and sha256test.c require a SHA-256 compression function, which is provided by either sha256.c or sha256.S or sha256-64.S. Similarly, sha384test.c and sha512test.c require a SHA-512 compression function, which is provided by either sha512.c or sha512.S or sha512-64.S.
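To illustrate why both test programs can share one compression function, here is a minimal portable C sketch. The names and signatures (sha256_compress, sha256_family_hash, SHA256_INIT, SHA224_INIT) are assumptions made for this example and are not necessarily those used in the downloadable files. The constants come from FIPS 180-4. SHA-224 differs from SHA-256 only in its initial state and in keeping just the first 7 output words.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Round constants from FIPS 180-4. */
static const uint32_t K[64] = {
    0x428a2f98,0x71374491,0xb5c0fbcf,0xe9b5dba5,0x3956c25b,0x59f111f1,0x923f82a4,0xab1c5ed5,
    0xd807aa98,0x12835b01,0x243185be,0x550c7dc3,0x72be5d74,0x80deb1fe,0x9bdc06a7,0xc19bf174,
    0xe49b69c1,0xefbe4786,0x0fc19dc6,0x240ca1cc,0x2de92c6f,0x4a7484aa,0x5cb0a9dc,0x76f988da,
    0x983e5152,0xa831c66d,0xb00327c8,0xbf597fc7,0xc6e00bf3,0xd5a79147,0x06ca6351,0x14292967,
    0x27b70a85,0x2e1b2138,0x4d2c6dfc,0x53380d13,0x650a7354,0x766a0abb,0x81c2c92e,0x92722c85,
    0xa2bfe8a1,0xa81a664b,0xc24b8b70,0xc76c51a3,0xd192e819,0xd6990624,0xf40e3585,0x106aa070,
    0x19a4c116,0x1e376c08,0x2748774c,0x34b0bcb5,0x391c0cb3,0x4ed8aa4a,0x5b9cca4f,0x682e6ff3,
    0x748f82ee,0x78a5636f,0x84c87814,0x8cc70208,0x90befffa,0xa4506ceb,0xbef9a3f7,0xc67178f2
};

/* Initial hash states (hypothetical names for this sketch). */
static const uint32_t SHA256_INIT[8] = {
    0x6a09e667,0xbb67ae85,0x3c6ef372,0xa54ff53a,0x510e527f,0x9b05688c,0x1f83d9ab,0x5be0cd19
};
static const uint32_t SHA224_INIT[8] = {
    0xc1059ed8,0x367cd507,0x3070dd17,0xf70e5939,0xffc00b31,0x68581511,0x64f98fa7,0xbefa4fa4
};

static uint32_t rotr32(uint32_t x, int n) { return (x >> n) | (x << (32 - n)); }

/* Mixes one 64-byte block into the 8-word state. Assumed signature. */
void sha256_compress(uint32_t state[8], const uint8_t block[64]) {
    uint32_t w[64];
    for (int i = 0; i < 16; i++)  /* Load the block as big-endian words */
        w[i] = (uint32_t)block[i * 4] << 24 | (uint32_t)block[i * 4 + 1] << 16
             | (uint32_t)block[i * 4 + 2] << 8 | (uint32_t)block[i * 4 + 3];
    for (int i = 16; i < 64; i++) {  /* Expand the message schedule */
        uint32_t s0 = rotr32(w[i - 15], 7) ^ rotr32(w[i - 15], 18) ^ (w[i - 15] >> 3);
        uint32_t s1 = rotr32(w[i - 2], 17) ^ rotr32(w[i - 2], 19) ^ (w[i - 2] >> 10);
        w[i] = w[i - 16] + s0 + w[i - 7] + s1;
    }
    uint32_t a = state[0], b = state[1], c = state[2], d = state[3];
    uint32_t e = state[4], f = state[5], g = state[6], h = state[7];
    for (int i = 0; i < 64; i++) {  /* 64 rounds */
        uint32_t t1 = h + (rotr32(e, 6) ^ rotr32(e, 11) ^ rotr32(e, 25))
                    + ((e & f) ^ (~e & g)) + K[i] + w[i];
        uint32_t t2 = (rotr32(a, 2) ^ rotr32(a, 13) ^ rotr32(a, 22))
                    + ((a & b) ^ (a & c) ^ (b & c));
        h = g; g = f; f = e; e = d + t1;
        d = c; c = b; b = a; a = t1 + t2;
    }
    state[0] += a; state[1] += b; state[2] += c; state[3] += d;
    state[4] += e; state[5] += f; state[6] += g; state[7] += h;
}

/* Pads the message and runs the compression function over every block.
   Always writes 8 words; a SHA-224 caller keeps only out[0..6]. */
void sha256_family_hash(const uint32_t init[8], const uint8_t *msg, size_t len, uint32_t out[8]) {
    memcpy(out, init, 8 * sizeof(uint32_t));
    size_t off = 0;
    for (; len - off >= 64; off += 64)
        sha256_compress(out, msg + off);
    uint8_t block[64] = {0};
    size_t rem = len - off;
    memcpy(block, msg + off, rem);
    block[rem] = 0x80;  /* Mandatory 1 bit after the message */
    if (rem >= 56) {    /* No room left for the 8-byte length field */
        sha256_compress(out, block);
        memset(block, 0, sizeof block);
    }
    uint64_t bits = (uint64_t)len * 8;
    for (int i = 0; i < 8; i++)  /* Big-endian bit length in the last 8 bytes */
        block[63 - i] = (uint8_t)(bits >> (i * 8));
    sha256_compress(out, block);
}
```

Under these assumptions, swapping the C version of sha256_compress for an assembly version needs no change to the padding code, as long as both follow the same calling convention.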

To use this code, compile it on Linux with one of these commands:

Then run the executable with ./sha224test, ./sha256test, ./sha384test, or ./sha512test.

Benchmark results

SHA-256 (also SHA-224)

Code      Compilation                                 Speed on x86    Speed on x86-64
C         GCC -O0                                     72.4 MiB/s      71.7 MiB/s
C         GCC -O1                                     97.4 MiB/s      118.4 MiB/s
C         GCC -O2                                     94.4 MiB/s      115.9 MiB/s
C         GCC -O3                                     94.4 MiB/s      115.7 MiB/s
C         GCC -Os                                     97.4 MiB/s      116.4 MiB/s
C         GCC -O1 -march=native                       95.4 MiB/s      118.5 MiB/s
C         GCC -O2 -march=native                       94.4 MiB/s      116.0 MiB/s
C         GCC -O3 -march=native                       94.4 MiB/s      116.1 MiB/s
C         GCC -Os -march=native                       97.9 MiB/s      114.7 MiB/s
C         GCC -O1 -fomit-frame-pointer                101.7 MiB/s
C         GCC -O2 -fomit-frame-pointer                99.8 MiB/s
C         GCC -O3 -fomit-frame-pointer                99.8 MiB/s
C         GCC -Os -fomit-frame-pointer                98.4 MiB/s
C         GCC -O1 -fomit-frame-pointer -march=native  99.5 MiB/s
C         GCC -O2 -fomit-frame-pointer -march=native  97.4 MiB/s
C         GCC -O3 -fomit-frame-pointer -march=native  97.4 MiB/s
C         GCC -Os -fomit-frame-pointer -march=native  99.5 MiB/s
Assembly  GCC -O0                                     112.7 MiB/s     132.9 MiB/s
Assembly  GCC -O1                                     112.3 MiB/s     133.3 MiB/s
Assembly  GCC -Os                                     113.0 MiB/s     133.0 MiB/s

On x86, my assembly code is 1.11× as fast as my C code under its fastest GCC settings. On x86-64, my assembly code is 1.12× as fast as my C code under its fastest GCC settings.

SHA-512 (also SHA-384)

Code      Compilation                                 Speed on x86    Speed on x86-64
C         GCC -O0                                     14.9 MiB/s      107.4 MiB/s
C         GCC -O1                                     26.2 MiB/s      173.6 MiB/s
C         GCC -O2                                     26.3 MiB/s      172.5 MiB/s
C         GCC -O3                                     26.4 MiB/s      172.6 MiB/s
C         GCC -Os                                     26.7 MiB/s      171.6 MiB/s
C         GCC -O1 -march=native                       26.3 MiB/s      172.1 MiB/s
C         GCC -O2 -march=native                       26.1 MiB/s      171.1 MiB/s
C         GCC -O3 -march=native                       26.1 MiB/s      170.9 MiB/s
C         GCC -Os -march=native                       26.6 MiB/s      170.6 MiB/s
C         GCC -O1 -fomit-frame-pointer                24.8 MiB/s
C         GCC -O2 -fomit-frame-pointer                24.2 MiB/s
C         GCC -O3 -fomit-frame-pointer                24.2 MiB/s
C         GCC -Os -fomit-frame-pointer                24.2 MiB/s
C         GCC -O1 -fomit-frame-pointer -march=native  24.8 MiB/s
C         GCC -O2 -fomit-frame-pointer -march=native  24.9 MiB/s
C         GCC -O3 -fomit-frame-pointer -march=native  24.9 MiB/s
C         GCC -Os -fomit-frame-pointer -march=native  24.2 MiB/s
Assembly  GCC -O0                                     114.8 MiB/s     206.5 MiB/s
Assembly  GCC -O1                                     114.4 MiB/s     205.5 MiB/s
Assembly  GCC -Os                                     115.2 MiB/s     207.6 MiB/s

On x86, my assembly code is 4.31× as fast as my C code under its fastest GCC settings. On x86-64, my assembly code is 1.20× as fast as my C code under its fastest GCC settings.
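The speedup figures can be reproduced from the tables by dividing the fastest assembly result by the fastest C result in the same column. The short C sketch below does this for SHA-512 on x86. The function and array names are made up for this example, and taking the maximum over all rows is an assumption about how "best" is defined.

```c
#include <stddef.h>

/* Returns the largest value in speeds[0..n-1]; n must be at least 1. */
double best_speed(const double *speeds, size_t n) {
    double best = speeds[0];
    for (size_t i = 1; i < n; i++)
        if (speeds[i] > best)
            best = speeds[i];
    return best;
}

/* Ratio of the fastest assembly result to the fastest C result. */
double speedup(const double *c, size_t nc, const double *assembly, size_t na) {
    return best_speed(assembly, na) / best_speed(c, nc);
}

/* SHA-512 on x86, in MiB/s, copied from the table above. */
static const double SHA512_X86_C[] = {
    14.9, 26.2, 26.3, 26.4, 26.7, 26.3, 26.1, 26.1, 26.6,
    24.8, 24.2, 24.2, 24.2, 24.8, 24.9, 24.9, 24.2
};
static const double SHA512_X86_ASM[] = {114.8, 114.4, 115.2};
```

With these inputs, speedup(SHA512_X86_C, 17, SHA512_X86_ASM, 3) evaluates 115.2 / 26.7, which rounds to the 4.31× quoted above.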

Overall, the wins from hand-writing assembly code are small for SHA-256 and for x86-64, but the gain is huge for SHA-512 on x86.
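A likely contributor to the SHA-512 gap on x86 (an inference, not something measured here) is that SHA-512 operates on 64-bit words while 32-bit x86 has 32-bit general-purpose registers, so a C compiler has to synthesize every 64-bit rotation and addition from several 32-bit operations. The sketch below shows that decomposition in C. The type u64_halves and the function names are hypothetical, chosen only for illustration.

```c
#include <stdint.h>

/* A 64-bit word held as two 32-bit halves, as 32-bit code must do. */
typedef struct { uint32_t hi, lo; } u64_halves;

/* Reference: native 64-bit rotate right, valid for 0 < n < 64. */
uint64_t rotr64(uint64_t x, unsigned n) {
    return (x >> n) | (x << (64 - n));
}

/* The same rotation using only 32-bit operations, valid for 0 < n < 64. */
u64_halves rotr64_halves(u64_halves x, unsigned n) {
    if (n >= 32) {  /* Rotating by 32 is just swapping the halves */
        uint32_t t = x.hi; x.hi = x.lo; x.lo = t;
        n -= 32;
    }
    if (n == 0)
        return x;   /* Avoid an undefined shift by 32 below */
    u64_halves r;
    r.lo = (x.lo >> n) | (x.hi << (32 - n));  /* Low bits of hi enter lo */
    r.hi = (x.hi >> n) | (x.lo << (32 - n));  /* Low bits of lo wrap into hi */
    return r;
}

/* 64-bit addition from two 32-bit additions plus a carry. */
u64_halves add64_halves(u64_halves a, u64_halves b) {
    u64_halves r;
    r.lo = a.lo + b.lo;
    r.hi = a.hi + b.hi + (r.lo < a.lo);  /* Carry out of the low half */
    return r;
}
```

Each SHA-512 round contains several such rotations and additions, so this overhead adds up quickly in 32-bit code, whereas 64-bit code can use a single instruction for each.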

All the benchmark results above are based on: CPU = Intel Core 2 Quad Q6600 2.40 GHz (single-threaded), OS = Ubuntu 10.04 (32-bit and 64-bit), compiler = GCC 4.4.3.

Remarks

Regarding the x86-64 code:

Regarding the x86 code: