Lowest SHA-512 value by brute force

Lowest SHA-512 program running screenshot

I encountered a competition to find an ASCII string whose SHA-512 hash value is as low as possible (which roughly speaking means having many leading zeros). So I accepted the challenge and built a brute-force search program using C code, x86-64 assembly, AVX vector instructions, and pthreads multithreading.

Source code

C + x86-64 AVX2 (quad SHA-512)

Download: lowest-sha512-quad-pthread.c, sha512-64-avx2.S

Compile: cc -pthread -O1 lowest-sha512-quad-pthread.c sha512-64-avx2.S -o lowest-sha512-quad-pthread

This program uses 256-bit YMM registers to compute four simultaneous instances of the SHA-512 hash function per thread. The program prints abbreviated hashes to standard error and full hashes to standard output (intended for logging).

C + x86-64 AVX (dual SHA-512)

Download: lowest-sha512-dual-pthread.c, sha512-64-avx.S

Compile: cc -pthread -O1 lowest-sha512-dual-pthread.c sha512-64-avx.S -o lowest-sha512-dual-pthread

This program uses 128-bit XMM registers to compute two simultaneous instances of the SHA-512 hash function per thread. The program prints abbreviated hashes to standard error and full hashes to standard output (intended for logging).

C (portable)

Download: lowest-sha512.c

Compile: cc -O1 lowest-sha512.c -o lowest-sha512

This program is in pure C (no x86-64 assembly) and forgoes multithreading, so it can run on a wide variety of environments. Though it’s not very fast, the code illustrates my core ideas involved in this brute-force hashing program and contains fewer minute details than the optimized versions.

Python

Download: lowest-sha512.py

This highly simplified program demonstrates the core of what the C programs are doing. It prints abbreviated hashes to standard output only. It’s about two orders of magnitude slower than the slowest C program here.

Notes

I took my SHA-512 implementation in x86-64 assembly and manually vectorized the calculations using AVX and AVX2 registers and instructions.
The 3-operand arithmetic introduced in AVX is a welcome change from x86’s classic 2-operand form. I had to change my habits and refrain from writing MOV instructions a few times (e.g. because bit shifting is no longer destructive), but overall the clarity of the code is improved.
The code is only available for Linux x86-64 because more than 8 registers are needed in the hash computation. x86-32 can be supported, but the code would need numerous register spills, making it inefficient.
For those who have CPUs that don’t support AVX or AVX2, you can still compile the code, download the Intel Software Development Emulator, and run the binaries under the SDE to verify that they work – albeit at a slowdown factor of about 1000×. (In fact, I developed the quad/AVX2 version using the SDE, because AVX2 is only supported since Intel Haswell (year 2013) and I didn’t have access to such a new CPU in March 2014.)
Similar to the topic here, the essence of Bitcoin mining is finding a byte string whose double-SHA-256 hash is lower than a given threshold. In theory, the ideas on this page can be adapted into a program that performs CPU- or GPU-based Bitcoin mining. However, the formatting of data structures is more complicated, and much faster and more energy-efficient ASICs already dominate the market.