RC4 cipher in x86 assembly

The core of the RC4 stream cipher is a very small amount of code, so I decided to implement it in x86 assembly language for fun to see how fast I could make it go.

Excluding comments and blank lines, my x86 code is 40 lines long, and the main encryption loop consists of only 14 instructions. Interestingly, there are just enough registers on x86 to hold all the relevant values for this algorithm, so no register spills are needed.

Source code

This code offers a reusable function that performs RC4 encryption and also a demo main() function that runs a self-test and a speed benchmark.

Files:

To use this code, compile it on Linux with this command:

cc rc4-test.c rc4-x86.s -o rc4-test (32-bit only)
cc rc4-test.c rc4-x8664.s -o rc4-test (64-bit only)

Then run the executable with ./rc4-test.

Benchmark results

A quick, informal benchmark on Intel Core 2 Quad Q6600 2.40 GHz (using a single core), Ubuntu 10.04, GCC 4.4.3 gives these numbers:

C code, GCC -O0: 62.7 MiB/s
C code, GCC -O1: 171 MiB/s
C code, GCC -O2: 174 MiB/s
C code, GCC -O3: 219 MiB/s
C code, GCC -O1 -fomit-frame-pointer: 187 MiB/s
C code, GCC -O2 -fomit-frame-pointer: 191 MiB/s
C code, GCC -O3 -fomit-frame-pointer: 219 MiB/s
x86 code, GCC -O1: 318 MiB/s

Therefore, we see that my x86 code is 1.45× as fast as my C code. Manually writing and optimizing assembly code seems to pay off significantly in this case.

Notes

Since RC4 is a stream cipher, encryption and decryption are the same operation. Hence I am not providing an explicit decrypt() function.
Using 8-bit instructions in 32-bit mode has no intrinsic penalty, but the important thing to watch out for is partial register accesses. For example, if you update AL and then read EAX, the read will sometimes incur a delay that significantly hurts performance. In the code, the usages of movbzl from the lower 8 bits of a register to the whole 32-bit register itself are functionally equivalent to andl $0xFF, but the latter is much slower because of partial register access.
The x86-64 version takes advantage of the fact that 32-bit operations (addl, movbzl, etc.) automatically clear the top 32 bits of the destination 64-bit register. It also exploits the fact that 32-bit operations on the first 8 registers (EAX, etc., but not R8 to R15) do not use a REX prefix byte.

More info

Wikipedia: RC4