Tiny Encryption Algorithm in x86 assembly

The Tiny Encryption Algorithm (TEA) is what it claims to be – a cipher that can be implemented with only a small amount of code. It is a block cipher with a block size of 64 bits and a key size of 128 bits. Despite its simplicity, it is designed to be a serious cipher; it does not compare with simple ciphers like ROT13.

As an exercise, I decided to implement TEA in x86 (32-bit) assembly language to understand x86 programming better and to see just how small the TEA implementation would be. I am pleased to say that the exercise was successful and the resulting code is small and clean.

The x86-32 code, explained

.globl tea_encrypt_x86 tea_encrypt_x86:	This function would be called from C using this prototype: `void tea_encrypt(uint32_t msg[], const uint32_t key);`
pushl %ebp movl %esp, %ebp subl $12, %esp	Enter and allocate 3 words of local storage on the stack.
movl %ebx, 0(%esp) movl %esi, 4(%esp) movl %edi, 8(%esp)	Preserve the callee-save registers, as per the cdecl calling convention.
movl 8(%ebp), %eax /* Message / movl 12(%ebp), %edx / Key */	Load the base of the message array into EAX temporarily, and the base of the key array into EDX.
movl 0(%eax), %esi movl 4(%eax), %edi	Load both 32-bit words of the message and keep them in ESI and EDI.
movl $0x9E3779B9, %ecx /* 'sum' */	Initialize the round constant called `sum` and keep it in ECX.
.tea_encrypt_top:	Beginning of loop.
movl %edi, %ebx shll $4, %ebx addl 0(%edx), %ebx leal (%edi,%ecx), %eax xorl %eax, %ebx movl %edi, %eax shrl $5, %eax addl 4(%edx), %eax xorl %eax, %ebx addl %ebx, %esi	Encrypt the 0th message word. EAX and EBX are used as temporaries. Summary in C code: `esi += ((edi << 4) + key[0]) ^ (edi + ecx) ^ ((edi >> 5) + key[1]);`
movl %esi, %ebx shll $4, %ebx addl 8(%edx), %ebx leal (%esi,%ecx), %eax xorl %eax, %ebx movl %esi, %eax shrl $5, %eax addl 12(%edx), %eax xorl %eax, %ebx addl %ebx, %edi	Encrypt the 1st message word in the same way. C code: `edi += ((esi << 4) + key[2]) ^ (esi + ecx) ^ ((esi >> 5) + key[3]);`
addl $0x9E3779B9, %ecx cmpl $0x6526B0D9, %ecx jne .tea_encrypt_top	Advance the round constant. Loop if 32 iterations have not been completed yet. (Note: 33₁₀ × 9E3779B9₁₆ ≡ 6526B0D9₁₆ mod 2³².)
movl 8(%ebp), %eax movl %esi, 0(%eax) movl %edi, 4(%eax)	End of loop. Reload the base of the message array into EAX, then store the message words back into this array (which was passed in by the caller).
movl 0(%esp), %ebx movl 4(%esp), %esi movl 8(%esp), %edi	Restore the callee-save registers.
addl $12, %esp popl %ebp retl	Deallocate stack space and exit.

Source code

This package of code contains a main function that runs a self-check and speed benchmark, a C implementation of TEA, and an assembly implementation of TEA adapted for two architectures (x86, x86-64). To use the code, compile it on Linux with one of these commands depending on your architecture:

cc tea-test.c tea-x86.s -o tea-test
cc tea-test.c tea-x8664.s -o tea-test

Lessons learned

In the beginning, I made many mistakes with the calling convention, improperly saving and loading the callee-save registers. This thankfully resulted in quick segmentation faults, but since the crashes gave no further explanation, I had to carefully audit my code to fix the errors.
x86 has insane register pressure. There are 8 general-purpose registers, but ESP and EBP are used for the function call mechanism, for getting the caller’s arguments, and for using the current function’s local variables. So that leaves only 6 registers available for user code. I allocated a register for each of these variables: 0th message word, 1st message word, sum (round constant and indirectly a loop counter), key address (points to an array of 4 words), and two temporaries (which are absolutely necessary). On a better hardware platform, I would give a register to each of the 4 words of the key.
The basic x86 instructions generally use 2-operand arithmetic operations, so the value of one argument is destroyed after the operation. For example, to encrypt the 0th message word, the value of the 1st message word is needed three times. In two instances, the 1st message word had to be copied to a temporary register (e.g. movl %edi, %ebx) to preserve its value for later use.
Dealing with the 2-operand arithmetic operations, allocating registers, and tracking register lifetime manually was rather uninteresting and time-consuming.
The LEA (load effective address) instruction is a curious one. It can be abused to perform addition in 3-operand form, and I took advantage of it to simplify the code.
The fact that x86 arithmetic operations can read from or write to memory was extremely useful: It made the code more compact and reduced the inconvenience of not having enough registers to store all the useful variables. By contrast on a typical RISC architecture, memory addressing modes are available only for load and store operations.
When I adapted the assembly code to make the x86-64 version, I kept the structure and especially core arithmetic essentially the same. The calling convention was necessarily changed to fit the environment, and I used registers in a different, more natural way. Because all words of the key are loaded into registers here, the core encryption loop has no memory accesses, unlike the x86-32 code.

Notes

By x86, I mean what Intel officially calls IA-32.
There are some known cryptographic attacks on TEA. A successor to TEA called XTEA was designed to correct some of the weaknesses in TEA. For serious applications, XTEA should be used instead of TEA.

Project Nayuki

Tiny Encryption Algorithm in x86 assembly

The x86-32 code, explained

Source code

Lessons learned

Notes

More info