Tiny Encryption Algorithm in x86 assembly
The Tiny Encryption Algorithm (TEA) is what it claims to be – a cipher that can be implemented with only a small amount of code. It is a block cipher with a block size of 64 bits and a key size of 128 bits. Despite its simplicity, it is designed to be a serious cipher; it does not compare with simple ciphers like ROT13.
As an exercise, I decided to implement TEA in x86 (32-bit) assembly language to understand x86 programming better and to see just how small the TEA implementation would be. I am pleased to say that the exercise was successful and the resulting code is small and clean.
The x86-32 code, explained
.globl tea_encrypt_x86 tea_encrypt_x86: |
This function would be called from C using this prototype: |
pushl %ebp movl %esp, %ebp subl $12, %esp |
Enter and allocate 3 words of local storage on the stack. |
movl %ebx, 0(%esp) movl %esi, 4(%esp) movl %edi, 8(%esp) |
Preserve the callee-save registers, as per the cdecl calling convention. |
movl 8(%ebp), %eax /* Message */ movl 12(%ebp), %edx /* Key */ |
Load the base of the message array into EAX temporarily, and the base of the key array into EDX. |
movl 0(%eax), %esi movl 4(%eax), %edi |
Load both 32-bit words of the message and keep them in ESI and EDI. |
movl $0x9E3779B9, %ecx /* 'sum' */ |
Initialize the round constant called |
.tea_encrypt_top: |
Beginning of loop. |
movl %edi, %ebx shll $4, %ebx addl 0(%edx), %ebx leal (%edi,%ecx), %eax xorl %eax, %ebx movl %edi, %eax shrl $5, %eax addl 4(%edx), %eax xorl %eax, %ebx addl %ebx, %esi |
Encrypt the 0th message word. EAX and EBX are used as temporaries. Summary in C code: |
movl %esi, %ebx shll $4, %ebx addl 8(%edx), %ebx leal (%esi,%ecx), %eax xorl %eax, %ebx movl %esi, %eax shrl $5, %eax addl 12(%edx), %eax xorl %eax, %ebx addl %ebx, %edi |
Encrypt the 1st message word in the same way. C code: |
addl $0x9E3779B9, %ecx cmpl $0x6526B0D9, %ecx jne .tea_encrypt_top |
Advance the round constant. Loop if 32 iterations have not been completed yet. (Note: 3310 × 9E3779B916 ≡ 6526B0D916 mod 232.) |
movl 8(%ebp), %eax movl %esi, 0(%eax) movl %edi, 4(%eax) |
End of loop. Reload the base of the message array into EAX, then store the message words back into this array (which was passed in by the caller). |
movl 0(%esp), %ebx movl 4(%esp), %esi movl 8(%esp), %edi |
Restore the callee-save registers. |
addl $12, %esp popl %ebp retl |
Deallocate stack space and exit. |
Source code
This package of code contains a main function that runs a self-check and speed benchmark, a C implementation of TEA, and an assembly implementation of TEA adapted for two architectures (x86, x86-64). To use the code, compile it on Linux with one of these commands depending on your architecture:
cc tea-test.c tea-x86.s -o tea-test
cc tea-test.c tea-x8664.s -o tea-test
Lessons learned
In the beginning, I made many mistakes with the calling convention, improperly saving and loading the callee-save registers. This thankfully resulted in quick segmentation faults, but since the crashes gave no further explanation, I had to carefully audit my code to fix the errors.
x86 has insane register pressure. There are 8 general-purpose registers, but ESP and EBP are used for the function call mechanism, for getting the caller’s arguments, and for using the current function’s local variables. So that leaves only 6 registers available for user code. I allocated a register for each of these variables: 0th message word, 1st message word,
sum
(round constant and indirectly a loop counter), key address (points to an array of 4 words), and two temporaries (which are absolutely necessary). On a better hardware platform, I would give a register to each of the 4 words of the key.The basic x86 instructions generally use 2-operand arithmetic operations, so the value of one argument is destroyed after the operation. For example, to encrypt the 0th message word, the value of the 1st message word is needed three times. In two instances, the 1st message word had to be copied to a temporary register (e.g.
movl %edi, %ebx
) to preserve its value for later use.Dealing with the 2-operand arithmetic operations, allocating registers, and tracking register lifetime manually was rather uninteresting and time-consuming.
The LEA (load effective address) instruction is a curious one. It can be abused to perform addition in 3-operand form, and I took advantage of it to simplify the code.
The fact that x86 arithmetic operations can read from or write to memory was extremely useful: It made the code more compact and reduced the inconvenience of not having enough registers to store all the useful variables. By contrast on a typical RISC architecture, memory addressing modes are available only for load and store operations.
When I adapted the assembly code to make the x86-64 version, I kept the structure and especially core arithmetic essentially the same. The calling convention was necessarily changed to fit the environment, and I used registers in a different, more natural way. Because all words of the key are loaded into registers here, the core encryption loop has no memory accesses, unlike the x86-32 code.
Notes
By x86, I mean what Intel officially calls IA-32.
There are some known cryptographic attacks on TEA. A successor to TEA called XTEA was designed to correct some of the weaknesses in TEA. For serious applications, XTEA should be used instead of TEA.