Project Nayuki


Tiny Encryption Algorithm in x86 assembly

The Tiny Encryption Algorithm (TEA) is what it claims to be – a cipher that can be implemented with only a small amount of code. It is a block cipher with a block size of 64 bits and a key size of 128 bits. Despite its simplicity, it is designed to be a serious cipher; it does not compare with simple ciphers like ROT13.

As an exercise, I decided to implement TEA in x86 (32-bit) assembly language to understand x86 programming better and to see just how small the TEA implementation would be. I am pleased to say that the exercise was successful and the resulting code is small and clean.

The x86-32 code, explained

.globl tea_encrypt_x86
tea_encrypt_x86:

This function would be called from C using this prototype:
void tea_encrypt(uint32_t msg[], const uint32_t key);

    pushl  %ebp
    movl   %esp, %ebp
    subl   $12, %esp

Enter and allocate 3 words of local storage on the stack.

    movl   %ebx, 0(%esp)
    movl   %esi, 4(%esp)
    movl   %edi, 8(%esp)

Preserve the callee-save registers, as per the cdecl calling convention.

    movl    8(%ebp), %eax  /* Message */
    movl   12(%ebp), %edx  /* Key */

Load the base of the message array into EAX temporarily, and the base of the key array into EDX.

    movl   0(%eax), %esi
    movl   4(%eax), %edi

Load both 32-bit words of the message and keep them in ESI and EDI.

    movl   $0x9E3779B9, %ecx  /* 'sum' */

Initialize the round constant called sum and keep it in ECX.

.tea_encrypt_top:

Beginning of loop.

    movl   %edi, %ebx
    shll   $4, %ebx
    addl   0(%edx), %ebx
    leal   (%edi,%ecx), %eax
    xorl   %eax, %ebx
    movl   %edi, %eax
    shrl   $5, %eax
    addl   4(%edx), %eax
    xorl   %eax, %ebx
    addl   %ebx, %esi

Encrypt the 0th message word. EAX and EBX are used as temporaries. Summary in C code:
esi += ((edi << 4) + key[0]) ^ (edi + ecx) ^ ((edi >> 5) + key[1]);

    movl   %esi, %ebx
    shll   $4, %ebx
    addl   8(%edx), %ebx
    leal   (%esi,%ecx), %eax
    xorl   %eax, %ebx
    movl   %esi, %eax
    shrl   $5, %eax
    addl   12(%edx), %eax
    xorl   %eax, %ebx
    addl   %ebx, %edi

Encrypt the 1st message word in the same way. C code:
edi += ((esi << 4) + key[2]) ^ (esi + ecx) ^ ((esi >> 5) + key[3]);

    addl   $0x9E3779B9, %ecx
    cmpl   $0x6526B0D9, %ecx
    jne    .tea_encrypt_top

Advance the round constant. Loop if 32 iterations have not been completed yet. (Note: 3310 × 9E3779B916 ≡ 6526B0D916 mod 232.)

    movl   8(%ebp), %eax
    movl   %esi, 0(%eax)
    movl   %edi, 4(%eax)

End of loop. Reload the base of the message array into EAX, then store the message words back into this array (which was passed in by the caller).

    movl   0(%esp), %ebx
    movl   4(%esp), %esi
    movl   8(%esp), %edi

Restore the callee-save registers.

    addl   $12, %esp
    popl   %ebp
    retl

Deallocate stack space and exit.

Source code

This package of code contains a main function that runs a self-check and speed benchmark, a C implemen­tation of TEA, and an assembly implemen­tation of TEA adapted for two archi­tectures (x86, x86-64). To use the code, compile it on Linux with one of these commands depending on your archi­tecture:

  • cc tea-test.c tea-x86.s -o tea-test
  • cc tea-test.c tea-x8664.s -o tea-test

Lessons learned

  • In the beginning, I made many mistakes with the calling convention, improperly saving and loading the callee-save registers. This thankfully resulted in quick segmentation faults, but since the crashes gave no further explanation, I had to carefully audit my code to fix the errors.

  • x86 has insane register pressure. There are 8 general-purpose registers, but ESP and EBP are used for the function call mechanism, for getting the caller’s arguments, and for using the current function’s local variables. So that leaves only 6 registers available for user code. I allocated a register for each of these variables: 0th message word, 1st message word, sum (round constant and indirectly a loop counter), key address (points to an array of 4 words), and two temporaries (which are absolutely necessary). On a better hardware platform, I would give a register to each of the 4 words of the key.

  • The basic x86 instructions generally use 2-operand arithmetic operations, so the value of one argument is destroyed after the operation. For example, to encrypt the 0th message word, the value of the 1st message word is needed three times. In two instances, the 1st message word had to be copied to a temporary register (e.g. movl %edi, %ebx) to preserve its value for later use.

  • Dealing with the 2-operand arithmetic operations, allocating registers, and tracking register lifetime manually was rather uninteresting and time-consuming.

  • The LEA (load effective address) instruction is a curious one. It can be abused to perform addition in 3-operand form, and I took advantage of it to simplify the code.

  • The fact that x86 arithmetic operations can read from or write to memory was extremely useful: It made the code more compact and reduced the inconvenience of not having enough registers to store all the useful variables. By contrast on a typical RISC architecture, memory addressing modes are available only for load and store operations.

  • When I adapted the assembly code to make the x86-64 version, I kept the structure and especially core arithmetic essentially the same. The calling convention was necessarily changed to fit the environment, and I used registers in a different, more natural way. Because all words of the key are loaded into registers here, the core encryption loop has no memory accesses, unlike the x86-32 code.

Notes

  • By x86, I mean what Intel officially calls IA-32.

  • There are some known cryptographic attacks on TEA. A successor to TEA called XTEA was designed to correct some of the weaknesses in TEA. For serious applications, XTEA should be used instead of TEA.

More info