# Fast Fourier transform in x86 assembly

I created this FFT library to assess the effort and speedup of a hand-written SIMD vectorized implementation. The assembly implementation is under 150 lines of clear code; it achieves a speedup of 2× on small inputs, but only slight speedup on large inputs (memory-bound?).

The assembly code requires x86-64 AVX and FMA3, supported by Intel Haswell and later CPUs. The library uses the same equation as the implementation on the page Free small FFT in multiple languages, only the forward radix-2 FFT is provided, and the FFT is unscaled.

## Source code

- fft-test.c: Runnable main test program (portable)
- fft.h: Exported function prototypes for all FFT implementations (portable)
- fft-portable.c: FFT implementation for all CPU architectures
- fft-x8664-avx.s: FFT core implementation in x86-64 AVX assembly
- fft-x8664-avx-aux.c: Auxiliary functions in C to support the AVX implementation
- fft-model-of-x8664-avx.c: Portable C code illustrating how the AVX code works

To compile, use *one* of these commands:

`cc -O1 -o fft-test fft-test.c fft-portable.c -lm`

`cc -O1 -o fft-test fft-test.c fft-x8664-avx-aux.c fft-x8664-avx.s -lm`

`cc -O1 -o fft-test fft-test.c fft-x8664-avx-aux.c fft-model-of-x8664-avx.c -lm`

License: MIT (open source)

Source code notes:

The x86-64 AVX assembly code only covers the FFT core. The table initialization and destruction are still implemented in C (fft-x8664-avx-aux.c) because they’re not time-critical.

The key idea behind the x86-64 AVX implementation is that all processing happens on vectors of four 64-bit floats, rather than individual scalar floats. In theory this could be 4× as fast as scalar code, but is limited by memory bandwidth, FPU execution resources, inherently serial loop control code, etc. Also, the cosine/sine tables are interleaved in blocks of 4 elements, and the table values for each FFT size are packed together so that the memory access stride is always contiguous.

The portable C implementation is the same as that found on my free FFT page, except that the table computation is separated out to an initialization phase (

`fft_init()`

).As a bonus, the x86-64 AVX auxiliary C code has an implementation of a more accurate sine function by doing argument reduction to a smaller domain (1/8th of a circle instead of a half circle).

The C model of x86-64 AVX code exists to make it easier for the reader to understand what the AVX code is doing in a more familiar notation and without having to look up instruction names in the Intel manual.

## Benchmark

For each implementation and FFT size, the nanoseconds per FFT computation is shown in the table.

FFT size | Portable C, -O0 | Portable C, best | C model of AVX | AVX impl | Speedup |
---|---|---|---|---|---|

4 | 83 | 43 | 25 | 30 | 0.83× |

8 | 178 | 68 | 40 | 40 | 1.00× |

16 | 401 | 115 | 79 | 57 | 1.39× |

32 | 910 | 221 | 173 | 91 | 1.90× |

64 | 2 086 | 457 | 385 | 185 | 2.08× |

128 | 5 732 | 984 | 851 | 391 | 2.18× |

256 | 10 661 | 2 187 | 1 902 | 879 | 2.16× |

512 | 23 632 | 5 046 | 4 559 | 2 960 | 1.54× |

1 024 | 51 861 | 11 362 | 10 576 | 6 411 | 1.65× |

2 048 | 115 539 | 26 845 | 24 615 | 14 471 | 1.70× |

4 096 | 254 768 | 69 851 | 64 313 | 38 166 | 1.69× |

8 192 | 561 347 | 170 864 | 148 510 | 92 166 | 1.61× |

16 384 | 1 266 715 | 368 223 | 343 670 | 209 489 | 1.64× |

32 768 | 2 878 547 | 897 410 | 700 666 | 502 660 | 1.39× |

65 536 | 6 441 139 | 2 143 479 | 1 528 429 | 1 083 482 | 1.41× |

131 072 | 14 593 055 | 4 855 895 | 3 289 835 | 2 355 535 | 1.40× |

262 144 | 33 355 806 | 12 773 822 | 7 373 814 | 5 541 606 | 1.33× |

524 288 | 86 464 543 | 33 790 638 | 18 393 958 | 15 442 926 | 1.19× |

1 048 576 | 220 892 337 | 109 162 034 | 55 214 393 | 48 812 958 | 1.13× |

2 097 152 | 598 481 088 | 279 865 295 | 129 967 843 | 112 598 745 | 1.15× |

4 194 304 | 1 352 574 930 | 657 329 787 | 291 134 128 | 246 925 032 | 1.18× |

8 388 608 | 2 775 037 172 | 1 452 446 318 | 625 878 736 | 531 560 736 | 1.18× |

16 777 216 | 5 936 787 976 | 3 158 039 436 | 1 325 008 402 | 1 113 777 707 | 1.19× |

33 554 432 | 12 731 136 778 | 6 816 764 522 | 2 807 901 996 | 2 385 468 193 | 1.18× |

67 108 864 | 27 154 169 335 | 14 967 892 000 | 5 908 496 753 | 5 029 359 709 | 1.17× |

134 217 728 | 57 980 427 807 | 33 244 483 338 | 12 538 038 624 | 10 419 909 000 | 1.20× |

268 435 456 | 121 103 066 499 | 71 678 823 683 | 25 447 116 592 | 21 696 563 979 | 1.17× |

536 870 912 | 266 202 828 987 | 161 371 471 435 | 53 985 182 920 | 42 373 561 605 | 1.27× |

The benchmark results are a mixed bag, but a few conclusions can be drawn from the numbers:

The pure-C model of the AVX code has interesting behavior: It starts off being only a little faster than the portable C code, yet for large FFT sizes trails the speed of the true AVX implementation by a small constant factor. This suggests that at medium sizes, the AVX code does computations more efficiently than the C code due to vectorization. But it also suggests that at large sizes, the FFT is limited by memory bandwidth; furthermore the large access strides used in the portable C code (but not in the AVX code or its C model) are probably quite inefficient.

The hand-coded x86-64 AVX code is from 1.2× to 2.2× faster than the best C code (which is the AVX model code). This speedup was not as impressive as hoped.

Compiling C code with optimizations enabled is obviously a good idea. The portable C FFT code is 1.5× to 4× faster when compiled with GCC -O1 compared to -O0. Overall, the best results were achieved for fft-portable.c at -O2 -march=native, for fft-model-of-x8664-avx.c at -O3 -march=native, and fft-x8664-avx.s at -O2 (no -march). Also -O1 and -Os were evaluated, and they were competitive but not the best.

All numbers were obtained on an Intel Core i5-4690 (Haswell) 3.50 GHz CPU, running Ubuntu 16.10, using GCC 6.2.0. Each number is the minimum of 32 trials (except the bottom 5 rows, which use fewer trials due to the long run time).