Project Nayuki


Fast Fourier transform in x86 assembly

I created this FFT library to assess the effort and speedup of a hand-written SIMD vectorized implementation. The assembly implementation is under 150 lines of clear code; it achieves a speedup of 2× on small inputs, but only slight speedup on large inputs (memory-bound?).

The assembly code requires x86-64 AVX and FMA3, supported by Intel Haswell and later CPUs. The library uses the same equation as the implementation on the page Free small FFT in multiple languages, only the forward radix-2 FFT is provided, and the FFT is unscaled.

Source code

To compile, use one of these commands:

  • cc -O1 -o fft-test fft-test.c fft-portable.c -lm
  • cc -O1 -o fft-test fft-test.c fft-x8664-avx-aux.c fft-x8664-avx.s -lm
  • cc -O1 -o fft-test fft-test.c fft-x8664-avx-aux.c fft-model-of-x8664-avx.c -lm

License: MIT (open source)

Source code notes:

  • The x86-64 AVX assembly code only covers the FFT core. The table initialization and destruction are still implemented in C (fft-x8664-avx-aux.c) because they’re not time-critical.

  • The key idea behind the x86-64 AVX implementation is that all processing happens on vectors of four 64-bit floats, rather than individual scalar floats. In theory this could be 4× as fast as scalar code, but is limited by memory bandwidth, FPU execution resources, inherently serial loop control code, etc. Also, the cosine/sine tables are interleaved in blocks of 4 elements, and the table values for each FFT size are packed together so that the memory access stride is always contiguous.

  • The portable C implementation is the same as that found on my free FFT page, except that the table computation is separated out to an initialization phase (fft_init()).

  • As a bonus, the x86-64 AVX auxiliary C code has an implementation of a more accurate sine function by doing argument reduction to a smaller domain (1/8th of a circle instead of a half circle).

  • The C model of x86-64 AVX code exists to make it easier for the reader to understand what the AVX code is doing in a more familiar notation and without having to look up instruction names in the Intel manual.

Benchmark

For each implementation and FFT size, the nanoseconds per FFT computation is shown in the table.

FFT size Portable C, -O0 Portable C, best C model of AVX AVX impl Speedup
4834325300.83×
81786840401.00×
1640111579571.39×
32910221173911.90×
642 0864573851852.08×
1285 7329848513912.18×
25610 6612 1871 9028792.16×
51223 6325 0464 5592 9601.54×
1 02451 86111 36210 5766 4111.65×
2 048115 53926 84524 61514 4711.70×
4 096254 76869 85164 31338 1661.69×
8 192561 347170 864148 51092 1661.61×
16 3841 266 715368 223343 670209 4891.64×
32 7682 878 547897 410700 666502 6601.39×
65 5366 441 1392 143 4791 528 4291 083 4821.41×
131 07214 593 0554 855 8953 289 8352 355 5351.40×
262 14433 355 80612 773 8227 373 8145 541 6061.33×
524 28886 464 54333 790 63818 393 95815 442 9261.19×
1 048 576220 892 337109 162 03455 214 39348 812 9581.13×
2 097 152598 481 088279 865 295129 967 843112 598 7451.15×
4 194 3041 352 574 930657 329 787291 134 128246 925 0321.18×
8 388 6082 775 037 1721 452 446 318625 878 736531 560 7361.18×
16 777 2165 936 787 9763 158 039 4361 325 008 4021 113 777 7071.19×
33 554 43212 731 136 7786 816 764 5222 807 901 9962 385 468 1931.18×
67 108 86427 154 169 33514 967 892 0005 908 496 7535 029 359 7091.17×
134 217 72857 980 427 80733 244 483 33812 538 038 62410 419 909 0001.20×
268 435 456121 103 066 49971 678 823 68325 447 116 59221 696 563 9791.17×
536 870 912266 202 828 987161 371 471 43553 985 182 92042 373 561 6051.27×

The benchmark results are a mixed bag, but a few conclusions can be drawn from the numbers:

  • The pure-C model of the AVX code has interesting behavior: It starts off being only a little faster than the portable C code, yet for large FFT sizes trails the speed of the true AVX implementation by a small constant factor. This suggests that at medium sizes, the AVX code does computations more efficiently than the C code due to vectorization. But it also suggests that at large sizes, the FFT is limited by memory bandwidth; furthermore the large access strides used in the portable C code (but not in the AVX code or its C model) are probably quite inefficient.

  • The hand-coded x86-64 AVX code is from 1.2× to 2.2× faster than the best C code (which is the AVX model code). This speedup was not as impressive as hoped.

  • Compiling C code with optimizations enabled is obviously a good idea. The portable C FFT code is 1.5× to 4× faster when compiled with GCC -O1 compared to -O0. Overall, the best results were achieved for fft-portable.c at -O2 -march=native, for fft-model-of-x8664-avx.c at -O3 -march=native, and fft-x8664-avx.s at -O2 (no -march). Also -O1 and -Os were evaluated, and they were competitive but not the best.

  • All numbers were obtained on an Intel Core i5-4690 (Haswell) 3.50 GHz CPU, running Ubuntu 16.10, using GCC 6.2.0. Each number is the minimum of 32 trials (except the bottom 5 rows, which use fewer trials due to the long run time).