Project Nayuki


Fast Fourier transform in x86 assembly

I created this FFT library to assess the effort and speedup of a hand-written SIMD vectorized implementation. The assembly implementation is under 150 lines of clear code; it achieves a speedup of 2× on small inputs, but only slight speedup on large inputs (memory-bound?).

The assembly code requires x86-64 AVX and FMA3, supported by Intel Haswell and later CPUs. The library uses the same equation as the implementation on the page Free small FFT in multiple languages, only the forward radix-2 FFT is provided, and the FFT is unscaled.

Source code

To compile, use one of these commands:

License: MIT (open source)

Source code notes:

Benchmark

For each implementation and FFT size, the nanoseconds per FFT computation is shown in the table.

FFT size Portable C, -O0 Portable C, best C model of AVX AVX impl Speedup
4834325300.83×
81786840401.00×
1640111579571.39×
32910221173911.90×
642 0864573851852.08×
1285 7329848513912.18×
25610 6612 1871 9028792.16×
51223 6325 0464 5592 9601.54×
1 02451 86111 36210 5766 4111.65×
2 048115 53926 84524 61514 4711.70×
4 096254 76869 85164 31338 1661.69×
8 192561 347170 864148 51092 1661.61×
16 3841 266 715368 223343 670209 4891.64×
32 7682 878 547897 410700 666502 6601.39×
65 5366 441 1392 143 4791 528 4291 083 4821.41×
131 07214 593 0554 855 8953 289 8352 355 5351.40×
262 14433 355 80612 773 8227 373 8145 541 6061.33×
524 28886 464 54333 790 63818 393 95815 442 9261.19×
1 048 576220 892 337109 162 03455 214 39348 812 9581.13×
2 097 152598 481 088279 865 295129 967 843112 598 7451.15×
4 194 3041 352 574 930657 329 787291 134 128246 925 0321.18×
8 388 6082 775 037 1721 452 446 318625 878 736531 560 7361.18×
16 777 2165 936 787 9763 158 039 4361 325 008 4021 113 777 7071.19×
33 554 43212 731 136 7786 816 764 5222 807 901 9962 385 468 1931.18×
67 108 86427 154 169 33514 967 892 0005 908 496 7535 029 359 7091.17×
134 217 72857 980 427 80733 244 483 33812 538 038 62410 419 909 0001.20×
268 435 456121 103 066 49971 678 823 68325 447 116 59221 696 563 9791.17×
536 870 912266 202 828 987161 371 471 43553 985 182 92042 373 561 6051.27×

The benchmark results are a mixed bag, but a few conclusions can be drawn from the numbers: