Fast Fourier transform in x86 assembly

I have a simple fast Fourier transform implementation on the page Free small FFT in multiple languages, whose asymptotic time complexity is the optimal Θ(n log n). I wanted to know how much additional speed-up could be obtained by reorganizing memory access patterns and using SIMD vector arithmetic instructions. The best assembly language implementation here has under 150 lines of clear code and runs 2× to 4× faster than the simple C++ implementation.

The assembly code requires x86-64 AVX and FMA3, supported by Intel Haswell and later CPUs. The library uses the same equation as the other page, only the forward radix-2 FFT is provided, and the FFT is unscaled.

Source code

All the C++ source files are fully portable and do not require x86-64 or vector instructions. All the .s files (assembly language) are only for Unix on x86-64.

FftTest.cpp: Runnable main test program
Fft.cpp: FFT library implementation, excluding the transform core
Fft.hpp: FFT library header
Fft-transformImpl-simple.cpp: FFT core which computes each level fully and reads the exponential table with varying strides, simply adapted from FftComplex.cpp
Fft-transformImpl-linearVector.cpp: FFT core which computes each level fully and reads the exponential table with unit stride, and imitates the length-2 and length-4 vector operations of the AVX code
Fft-transformImpl-linearVector.s: FFT core which computes each level fully and reads the exponential table with unit stride, only for Unix on x86-64
Fft-transformImpl-recurseVector.cpp: FFT core which computes levels recursively in a bottom-up way and reads the exponential table with unit stride, and imitates the length-2 and length-4 vector operations of the AVX code
Fft-transformImpl-recurseVector.s: FFT core which computes levels recursively in a bottom-up way and reads the exponential table with unit stride, only for Unix on x86-64

License: MIT (open source)

To compile, use a command like: Compiler -std=c++17 Optimizations -o FftTest FftTest.cpp Fft.cpp TransformImpl.cpp/s. For example:

c++ -std=c++17 -o FftTest FftTest.cpp Fft.cpp Fft-transformImpl-simple.cpp
g++ -std=c++17 -O -o FftTest FftTest.cpp Fft.cpp Fft-transformImpl-linearVector.s
g++ -std=c++17 -Os -ftree-vectorize -o FftTest FftTest.cpp Fft.cpp Fft-transformImpl-linearVector.cpp
g++ -std=c++17 -O2 -march=native -mtune=native -o FftTest FftTest.cpp Fft.cpp Fft-transformImpl-recurseVector.s
clang++ -std=c++17 -O3 -march=native -o FftTest FftTest.cpp Fft.cpp Fft-transformImpl-recurseVector.cpp

Source code notes:

While there are 5 implementations of the FFT core, the precomputation of tables has only one implementation in C++ because it’s not as time-critical.
Each of the two assembly implementations has a pure-C++ version which use only scalar operations to almost perfectly mimic the vector operations and memory access patterns. This allows the reader to understand what the AVX code is doing in a more familiar notation and without having to look up instruction names in the Intel manual. Also in theory, if the C++ compiler is good at vectorization, then it should generate machine code that is nearly identical to the hand-written vector assembly code.
All the C++ code files are portable and can work on any system with any definition of size_t. But when linking Fft.cpp to Fft-transformImpl-*.s (x86-64 assembly code), it is required that size_t = uint64_t (in other words, std::numeric_limits<size_t>::digits == 64), otherwise the assembly function implementation will corrupt memory and crash. This constraint is not checked at compile time because it would require extra C++ source files.
A previous version of this page featured code that: was implemented in C, arranged the real and imaginary parts of the complex vector in separate arrays instead of interleaving them, and didn’t have a recursive transform. The old codebase is slower and less readable, and is no longer supported.

Memory access patterns

This animation visualizes how the various FFT core implementations read/write different elements of the two arrays over time.

FFT size:
Speed:
Implementation:	Simple LinearVector RecurseVector
Animation:

Input/output vector:

Exponential table:

Note: Each cell represents an std::complex<double>, whose size is 16 bytes.

Benchmark

The main numbers in the tables are the number of nanoseconds per FFT computation for a given FFT size and transform core implementation.

Compiler: GCC 10.2.0

Transform implementation	simple.cpp	linearVector.cpp	linearVector.s	recurseVector.cpp	recurseVector.s
Compile options	-O2 -march=native -mtune=native -ftree-vectorize	-O1 -march=native -mtune=native -ftree-vectorize	-O1 -march=native -mtune=native	-O1 -march=native -mtune=native -ftree-vectorize	-Os -march=native -mtune=native
FFT size
8	76	44	41	40	85
16	127	61	59	62	112
32	238	107	101	112	166
64	488	218	212	234	292
128	1 016	471	443	500	563
256	2 232	1 050	1 011	1 127	1 181
512	4 847	2 371	2 316	2 536	2 561
1 024	10 634	5 431	5 203	5 701	5 668
2 048	23 595	12 374	12 079	12 931	12 534
4 096	60 421	34 200	34 505	32 288	32 007
8 192	145 182	82 865	82 734	79 071	77 133
16 384	328 538	202 016	196 209	179 904	178 106
32 768	794 685	472 896	473 776	399 099	395 295
65 536	1 848 536	1 011 156	1 017 823	912 735	896 032
131 072	4 048 978	2 229 407	2 271 899	1 919 836	1 856 634
262 144	9 568 106	5 377 714	5 509 565	4 463 868	4 555 393
524 288	29 331 856	18 053 211	18 372 963	11 717 066	11 640 852
1 048 576	84 874 361	44 012 739	44 233 226	27 502 046	27 383 013
2 097 152	197 736 251	94 209 742	95 788 139	58 103 293	61 914 542
4 194 304	477 559 022	200 105 030	197 380 868	125 288 316	126 381 213
8 388 608	1 150 845 702	442 110 966	443 317 765	297 457 984	284 732 072
16 777 216	2 701 063 686	985 539 583	1 015 481 605	702 461 388	694 133 979
33 554 432	6 634 030 343	2 170 786 691	2 219 600 086	1 593 194 794	1 542 642 833
67 108 864	14 389 143 994	4 795 731 589	4 782 791 173	4 319 947 607	3 634 788 206
134 217 728	34 308 652 911	10 767 479 096	10 808 237 981	10 046 959 831	10 735 480 888
268 435 456	78 795 861 504	23 323 416 514	23 280 238 633	18 613 868 128	19 604 131 740
536 870 912	191 424 874 237	55 481 062 948	58 530 032 212	48 347 285 488	35 887 147 074

Compiler: Clang 11.0.0

Transform implementation	simple.cpp	linearVector.cpp	linearVector.s	recurseVector.cpp	recurseVector.s
Compile options	-O3 -march=x86-64	-O3 -march=native	-O2 -march=native	-O2 -march=native	-O2 -march=x86-64
FFT size
8	69	40	41	37	85
16	130	55	59	59	111
32	272	99	100	108	165
64	581	203	206	227	290
128	1 269	443	445	490	561
256	2 872	1 027	1 010	1 120	1 180
512	6 351	2 358	2 303	2 552	2 576
1 024	13 939	5 428	5 264	5 852	5 886
2 048	30 282	12 800	12 104	13 468	12 677
4 096	71 968	34 991	33 753	34 826	32 480
8 192	170 657	85 148	81 707	84 985	76 661
16 384	378 781	210 178	210 709	195 597	176 178
32 768	922 679	490 980	473 141	424 887	394 891
65 536	2 125 945	1 048 564	1 066 460	961 162	854 839
131 072	4 510 866	2 218 843	2 177 340	1 969 001	1 856 780
262 144	10 488 899	5 596 053	5 814 941	4 742 676	4 477 989
524 288	31 566 168	17 693 103	17 897 829	11 727 889	11 621 123
1 048 576	84 814 585	45 435 877	43 211 773	26 532 912	26 297 249
2 097 152	217 636 774	97 135 549	94 059 053	60 634 895	58 207 924
4 194 304	510 150 349	205 862 060	196 633 667	132 114 212	121 363 532
8 388 608	1 264 917 458	437 783 298	440 487 726	314 048 050	304 741 011
16 777 216	2 969 336 243	984 612 216	988 402 710	717 541 640	697 949 297
33 554 432	6 552 070 064	2 187 774 518	2 145 750 296	1 650 733 006	1 539 819 002
67 108 864	14 178 321 121	5 321 209 567	4 661 419 579	4 222 166 271	3 962 320 705
134 217 728	32 760 302 811	11 068 710 488	11 110 860 590	8 870 709 543	8 686 626 184
268 435 456	71 142 211 054	21 113 781 611	24 373 299 955	18 847 519 484	19 521 796 294
536 870 912	186 516 806 728	55 904 453 622	57 198 381 577	48 821 368 142	46 479 427 904

Methodology and observations:

All timing numbers were obtained on an Intel Core i5-4690 (Haswell) 3.50 GHz CPU (Turbo Boost disabled), running Ubuntu 20.10.
For the GCC compiler, I tried almost all permutations of these optimization flags: -O0 -O1 -O2 -O3 -Os -march=x86-64/native -mtune=generic/native -ftree-vectorize
For the Clang compiler, I tried almost all permutations of these optimization flags: -O0 -O1 -O2 -O3 -Os -Oz -march=x86-64/native
Although I tested many sets of compiler flags, for each transform core, I selected the set of flags that produced the fastest run times overall and published just those flags and numbers.
For the simple transform core, the best code produced by Clang is about 10% slower than GCC. For the other 4 cores, there is no clear winner between GCC and Clang; hence I consider both compilers to be competitive in the quality of their generated code.
The primary difference between the simple core and the vector cores is that the simple algorithm reads the complex exponential (trigonometric) table with varying strides, whereas the other cores read the table contiguously without skipping any elements. The result of this change is that all the other cores are at least 2× and up to 4× faster than the simple core.
For the linear vector transform core algorithm, the C++ implementation is about the same speed as the hand-written assembly. The compilers do a good job at vectorizing the code.
For the recursive vector transform core algorithm, the C++ implementation is a few percent slower than the hand-written assembly. The compilers still do a decent job at vectorization.
The linear vector core is faster than the recursive vector core up to around FFT size 1024. The recursive vector core is faster than the linear vector core around FFT size 4096 and up. Hence, the linear core is better at smaller sizes and the recursive core is better at larger sizes, which is what one would expect from the memory hierarchy and cache-oblivious algorithms.
The largest FFT size of 2²⁹ complex elements yielded significant variance in the timings (based on unpublished raw data), so I wouldn’t trust that row for drawing conclusions. This is strange because the memory consumption for that size, including both the data vector and precomputed tables, is about 20 GiB, which clearly fits in the machine’s 32 GiB of RAM without swapping.