# Fast Fourier Transform
An asymptotically-improved O(NlogN) algorithm for implementing the Discrete Fourier Transform (naively O(N2)).
# 1 Libraries
- FFTW, fast, but CPU only; also some possible issues with concurrency in shared plugin hosts.
# 2 Decimation In Time
Define DFT(N/T,t,x) to be the T size N/T DFTs with stride offset t∈[0,T).
Then the DFT(N/1,0,x) can be decomposed into T smaller DFTs thus:
X=DFT(N/1,0,x)Xk=N−1∑n=0e−2πink/Nxn=T−1∑t=0N/T−1∑m=0e−2πi(Tm+t)k/NxTm+t=T−1∑t=0e−2πikt/NN/T−1∑m=0e−2πikm/(N/T)xTm+t=T−1∑t=0e−2πikt/NDFT(N/T,t,x)Yt,k=DFT(N/T,t,x)Xk=T−1∑t=0e−2πikt/NYt,k
Continuing, it turns out the recombination is in fact a set of size T DFTs:
Xk=T−1∑t=0e−2πikt/Te−2πikt/Ne−2πikt/TYt,kZt,k=e−2πikt/Ne−2πikt/TYt,kXk=T−1∑t=0e−2πikt/TZt.k=DFT(T,0,Zt,k)
That is, X, the length N DFT of x, can be calculated by first doing T-many length N/T DFTs interleaved of x, multiplying by twiddle factors, then recombining with N/T-many length T DFTs. Inverse DFT follows the same process in reverse.
Applying this recursively gives an implementation of the Fast Fourier Transform algorithm, reducing the cost of DFT from O(N2) to O(NlogN).
# 3 Parallel FFTW
Once you have “planned” your FFT using the FFTW library, you can execute it from another thread. One strategy for large FFTs is to split it into a set of smaller FFTs as above, and execute each in parallel (perhaps using OpenMP).
I implemented it in a phase vocoder time stretcher:
16 threads is fastest in terms of waiting, but is about 35% less efficient in terms of output duration vs CPU seconds than when using 2 threads (the minimum my current code supports). 8 threads is probably the sweet spot: 15% less efficient than 2 threads with waiting times 3x shorter (fast enough for realtime streaming).
# threads output/CPU seconds realtime speed
2 0.21 0.420
4 0.20 0.746
8 0.18 1.232
16 0.13 2.015
Compared with using FFTW3’s built-in OpenMP support:
# threads output/CPU seconds realtime speed
1 0.285 0.285
2 0.251 0.482
4 0.208 0.762
8 0.156 1.149
16 0.125 1.259
Throughput is higher at higher thread counts for manual splitting/threading, though it’s much harder to implement.