# # Fast Fourier Transform

An asymptotically-improved $$O(N \log N)$$ algorithm for implementing the Discrete Fourier Transform (naively $$O(N^2)$$).

## # 1 Libraries

• FFTW, fast, but CPU only; also some possible issues with concurrency in shared plugin hosts.

## # 2 Decimation In Time

Define $$DFT(N/T, t, x)$$ to be the $$T$$ size $$N/T$$ DFTs with stride offset $$t \in [0, T)$$.

Then the $$DFT(N/1, 0, x)$$ can be decomposed into $$T$$ smaller DFTs thus:

\begin{aligned} X &= DFT(N/1, 0, x) \\ X_k &= \sum_{n=0}^{N-1} e^{- 2 \pi i n k / N} x_n \\ &= \sum_{t=0}^{T-1} \sum_{m=0}^{N/T-1} e^{- 2 \pi i (T m + t) k / N} x_{T m + t} \\ &= \sum_{t=0}^{T-1} e^{-2 \pi i k t / N} \sum_{m=0}^{N/T-1} e^{- 2 \pi i k m / (N / T)} x_{T m + t} \\ &= \sum_{t=0}^{T-1} e^{-2 \pi i k t / N} DFT(N/T, t, x) \\ Y_{t,k} &= DFT(N/T, t, x) \\ X_k &= \sum_{t=0}^{T-1} e^{-2 \pi i k t / N} Y_{t, k} \\ \end{aligned}

Continuing, it turns out the recombination is in fact a set of size $$T$$ DFTs:

\begin{aligned} X_k &= \sum_{t=0}^{T-1} e^{-2 \pi i k t / T} \frac{e^{-2 \pi i k t / N}}{e^{-2 \pi i k t / T}} Y_{t, k} \\ Z_{t,k} &= \frac{e^{-2 \pi i k t / N}}{e^{-2 \pi i k t / T}} Y_{t, k} \\ X_k &= \sum_{t=0}^{T-1} e^{-2 \pi i k t / T} Z_{t. k} \\ &= DFT(T, 0, Z_{t, k}) \end{aligned}

That is, X, the length N DFT of x, can be calculated by first doing T-many length N/T DFTs interleaved of x, multiplying by twiddle factors, then recombining with N/T-many length T DFTs. Inverse DFT follows the same process in reverse.

Applying this recursively gives an implementation of the Fast Fourier Transform algorithm, reducing the cost of DFT from $$O(N^2)$$ to $$O(N \log N)$$.

## # 3 Parallel FFTW

Once you have “planned” your FFT using the FFTW library, you can execute it from another thread. One strategy for large FFTs is to split it into a set of smaller FFTs as above, and execute each in parallel (perhaps using OpenMP).

I implemented it in a phase vocoder time stretcher:

16 threads is fastest in terms of waiting, but is about 35% less efficient in terms of output duration vs CPU seconds than when using 2 threads (the minimum my current code supports). 8 threads is probably the sweet spot: 15% less efficient than 2 threads with waiting times 3x shorter (fast enough for realtime streaming).

# threads   output/CPU seconds   realtime speed
2           0.21                 0.420
4           0.20                 0.746
8           0.18                 1.232
16          0.13                 2.015


Compared with using FFTW3’s built-in OpenMP support:

# threads   output/CPU seconds   realtime speed
1           0.285                0.285
2           0.251                0.482
4           0.208                0.762
8           0.156                1.149
16          0.125                1.259


Throughput is higher at higher thread counts for manual splitting/threading, though it’s much harder to implement.