# Phase Vocoder

A spectral algorithm for independent time and pitch manipulation.

It is built on the discrete Fourier transform (DFT), computed in practice with an FFT.

# 1 Time Stretch

# 1.1 Parameters

These can be tuned to taste.

\(M\): input audio block size (integer, e.g. 8192).

\(\Delta I\): input audio hop size (integer, e.g. 19).

\(N\): zero-padded block size (for the DFT, integer, e.g. 262144).

\(\Delta O\): output audio hop size (integer, e.g. 432).

The time dilation factor is \(\frac{\Delta O}{\Delta I}\): values above \(1\) lengthen the audio, values below \(1\) shorten it.
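As a quick check of what the example values imply, the arithmetic can be sketched as below. The variable names (`dI`, `dO`, `pad_factor` and so on) are illustrative choices, not part of the original description.

```python
from fractions import Fraction

# Example values from the parameter list above.
M, dI, N, dO = 8192, 19, 262144, 432

stretch = Fraction(dO, dI)      # time dilation factor, about 22.74
analysis_overlap = M / dI       # input blocks overlapping at any one sample
synthesis_overlap = M / dO      # output blocks overlapping at any one sample
pad_factor = N // M             # zero-padding ratio N / M

print(float(stretch), analysis_overlap, synthesis_overlap, pad_factor)
```

With these values the output is roughly 22.7 times longer than the input, and each block is zero-padded to 32 times its length before the transform.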

# 1.2 Algorithm

Every \(\Delta I\) samples, take a block of \(M\) samples of input audio.

Multiply the block by a raised cosine window of length \(M\) (peak amplitude \(2\), mean \(1\)).

Zero-pad the windowed block to length \(N\) and call it \(x(n)\), where \(n = 0, 1, 2, \ldots\) is the block index.
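The analysis steps so far can be sketched in plain Python as follows. The window formula \(1 - \cos(2 \pi k / M)\) is one raised cosine with peak \(2\) and mean \(1\). The function names, and the choice to append the zero padding at the end of the block, are assumptions made for illustration.

```python
import math

def raised_cosine(M):
    """Window w[k] = 1 - cos(2*pi*k/M): peak 2 at k = M/2 (even M), mean 1."""
    return [1.0 - math.cos(2.0 * math.pi * k / M) for k in range(M)]

def analysis_block(audio, n, M, dI, N):
    """Return block n of the input, windowed and zero-padded to length N."""
    start = n * dI                      # blocks begin every dI samples
    w = raised_cosine(M)
    block = [audio[start + k] * w[k] for k in range(M)]
    # Assumption: the padding goes after the windowed samples.
    return block + [0.0] * (N - M)
```

For example, `analysis_block(audio, 3, 8192, 19, 262144)` would return the fourth block, starting at input sample 57.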

Take the discrete Fourier transform of \(x(n)\), call it \(X(n)\).

For each bin, normalize the (complex-valued) ratio \(\frac{X(n)}{X(n - 1)}\) to magnitude \(1\), and raise it to the power \(\frac{\Delta O}{\Delta I}\) (which need not be an integer). Call the result \(\delta \theta(n)\). If the computation fails (division by zero, or a non-finite result), set \(\delta \theta(n) = 1\).

Accumulate the phase of each bin: \(\theta(n) = \theta(n - 1) \times \delta \theta(n)\), then renormalize \(\theta(n)\) to magnitude \(1\) (just to be safe in case of rounding errors). The initial value \(\theta(-1)\) is probably arbitrary, but it should have magnitude \(1\).

Then the output Fourier transform has the input’s magnitude with the accumulated phase: \(Y(n) = |X(n)| \times \theta(n)\).
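For a single bin, the phase update can be sketched with the standard `cmath` module. Raising a unit-magnitude complex number to a real power is done here by scaling its principal phase angle, which is the same value a complex power would give. The function names `phase_step` and `update_bin` are hypothetical.

```python
import cmath

def phase_step(X_now, X_prev, stretch):
    """Unit-magnitude phase advance of one bin, raised to the power `stretch`."""
    try:
        ratio = X_now / X_prev
        ratio /= abs(ratio)             # normalize to magnitude 1
    except ZeroDivisionError:
        return 1 + 0j                   # fallback described in the text
    if not cmath.isfinite(ratio):
        return 1 + 0j                   # fallback for overflow or NaN
    # Scale the principal phase angle by the stretch factor.
    return cmath.rect(1.0, cmath.phase(ratio) * stretch)

def update_bin(X_now, X_prev, theta_prev, stretch):
    """Return (theta, Y) for one bin: accumulated phase and output value."""
    theta = theta_prev * phase_step(X_now, X_prev, stretch)
    theta /= abs(theta)                 # renormalize against rounding drift
    return theta, abs(X_now) * theta
```

A full implementation would apply `update_bin` to every bin of every block, starting from some unit-magnitude `theta_prev` such as `1 + 0j`.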

Take the inverse Fourier transform of \(Y(n)\), call it \(y(n)\).

Multiply \(y(n)\) by a raised cosine window of length \(M\) (peak amplitude \(2\), mean \(1\)), keeping only the \(M\) samples that the window covers.

Multiply by the gain factor \(G = \frac{\Delta O}{M \times N}\). This assumes an unnormalized DFT/FFT, so that a forward transform followed by an inverse transform scales the signal by \(N\).

Add the resulting \(M\) samples into the output audio stream, starting each block \(\Delta O\) samples after the previous one (overlap-add).
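The synthesis side (window, gain, overlap-add) can be sketched as below. This is a minimal illustration: it assumes `y` already holds the real-valued samples of the unnormalized inverse DFT, and that the \(M\) meaningful samples are the first \(M\) of the block. The name `overlap_add` is hypothetical. The loop at the end feeds in constant blocks of value \(N\), standing in for an unnormalized round trip of a constant signal, and exercises only the synthesis window and gain.

```python
import math

def raised_cosine(M):
    """Window w[k] = 1 - cos(2*pi*k/M): peak 2, mean 1."""
    return [1.0 - math.cos(2.0 * math.pi * k / M) for k in range(M)]

def overlap_add(output, y, n, M, dO, N):
    """Window, scale and add block n (real samples y) into the output list."""
    gain = dO / (M * N)                 # G from the text
    w = raised_cosine(M)
    start = n * dO                      # blocks land every dO samples
    # Grow the output buffer if this block runs past its current end.
    if len(output) < start + M:
        output.extend([0.0] * (start + M - len(output)))
    for k in range(M):
        # Assumption: the first M samples of y are the ones to keep.
        output[start + k] += gain * w[k] * y[k]
    return output

# Small demonstration with toy sizes.
M, dO, N = 8, 2, 16
out = []
for n in range(10):
    overlap_add(out, [float(N)] * N, n, M, dO, N)
print(out[8:18])                        # region where four blocks overlap
```

In this toy case the overlapped windows sum to a constant, so the printed region comes out at unit level. That shows how the factor \(\frac{\Delta O}{M}\) compensates for the number of overlapping blocks and \(\frac{1}{N}\) for the transform gain.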

# 1.3 References

- Pure Data documentation, 3.audio.examples/I07.phase.vocoder.pd (Pd version 0.53).