Optimizing audio DSP code

The current work-in-progress version of GraphGrow has three components that communicate via OSC over network. The visuals are rendered in OpenGL using texture array feedback, this process is graphgrow-video. The transformations are controlled by graphgrow-iface, with the familiar nodes and links graphical user interface. The interface runs on Linux using GLFW, and I'm working on an Android port for my tablet using GLFM. The component I'll be talking about in this post is the graphgrow-audio engine, which makes sounds using an audio feedback delay network with the same topology as the visual feedback network. Specifically, I'll be writing up my notes on what I did to make it around 2x as CPU efficient, while still making nice sounds.

First up, I tried gprof, but after following the instructions I only got an empty profile. My guess is that it doesn't like JACK doing the audio processing in a separate realtime thread. So I switched to perf:

perf record ./graphgrow-audio
perf report

Here's the first few lines of the first report, consider it a baseline:

Overhead  Command          Shared Object        Symbol
  34.41%  graphgrow-audio  graphgrow-audio      [.] graphgrow::operator()
  18.11%  graphgrow-audio  libm-2.24.so         [.] expm1f
  14.27%  graphgrow-audio  graphgrow-audio      [.] audiocb
   8.69%  graphgrow-audio  libm-2.24.so         [.] sincos
   8.34%  graphgrow-audio  libm-2.24.so         [.] __logf_finite
   4.47%  graphgrow-audio  libm-2.24.so         [.] tanhf

That was after I already made some algorithmic improvements: I had 32 delay lines, all of the same length, with 64 delay line readers in two groups of 32, each group reading at the same offset every sample. This meant there was a lot of duplicated work calculating the delay line interpolation coefficients. I factored out the computation of the delay coefficients into another struct, which could be calculated 1x per sample instead of 32x per sample. Then the delay readers are passed the coefficients, instead of computing them themselves.

Looking at what to optimize, the calls to expm1f() seem to be a big target. Looking through the code I saw that I had 32 dynamic range compressors, each doing RMS to dB (and back) conversions every sample, which means a lot of log and exp. My compressor had a ratio of 1/8, so I replaced the gain logic by a version that worked in RMS with 3x sqrt instead of 1x log + 1x exp per sample:

index a6ba512..588d098 100644
--- a/graphgrow3/audio/graphgrow-audio.cc
+++ b/graphgrow3/audio/graphgrow-audio.cc
@@ -549,18 +549,25 @@ struct compress
   sample factor;
   hip hi;
   lop lo1, lo2;
+  sample thresrms;
   compress(sample db)
   : threshold(db)
   , factor(0.25f / dbtorms((100.0f - db) * 0.125f + db))
   , hi(5.0f), lo1(10.0f), lo2(15.0f)
+  , thresrms(dbtorms(threshold))
   {
   };
   signal operator()(const signal &audio)
   {
     signal rms = lo2(0.01f + sqrt(lo1(sqr(hi(audio)))));
+#if 0
     signal db = rmstodb(rms);
     db = db > threshold ? threshold + (db - threshold) * 0.125f : threshold;
     signal gain = factor * dbtorms(db);
+#else
+    signal rms2 = rms > thresrms ? thresrms * root8(rms / thresrms) : thresrms;
+    signal gain = factor * rms2;
+#endif
     return tanh(audio / rms * gain);
   };
 };

This seemed to work, the new perf output was:

Overhead  Command          Shared Object       Symbol
  38.89%  graphgrow-audio  graphgrow-audio     [.] graphgrow::operator()
  22.11%  graphgrow-audio  libm-2.24.so        [.] expm1f
  10.78%  graphgrow-audio  libm-2.24.so        [.] sincos
   5.76%  graphgrow-audio  libm-2.24.so        [.] tanhf

The numbers are higher, but this is actually an improvement, because if graphgrow::operator() goes from 34% to 39%, everything else has gone from 66% to 61%, and I didn't touch graphgrow::operator(). Now, there are still some large amounts of expm1f(), but none of my code calls that, so I made a guess: perhaps tanhf() calls expm1f() internally? My compressor used tanh() for soft-clipping, so I tried simply removing the tanh() call and seeing if the audio would explode or not. In my short test, the audio was stable, and CPU usage was greatly reduced:

Overhead  Command          Shared Object       Symbol
  60.53%  graphgrow-audio  graphgrow-audio     [.] graphgrow::operator()
  17.62%  graphgrow-audio  libm-2.24.so        [.] sincos
  11.51%  graphgrow-audio  graphgrow-audio     [.] audiocb

The next big target is that sincos() using 18% of the CPU. The lack of 'f' suffix tells me this is being computed in double precision, and the only place in the code that was doing double precision maths was the resonant filter biquad implementation. The calculation of the coefficients used sin() and cos(), at double precision, so I swapped them out for single precision polynomial approximations (9th order, I blogged about them before). The approximation is roughly accurate (only a bit or two out) for float (24bits) which should be enough: it's only to control the angle of the poles, and a few cents (or more, I didn't check) error isn't so much to worry about in my context. Another big speed improvement:

Overhead  Command          Shared Object       Symbol
  85.48%  graphgrow-audio  graphgrow-audio     [.] graphgrow::operator()
  11.45%  graphgrow-audio  graphgrow-audio     [.] audiocb
   1.22%  graphgrow-audio  libc-2.24.so        [.] __isinff
   0.64%  graphgrow-audio  libc-2.24.so        [.] __isnanf
   0.41%  graphgrow-audio  graphgrow-audio     [.] graphgrow::graphgrow

perf has a mode that annotates the assembly and source with hot instructions, looking at that let me see that the resonator was using double precision sqrt() when calculating the gain when single precision sqrtf() would be enough:

Overhead  Command          Shared Object       Symbol
  88.70%  graphgrow-audio  graphgrow-audio     [.] graphgrow::operator()
   8.49%  graphgrow-audio  graphgrow-audio     [.] audiocb
   1.24%  graphgrow-audio  libc-2.24.so        [.] __isinff
   0.65%  graphgrow-audio  libc-2.24.so        [.] __isnanf

Replacing costly double precision calculations with cheaper single precision calculations was fun, so I thought about how to refactor the resonator coefficient calculations some more. One part that definitely needed high precision was the calculation of 'r = 1 - t' with t near 0. But I saw some other code was effectively calculating '1 - r', which I could replace with 't', and make it single precision. Again, some code was doing '1 - c * c' with c the cosine of a value near 0 (so 'c' is near 1 and there is catastrophic cancellation), using basic trigonometry this can be replaced by 's * s' with s the sine of the value. However, I kept the final recursive filter in double precision, because I had bad experiences with single precision recursive filters in Pure-data (vcf~ had strongly frequency-dependent ring time, porting to double precision fixed it).

Overhead  Command          Shared Object       Symbol
  87.86%  graphgrow-audio  graphgrow-audio     [.] graphgrow::operator()
   9.18%  graphgrow-audio  graphgrow-audio     [.] audiocb
   1.37%  graphgrow-audio  libc-2.24.so        [.] __isinff
   0.68%  graphgrow-audio  libc-2.24.so        [.] __isnanf

The audiocb reponds to OSC from the user interface process, it took so much time because it was busy-looping waiting for the JACK processing to be idle, which was rare because I still hadn't got the CPU load down to something that could run XRUN-free in realtime at this point. I made it stop that, at the cost of increasing the likelihood of the race condition when storing the data from OSC:

Overhead  Command          Shared Object       Symbol
  96.87%  graphgrow-audio  graphgrow-audio     [.] graphgrow::operator()
   1.49%  graphgrow-audio  libc-2.24.so        [.] __isinff
   0.80%  graphgrow-audio  libc-2.24.so        [.] __isnanf

Still not running in realtime, I took drastic action, to compute the resonator filter coefficients only every 64 samples instead of every 1 sample, and linearly interpolate the 3 values (1x float for gain, 2x double for feedback coefficients). This is not a really nice way to do it from a theoretical standpoint, but it's way more efficient. I also check for NaN or Infinity only at the end of each block of 64 samples (if that happens I replace the whole block with zeroes/silence), which is also a bit of a hack - exploding filters sound bad whatever you do to mitigate it, but I haven't managed to make it explode very often.

Success: now it was using 60% of the CPU, comfortably running in real time with no XRUNs. So I added in a (not very accurate, but C2 continuous) rational approximation of tanh() to the compressor that I found on musicdsp (via Pd-list):

signal tanh(const signal &x)
{
  signal x2 = x * x;
  return
    x < -3.0f ? -1.0f :
    x >  3.0f ?  1.0f :
    x * (27.0f + x2) / (27.0f + 9.0f * x2);
}

CPU usage increased to 72% (I have been doing all these tests with the CPU frequency scaling governor set to performance mode so that figures are comparable). I tried g++-8 instead of g++-6, CPU usage reduced to 68%. I tried clang++-3.8 and clang++-6.0, which involved some invasive changes to replace 'c?a:b' (over vectors) with a 'select(c,a,b)' template function, but CPU usage was over 100% with both versions. So I stuck with g++-8.

The last thing I did was an algorithmic improvement: I was doing 4x the necessary amount of work in one place. Each of 4 "rules" gets fed through 4 "edges" per rule, and each edge pitchshifted its rule down an octave. By shifting (ahem) the pitchshifting from being internal to the edge to being internal to the rule, I saved 15% CPU (relative) 10% CPU (absolute), now there are only 8 pitchshifting delay lines instead of 32.

In conclusion, today I brought down the CPU usage of graphgrow-audio down from "way too high to run in realtime", to "60% of 1 core", benchmarking on an Intel Atom N455 1.66GHz netbook running Linux in 32bit (i686) mode. A side channel may be lurking, as the CPU usage (in htop) of graphgrow-audio goes up to 80% temporarily while I regenerate my blog static site...