Making KF fast again
Back in 2017 I forked the Windows fractal explorer software Kalles Fraktaler 2, and I've been working on it steadily since, adding plenty of new features (and bugs). My fork's website is here, with binary downloads for Windows (which also run under Wine on Linux).
I had been maintaining 3 branches of various ages, purely because the 2.12 branch was faster than the 2.13 and 2.14 branches and I couldn't figure out why until recently (hence this blog post). The cause turned out to be quite obscure. This is the patch that fixed it:
```diff
diff --git a/formula/formula.xsl b/formula/formula.xsl
index b47f763..d07c002 100644
--- a/formula/formula.xsl
+++ b/formula/formula.xsl
@@ -370,6 +370,7 @@ bool FORMULA(perturbation,<xsl:value-of select="../@type" />,<xsl:value-of selec
       (void) Ai; // -Wunused-variable
       (void) A; // -Wunused-variable
       (void) c; // -Wunused-variable
+      bool no_g = g_real == 1.0 && g_imag == 1.0;
       int antal = antal0;
       double test1 = test10;
       double test2 = test20;
@@ -385,7 +386,14 @@ bool FORMULA(perturbation,<xsl:value-of select="../@type" />,<xsl:value-of selec
         Xxr = Xr + xr;
         Xxi = Xi + xi;
         test2 = test1;
-        test1 = double(g_real * Xxr * Xxr + g_imag * Xxi * Xxi);
+        if (no_g)
+        {
+          test1 = double(Xxr * Xxr + Xxi * Xxi);
+        }
+        else
+        {
+          test1 = double(g_real * Xxr * Xxr + g_imag * Xxi * Xxi);
+        }
         if (test1 < Xz)
         {
           bGlitch = true;
```
In short, it adds a branch inside the inner loop to avoid two multiplications by 1.0 (which would leave the value unchanged). Normally branches inside inner loops are harmful for optimization, but because the condition is static and unchanging over the iterations, the compiler can swap the order of the loop and the branch, generating code for two loops, one of which has the two multiplications completely gone. In real-world usage the values are almost always both 1.0: they determine which parts of the value are used for the escape test (and the glitch test, but this is probably a bug).
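This is the classic loop unswitching transformation. Here is a minimal sketch of the idea (not KF's actual code, and not what the compiler literally emits; the function names are made up for illustration):

```c
/* Before: a loop-invariant branch inside the hot loop.
   The no_g flag never changes while the loop runs. */
double escape_sum_branchy(const double *x, int n, double g, int no_g)
{
    double acc = 0.0;
    for (int i = 0; i < n; ++i)
    {
        if (no_g)
            acc += x[i] * x[i];
        else
            acc += g * x[i] * x[i];
    }
    return acc;
}

/* After unswitching: the branch is taken once, outside the loop,
   and the no_g loop has the multiplication by g gone entirely. */
double escape_sum_unswitched(const double *x, int n, double g, int no_g)
{
    double acc = 0.0;
    if (no_g)
    {
        for (int i = 0; i < n; ++i)
            acc += x[i] * x[i];
    }
    else
    {
        for (int i = 0; i < n; ++i)
            acc += g * x[i] * x[i];
    }
    return acc;
}
```

GCC can do this rewrite itself (-funswitch-loops, enabled at -O3) when it can prove the condition is invariant across iterations, which is exactly what the no_g flag makes explicit.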
The performance boost from this patch was about 20% (CPU time), which is huge in the grand scheme of things, so I was quite happy: it brings the performance of kf-2.14.7.1 back to the level of the 2.12 branch, so I no longer have to support that branch (by backporting bugfixes).
But when you get a taste for speed, you want more. So far KF has not taken full advantage of CPUs: until now it has been resolutely scalar, computing one pixel at a time in each thread. Last night I started work on upgrading KF to use vectorization (aka SIMD). Now when I compile KF for my own CPU (which is not portable, so I won't ship binaries with these flags enabled), I get an 80% (CPU time) speed boost, which is absolutely ginormous. When compiling for more conservative CPU targets (Intel Haswell / AMD Excavator) the speed boost is 61%, which is still a very nice thing to have, and with no CPU-specific flags (baseline x86_64) the speed boost is 55%, which is great too.
The vectorization work is not finished yet; so far it is only added for "type R" formulae in `double` precision (which allows zoom depths to 1e300 or so). Unfortunately `long double` (used after `double`, until 1e4900 or so) has no SIMD support at the hardware level, but I will try to add it for the `floatexp` type used for even deeper zooms (who knows, maybe `floatexp`+SIMD will be competitive with `long double`, but I doubt it...). I will also add support for "type C" formulae before the release, which is a little complicated by the hoops you have to jump through to get gcc to broadcast a scalar to a vector in initialization.
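As a concrete illustration of those hoops, here is a minimal, self-contained sketch using GCC's vector extensions; the `double4` type name is made up for this example and none of this is KF's actual code:

```c
#include <stdio.h>

/* A vector of 4 doubles (32 bytes, i.e. one AVX register where available). */
typedef double double4 __attribute__ ((vector_size (32)));

int main(void)
{
    double cr = 0.25;
    /* The hoop: `double4 a = cr;` is rejected.  A scalar has to be broadcast
       explicitly, either by spelling out every lane ... */
    double4 a = { cr, cr, cr, cr };
    /* ... or by mixing the scalar into vector arithmetic, where gcc converts
       (broadcasts) it to the vector type. */
    double4 b = cr + (double4){ 0.0, 0.0, 0.0, 0.0 };
    /* Operators on vector types work element-wise and compile down to SIMD
       instructions when the target CPU supports them. */
    double4 sum = a * a + b;
    for (int i = 0; i < 4; ++i)
        printf("%g\n", sum[i]);
    return 0;
}
```

For reference, 2 doubles fill an SSE2 register, 4 fill an AVX register, and 8 fill an AVX-512 register, which matches the vector sizes benchmarked in the table below.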
Here's a table of differently optimized KF versions (git/64 is the baseline x86_64 build, git/64+ targets Intel Haswell / AMD Excavator, and git/native targets my own CPU):
version | vector size | wall-clock time | CPU time | speed boost (wall-clock) | speed boost (CPU)
---|---|---|---|---|---
2.14.7.1 | 1 | 3m47.959s | 23m30.676s | 1.00 | 1.00
git/64 | 1 | 3m46.703s | 23m26.290s | |
git/64 | 2 | 3m22.280s | 15m11.022s | 1.13 | 1.55
git/64 | 4 | 3m55.158s | 25m26.638s | |
git/64+ | 1 | 3m46.977s | 23m26.065s | |
git/64+ | 2 | 3m13.442s | 14m34.363s | 1.18 | 1.61
git/64+ | 4 | 3m26.012s | 14m54.546s | |
git/native | 1 | 3m42.554s | 21m51.381s | |
git/native | 2 | 3m10.440s | 13m26.100s | |
git/native | 4 | 3m08.784s | 13m06.386s | 1.21 | 1.80
git/native | 8 | 3m50.812s | 24m01.230s | |
All these benchmarks are with Dinkydau's "Evolution of Trees" location, quadratic Mandelbrot set at zoom depth 5e227, with a maximum iteration count of 1200000. Image size was 3840x2160. My CPU is an AMD Ryzen 7 2700X Eight-Core Processor (with 16 threads that appear as distinct CPUs to Linux). Wall-clock performance doesn't scale up as much as CPU time because some parts (computing the reference orbits) are sequential; only the perturbed per-pixel orbits are embarrassingly parallel.