Realtime Monotone
After instrumenting Monotone with OpenGL timer queries I could see where the major bottleneck lay:
IFS( 7011.936000 ) FLAT( 544.672000 ) PBO( 2921.728000 ) SORT( 6797.760000 ) LUP( 71136.064000 ) TEX( 284.224000 ) DISP( 272.480000 )
LUP is the per-pixel binary search lookup for histogram equalisation, which compresses the dynamic range of the HDR fractal to something suitable for display. The SORT stage that precedes it generates the histogram from a 4x4 downscaled image. A quick calculation shows that LUP takes 80% of the GPU time (71136 out of a total of about 88969), so it is a good focus for optimisation efforts.
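The quick calculation can be reproduced with a short script. This is only a sketch: the names `parse_timings` and `shares` are made up for illustration, and the regular expression assumes the `NAME( value )` layout of the log line above.

```python
import re

# Timing line copied from the pre-optimisation output above.
LOG = ("IFS( 7011.936000 ) FLAT( 544.672000 ) PBO( 2921.728000 ) "
       "SORT( 6797.760000 ) LUP( 71136.064000 ) TEX( 284.224000 ) "
       "DISP( 272.480000 )")

def parse_timings(line):
    """Return {stage: time} from a line like 'IFS( 1.0 ) FLAT( 2.0 )'."""
    pairs = re.findall(r"(\w+)\(\s*([0-9.]+)\s*\)", line)
    return {name: float(value) for name, value in pairs}

def shares(timings):
    """Return each stage's percentage of the summed stage times."""
    total = sum(timings.values())
    return {name: 100.0 * t / total for name, t in timings.items()}

timings = parse_timings(LOG)
# Print the stages from most to least expensive.
for name, pct in sorted(shares(timings).items(), key=lambda kv: -kv[1]):
    print(f"{name:5s} {pct:5.1f}%")
```

LUP comes out on top at 80.0%, with every other stage under 10%.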
The 4x4 downscaled image for the histogram still has a lot of pixels: 129600 (1920x1080 divided by 16). LUP involves finding an index into this sorted array, which gives a value with around 17 bits of precision, because a binary search over 129600 entries needs about 17 steps. However, typical computer displays are only 8-bit (256 values), so the extra 9 random-access texture lookups per pixel needed to get a more accurate value are a waste of time and effort. Combined with coarsening the downscale from 4x4 to 8x8 (a quarter as many pixels to sort), computing a less accurate (but visually indistinguishable) histogram equalisation allows Monotone to now run at 30 fps at 1920x1080 full HD resolution. Here are the detailed timing metrics after optimisation:
IFS( 7087.104000 ) FLAT( 509.888000 ) PBO( 2744.864000 ) SORT( 1409.440000 ) LUP( 15696.352000 ) TEX( 281.472000 ) DISP( 290.848000 )
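The truncated lookup can be sketched on the CPU as follows. This is not Monotone's shader code, just an illustration of the idea under some assumptions: the function names are invented, the sorted list stands in for the sorted downscaled image, and the random samples are a hypothetical substitute for real HDR pixel values. Each loop iteration corresponds to one texture lookup and fixes one more bit of the output.

```python
import bisect
import math
import random

def lookups_needed(n):
    """Comparisons a full binary search needs to resolve n entries."""
    return math.ceil(math.log2(n))

def equalise_exact(value, sorted_values):
    """Fraction of samples <= value, found by a full binary search."""
    return bisect.bisect_right(sorted_values, value) / len(sorted_values)

def equalise_truncated(value, sorted_values, steps=8):
    """Approximate the same fraction using only `steps` comparisons.

    Each step compares against one quantile of the sorted array and
    decides one bit of the result, most significant bit first.
    """
    n = len(sorted_values)
    levels = 1 << steps
    level = 0
    for bit in reversed(range(steps)):
        candidate = level | (1 << bit)
        if sorted_values[candidate * n // levels] <= value:
            level = candidate
    return level / levels

# Hypothetical stand-in for the sorted 4x4 downscaled HDR image.
rng = random.Random(0)
samples = sorted(rng.expovariate(1.0) for _ in range(129600))

# Full search depth at 4x4 and at 8x8 downscaling.
print(lookups_needed(129600), lookups_needed(32400))  # 17 15

# Largest disagreement between the 8-step and the full lookup.
worst = max(abs(equalise_truncated(v, samples) - equalise_exact(v, samples))
            for v in samples[::997])
print(worst <= 1 / 256)
```

The 8-step result never differs from the full search by more than one display level (1/256), which is why the two are visually indistinguishable on an 8-bit display.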
A productive day!