If you're wondering where all the action is, given that this blog has been quiet for half a year, I've been gradually populating a new area of my website:

I'm finding a less time-oriented site more useful for making incremental changes to pages.

A couple of years ago I wrote about fixed-point numerics. The multiplication presented there uses \(O(n^2)\) individual limb-by-limb multiplications to multiply numbers with \(n\) limbs, which isn't very efficient.

Multiplying two \(n\)-limb numbers gives a number with \(2n\) limbs. For simplicity, assume \(n\) is a power of \(2\), say \(n = 2^k\). For \(x\) with \(n\) limbs, write \(\operatorname{hi}(x)\) to mean the top half of \(x\) with \(n/2\) limbs, and \(\operatorname{lo}(x)\) to mean the bottom half of \(x\) with \(n/2\) limbs. Suppose the numbers represent fractions in \([0..1)\), that is, there are no limbs before the point and \(n\) limbs after. Full multiplication would give a fraction with \(2n\) limbs after the point, but maybe we don't care for the extra precision and truncating to the first \(n\) limbs would be perfectly fine. Maybe we don't even care about errors in the last few bits.

Writing \(x \times y\) for the full multiplication that gives \(2n\) limbs, we want to compute \(\operatorname{hi}(x \times y)\) as efficiently as possible. To be precise, we want to minimize the number of limb-by-limb multiplications, because they are relatively expensive compared to additions.

As a baseline, here's a simple recursive implementation of full multiplication:

\[
\begin{array}{rcccc}
x \times y & = & \operatorname{hi}(x)\times\operatorname{hi}(y) & \cdots & \cdots \\
 & + & \cdots & \operatorname{hi}(x)\times\operatorname{lo}(y) & \cdots \\
 & + & \cdots & \operatorname{lo}(x)\times\operatorname{hi}(y) & \cdots \\
 & + & \cdots & \cdots & \operatorname{lo}(x)\times\operatorname{lo}(y)
\end{array}
\]

You can see that the cost of an \(n\)-limb multiply is \(4\) times the cost of an \(n/2\)-limb multiply, which means that the overall cost works out at \(n^2\) limb-by-limb multiplies.

Now assuming truncation and errors in the last few bits are fine, the recursive implementation becomes:

\[
\begin{array}{rcccc}
\operatorname{hi}(x \times y) & = & \operatorname{hi}(x)\times\operatorname{hi}(y) & \cdots & \cdots \\
 & + & \cdots & \operatorname{hi}(\operatorname{hi}(x)\times\operatorname{lo}(y)) & \cdots \\
 & + & \cdots & \operatorname{hi}(\operatorname{lo}(x)\times\operatorname{hi}(y)) & \cdots
\end{array}
\]

The cost of an \(n\)-limb truncated multiply is \(1\) times the cost of an \(n/2\)-limb full multiply plus \(2\) times the cost of an \(n/2\)-limb truncated multiply. Working out the first few terms \[1,3,10,36,136,528,\ldots\] shows that it has about half the cost of the full multiply's \[1,4,16,64,256,1024,\ldots\] and in fact the overall cost works out at \(n(n+1)/2\) limb-by-limb multiplies, a significant improvement! This truly is a clever trick!
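These cost recurrences can be written down directly; this is a small sketch (counting only limb-by-limb multiplies, for \(n = 2^k\) limbs), not part of the original post's code:

```haskell
-- Cost in limb-by-limb multiplies, for n = 2^k limbs.
fullCost, truncCost :: Integer -> Integer

-- full multiply: four half-size full multiplies
fullCost 1 = 1
fullCost n = 4 * fullCost (n `div` 2)

-- truncated multiply: one half-size full multiply
-- plus two half-size truncated multiplies
truncCost 1 = 1
truncCost n = fullCost (n `div` 2) + 2 * truncCost (n `div` 2)
```

Evaluating these reproduces the sequences in the text, and `truncCost n` agrees with the closed form \(n(n+1)/2\).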

However, for full multiplication there is a way to reduce multiplications even more dramatically. The trick is called Karatsuba multiplication and is based on the algebraic identity \[ (a + b)\times(u + v) = a \times u + b \times u + a \times v + b \times v \] Now set \(a = \operatorname{hi}(x), b = \operatorname{lo}(x), u = \operatorname{hi}(y), v = \operatorname{lo}(y) \) and notice that the four terms on the right are what we are computing with the first recursive implementation above.

Now rewrite the algebra to move terms around: \[ b \times u + a \times v = (a + b)\times(u + v) - (a \times u + b \times v) \] so we can replace two multiplications on the left by one multiplication on the right, with the two other multiplications on the right being things we need to compute anyway.

(There are two extra \(n/2\)-limb additions for \(a+b\) and \(u+v\), and one extra \(n\)-limb subtraction, which is guaranteed to give a non-negative result if all of \(a,b,u,v\) are non-negative; but the cost of limb-by-limb addition and subtraction is usually much less than limb-by-limb multiplication, and \(n\)-limb addition needs only \(n\) limb-by-limb additions, which is small.)

This means the cost of \(n\)-limb full multiplication is \(3\) times the cost of \(n/2\)-limb full multiplication, the first terms are \[ 1,3,9,27,81,243,\ldots \] and the cost works out to \(n^{\log_2{3}}\), which is approximately \(n^{1.585}\). This reduced power (vs \(n^2\)) is an asymptotic improvement, while the truncation trick only improved constant factors. For example, suppose \(n = 256\): the original needs \(65536\) limb-by-limb multiplications giving \(512\) limbs, the truncated needs \(32896\) limb-by-limb multiplications giving \(256\) limbs, and Karatsuba needs \(6561\) limb-by-limb multiplications giving \(512\) limbs. So an order of magnitude improvement!
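The scheme can be sketched on Haskell's arbitrary-precision Integers, splitting at half the bit length rather than at a limb count; this is an illustration of the identity, not the post's fixed-point limb implementation:

```haskell
-- Karatsuba multiplication of non-negative Integers (sketch).
karatsuba :: Integer -> Integer -> Integer
karatsuba x y
  | x < 16 || y < 16 = x * y            -- base case: direct multiply
  | otherwise        = hh * b * b + mid * b + ll
  where
    m = max (bitLen x) (bitLen y) `div` 2
    b = 2 ^ m
    (xh, xl) = x `divMod` b             -- hi and lo halves of x
    (yh, yl) = y `divMod` b             -- hi and lo halves of y
    hh  = karatsuba xh yh                           -- a × u
    ll  = karatsuba xl yl                           -- b × v
    mid = karatsuba (xh + xl) (yh + yl) - hh - ll   -- a × v + b × u

-- number of bits in a non-negative Integer
bitLen :: Integer -> Int
bitLen n = length (takeWhile (<= n) (iterate (* 2) 1))
```

Three recursive calls per level, as promised, at the price of two extra additions and one subtraction.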

But! This assumes the truncated multiplication is using the original full multiplication. Replacing the full multiplication with the Karatsuba multiplication, we get the cost of \(n\)-limb truncated multiplication is \(1\) times the cost of \(n/2\)-limb Karatsuba multiplication, plus \(2\) times the cost of \(n/2\)-limb truncated multiplication. This works out as... exactly the same cost as \(n\)-limb Karatsuba multiplication, only with a less precise answer!
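This equality can be checked with a small pair of cost recurrences (a sketch, counting limb-by-limb multiplies only):

```haskell
-- Karatsuba full multiply, and a truncated multiply whose full
-- multiplies are Karatsuba, counted in limb-by-limb multiplies.
karaCost, truncKaraCost :: Integer -> Integer

karaCost 1 = 1
karaCost n = 3 * karaCost (n `div` 2)

truncKaraCost 1 = 1
truncKaraCost n = karaCost (n `div` 2) + 2 * truncKaraCost (n `div` 2)
```

By induction `truncKaraCost n == karaCost n` for every power of two: \(3K(n/2) = K(n)\).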

So the Karatsuba trick is very much cleverer than the truncated multiply. A truncated multiply using Karatsuba might still be worth it: it needs half the space for output, and the number of limb-by-limb additions will be smaller. There are even more complicated multiplication algorithms out there than Karatsuba that get even better asymptotic efficiencies; go see how the GMP library does it.

Addendum: with 16-bit limbs on a Motorola 68000 CPU (as present in the Amiga A500(+)), the cost of a limb-by-limb multiply is 54 cycles average case (worst case 70 cycles; add 12 cycles to load the inputs into registers from memory), and the cost of limb-by-limb addition with carry is 4 cycles (18 cycles if operating on memory directly, or 30 cycles to operate on pairs of limbs (32 bits) in memory at a time). Reference: 68000 instruction timing.

The Mandelbrot set has two prominent solid regions. There is a cardioid, which is associated with fixed point (period 1) attractors, and a circle to the left, which is associated with period 2 attractors. The rest of the cardioid- and circle-like components in the Mandelbrot set are distorted.

These shapes can be described as implicit functions. For example, the circle is centered on \(-1+0i\) and has radius \(\frac{1}{4}\), and the function \[C_2(x, y) = (x - (-1))^2 + (y - 0)^2 - \left(\frac{1}{4}\right)^2\] is negative inside, zero on the boundary, and positive outside, and the same applies to the more complicated function for the cardioid: \[C_1(x, y) = \left( \left(x - \frac{1}{4}\right)^2 + y^2 \right)^2 + \left(x - \frac{1}{4}\right) \left(\left(x - \frac{1}{4}\right)^2 + y^2\right) - \frac{1}{4} y^2 \]

These implicit functions can be used to accelerate Mandelbrot set rendering. You can test if each \(c=x+iy\) is in the cardioid or circle quickly and easily, saving the cost of iterating the pixel all the way to the maximum iteration count (being interior to the Mandelbrot set means iterations of \(z \to z^2 + c\) will never escape to infinity).
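A direct transcription of the two implicit functions as interior tests, sketched in Haskell (strict inequality, so boundary points count as exterior):

```haskell
-- c = x + i y is inside the period-2 circle when C2(x, y) < 0
inCircle :: Double -> Double -> Bool
inCircle x y = (x + 1)^2 + y^2 - (1/4)^2 < 0

-- c = x + i y is inside the main cardioid when C1(x, y) < 0,
-- with q = (x - 1/4)^2 + y^2 factored out of C1
inCardioid :: Double -> Double -> Bool
inCardioid x y = q * q + xq * q - y^2 / 4 < 0
  where
    xq = x - 1/4
    q  = xq * xq + y * y
```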

But we can accelerate further. If the whole viewing rectangle of a zoomed-in view is far away from the circle and cardioid, then these per-pixel cardioid and circle tests are a waste of time, as they will never report inside. By analysing the coordinates of an axis-aligned bounding box (AABB) it's possible to decide when it's worth doing per pixel tests (that is, when the boundary of the shapes passes through the box - otherwise it's 100% interior or (more likely) 100% exterior to the shapes). This can save \(O(W H)\) work.

For example for the circle, if the lower edge of the box is above \(\frac{1}{4}\), or the upper edge of the box is below \(-\frac{1}{4}\), or the right edge of the box is left of \(-\frac{5}{4}\), or the left edge of the box is right of \(-\frac{3}{4}\), clearly it cannot overlap the circle. And if all corners are inside the circle, the whole box must be inside the circle. If some corners are inside and some are outside, then the boundary passes through.

But if all corners are outside it gets complicated: the box could be surrounding the whole circle, or a bulge of the circle could pop into an edge of the box. So the next step is to consider the vertices of the circle (the points most left/right/top/bottom): axis alignment means that if a bulge pops into the box, the vertex must lead the way. All in all there are many cases to consider, but it's not insurmountable.
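The circle case of this box classification could be sketched like this; to keep it short the complicated all-corners-outside case is left conservative (treated as if the boundary passes through), so per-pixel tests would still run for a box surrounding the whole circle:

```haskell
data Overlap = Outside | Inside | Boundary deriving (Eq, Show)

-- Classify an axis-aligned box [x0,x1] × [y0,y1] against the
-- period-2 circle (centre -1 + 0i, radius 1/4).
classifyCircle :: Double -> Double -> Double -> Double -> Overlap
classifyCircle x0 x1 y0 y1
    -- quick reject against the circle's bounding box
  | y0 > 1/4 || y1 < -1/4 || x1 < -5/4 || x0 > -3/4 = Outside
    -- all four corners inside: whole box inside
  | all inside corners = Inside
    -- mixed, or all outside: conservative fallback
  | otherwise = Boundary
  where
    corners = [ (x, y) | x <- [x0, x1], y <- [y0, y1] ]
    inside (x, y) = (x + 1)^2 + y^2 < (1/4)^2
```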

Similarly for the cardioid, with the added complication that the vertices are not rational. However, squaring the coordinates does give dyadic rationals, so comparing \(y^2\) with \(\frac{3}{64}\) and \(\frac{27}{64}\) can do the trick.

For deep zooms, coordinates need high precision (lots of digits, most of which are the same for nearby pixels). Perturbation techniques mean using a high precision reference, with low precision differences to nearby points. This can also be applied to the implicit functions for the cardioid and circle: symbolically expand and cancel the large terms \(X, Y\) leaving only small terms of the scale of \(x, y\) in: \[c(X, Y, x, y) = C(X + x, Y + y) - C(X, Y)\] then evaluate \(C(X, Y)\) in high precision, round it to low precision, and add \(c(X, Y, x, y)\) evaluated in low precision. For accuracy, some coefficients in \(c\) will need to be calculated at high precision before rounding to low precision.
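For the circle this expansion is small enough to show in full: \(C_2(X+x, Y+y) - C_2(X, Y) = 2(X+1)x + x^2 + 2Yy + y^2\). A sketch, with plain `Double` standing in for both precisions (in a real renderer \(X, Y\) and the coefficients \(2(X+1)\) and \(2Y\) would be evaluated at high precision, then rounded):

```haskell
-- c2 X Y x y = C2(X + x, Y + y) - C2(X, Y), with the large
-- terms cancelled symbolically so only small terms remain.
c2 :: Double -> Double -> Double -> Double -> Double
c2 bigX bigY x y = (2 * (bigX + 1) + x) * x + (2 * bigY + y) * y
```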

For example, the cardioid: \[c_1(X, Y, x, y) = a_x x + a_y y + a_{x^2} x^2 + a_{xy} xy + a_{y^2} y^2 + a_{x^3} x^3 + a_{x^2y} x^2y + a_{xy^2} xy^2 + a_{y^3} y^3 + x^4 + 2x^2y^2 + y^4 \] where \[a_x = (32XY^2+32X^3-6X+1)/8; a_y = (32Y^3+(32X^2-6)Y)/8; a_{x^2} = (16Y^2+48X^2-3)/8; \ldots \] I used wxMaxima to find all of the \(a\) coefficients, and calculate them one time per view using the reference, along with \(C_1(X, Y)\). For accuracy, with fixed point calculations you need about 4 times the number of fractional bits for intermediate calculations, and the values at low precision need to be relatively high accuracy (in my tests 24 bits was not enough to achieve good images, 53 seemed ok, and I use 64 bits just to be safe).

The previous discussion about rejecting interior checks for the whole view can be applied with perturbation too, but some magic numbers need to be calculated at high precision before rounding to low precision, namely the special points (vertices and cusps): \[ X + \frac{5}{4}, X + 1, X + \frac{3}{4}, X + \frac{1}{8}, X - \frac{1}{4}, X - \frac{3}{8} \] and \[ Y + \frac{1}{4}, Y - \frac{1}{4}, Y^2 - \frac{3}{64}, Y^2 - \frac{27}{64} \] the addends are all dyadic rationals so can be represented exactly in binary fixed point or floating point.

The circle and cardioid also have parametric forms. Here's the cardioid: \[C_1(t) = \left(\frac{\sin(t)^2-(\cos(t)-1)^2+1}{4}, \frac{(\cos(t)-1)\sin(t)}{2}\right) \] If you could work out the distance to the nearest point of the curve, then all views with a smaller circumradius and same center would be 100% exterior (or 100% interior). For the cardioid, considering that the dot product of the tangent of the curve at the nearest point and the vector from the point to the nearest point of the curve must be zero (perpendicular), it means solving \[(4\cos(t)-4\cos(2t))y+(4\sin(2t)-4\sin(t))x-\sin(t)=0\] which can be rearranged to a 9 or 10 degree polynomial in \(\tan(t)\) using trigonometric identities. This is altogether a hard problem to solve; most practical is bisection of the trigonometric form on segments between \(t = k\frac{\pi}{3}\). Linear distance estimate using Taylor expansion gives a closed form \(d(x,y)\), but it's not accurate, especially near the cusp. Quadratic Taylor expansion gives a high degree polynomial to solve. On the other hand, exact distance to a circle is easy.
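That easy circle case written out (a signed distance, negative inside):

```haskell
-- Signed distance from x + i y to the period-2 circle
-- (centre -1 + 0i, radius 1/4); negative means inside.
circleDistance :: Double -> Double -> Double
circleDistance x y = sqrt ((x + 1)^2 + y^2) - 1/4
```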

D IY

accedes guidera crady wildey reads, deline dobbie wiedeman speedily: derouin andresen dweeb maltreated freed’s, directors’ dwelley moldy hardegree. kedrowski dumpty bady audio, retired repealed debussy dirtiest; divinely ludcke mead’s dimeglio, abbreviated trudy handiest. richaud d’ivoire restaged shaheed ardeen, kolodziej streed comedians denyse? dravecky leonid adhesives dean, convened demeanours daughtery devries! restricted doshi defect kneading diel: intrepidly succeed rashid madill.

A sonnet is a poetry form with a certain meter and rhyme. The meter (aka rhythmic stress) is iambic pentameter, which means there are five pairs of weak-strong stress in each line. There are 14 lines, and alternate pairs of lines rhyme with each other, in the pattern `AbAb CdCd EfEf GG`. Traditionally sonnets have a thematic form too, but that's harder to encode algorithmically so I skipped that part.

D UH

muhammed glenwood goulden nationhood, durante fullwood woodell isherwood: dimuro gooding’s lurid ravenswood, muhammad’s woodpile goodroe likelihood. decook demurely goodyear’s hollywood’s, honduras watwood fluoride waldenbooks; badour hondurans longwood neighborhoods, muhamed woodsmen’s debuhr buddenbrooks. midura durant footed arwood boord, amdura goodyear bullied fulford boord? damour boyhood endure suhud obscured, gourdine datura kerwood boord ensured! moored lueders detours woolard neighborhoods: muhammed’s detours murad footed good’s.

The Carnegie Mellon University Pronouncing Dictionary is an open-source machine-readable pronunciation dictionary for North American English, says their website. Words are spelled with phones (individual sounds like vowels and consonants) from a set of 39, and the vowels have stress indications. I used it to generate sonnets, each with a focal pair of phones.

D Z

designs editions frenzied exercised, defoe’s adviser’s hyde’s davanzo mead’s: lozada shoulders lloyds’ democratized, disclaims deserving ceredase accedes. pandora’s medgar’s towards thalidomide’s, withstands blonds border’s bethards teased deplores; renditions herbicides directives chides, demarzo deal’s dapuzzo underscores. agendas notepads doze surrounds arcades, commodities mcalexander digges? offends dubrovnik’s designation jades, medusas soundings sowards hundreds digs! d’souza gaydos fairgrounds equalized: pedroza downey’s wilds securitized.

The algorithm is stochastic. It fills in words starting from the end of the line (constrained by rhyme) and works back towards the start of the line. Only words containing both phones are allowed, and words are weighted by how well they fit the meter. Lines that deviate too far from the meter are rejected and retried; this means that sometimes it can take a long time to emit a complete poem. Some pairs of phones seem to have no poems possible at all; these pairs are forbidden when choosing phones.
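The metrical weighting could be scored something like this; a hypothetical sketch, not the project's actual code (`iambicWeight` and its conventions are my invention; stress digits follow CMUdict, where 0 is unstressed and 1 or 2 is stressed):

```haskell
-- Score a word by how many of its syllable stresses agree with
-- iambic pentameter, counting syllables from the end of the line.
-- Position 0 (the last syllable of the line) is strong, so even
-- positions from the end are strong, odd positions are weak.
iambicWeight :: Int -> [Int] -> Int
iambicWeight posFromEnd stresses =
  length (filter match (zip [posFromEnd ..] (reverse stresses)))
  where
    match (pos, s) = even pos == (s /= 0)
```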

IY UH

arcuri surely juries bulkeley, qureshi pussycat tellurium: curie securely bookie woodmansee, purina pulleys bully pulliam. duryea woodie bogie livelihoods, manchuria mercuri likelihood; furini youri heathwood hollywood’s, bijur duryea beechwood hollywood. duryea reinsure centurion, venturi pussyfoot missourians? curie create-a-book venturian, mcgourty storybooks missourians! missouri woodby beechwood whoopdedoo: assemblywoman pussy whoopdedoo.

I emit the poems as LaTeX source code, for typesetting into a zine of 8 pages (6 poems as all pairs from 4 phones) or a book of 68 pages (66 poems as all pairs from 12 phones). Page counts are sometimes overshot because some poems have long words that cause line wrapping and page breaks. I set up a couple of cron jobs for these, publishing the PDFs in podcast RSS feeds with HTML index too.

IY Z

zalesky knees gene’s airfields refsnes g.s, mulroney’s orleans jimmy’s finkelstein’s: achieves asean freas celebrity’s, removes repositories jees vaccines. jarriel’s viens evens legionaries skis, monzingo hargreaves brienza cassidy’s; falzone rustys cynthia’s degrees, louise pianos quito’s speakes’s ease. ariza antibodies shear’s mczeal, henriques entries manganese louisette? mackenzie licensees toshiba’s zele, analyses mcnees fifteens louisette! livan’s concedes farinas meehans z.’s: kathleen’s lamine’s bikinis carberry’s.

Then I added a plain text output mode, and used Festival Lite (flite) to do text-to-speech, and DarkIce to do internet streaming radio, with ALSA loopback device connecting the two (things like JACK and PipeWire are too complex for me to configure on my headless Raspberry Pi). Thanks to Lurk for the IceCast streaming server. Flite has a few voices to choose from, so I made each successive poem be read by a different one. The final touch was using curl to set track title metadata.

UH Z

missouri’s goodyear’s murals fuller’s luehrs, curators busch’s tour’s moore’s bookmobiles: hondurans football’s tour’s chargeurs matures, nomura’s kuras wildwoods bookmobiles. durations voorhies bureau’s hoods endures, matures bookcellars fuller’s rookies full’s; muhammad’s bushings butchers jurors’ tour’s, bulldoze azura smoshes bush’s pulls. huzzah missouri’s goody’s cottonwoods, shultz’s curators tours brochures huzzah? secures erzurum voorhees moorland’s woods, curator’s lures missourian huzzah! durazo woolsey luhrs securities’: muhammed’s winwood’s wools securities.

You can check out the project and its source code here: mathr.co.uk/phones-at-home.

(Actually 0.8.0.1, because Hackage complained about 0.8 in a way that `cabal check` didn't warn about before uploading.)

A new release of **mandulia** (a zooming visualization
of the Mandelbrot Set as many Julia Sets, with Lua scripting support)
is installable from
Hackage:

```
cabal update
cabal install mandulia
```

After you have installed it, you can try it out by running:

```
mandulia main
```

and hit the 'a' key to enter attract mode or use the key controls listed in the README for interactive navigation. 'F11' toggles full screen mode, and 'ESC' quits.

mandulia-0.8.0.1 has no new features; the only changes are to make it work with the current state of Hackage 12 years later:

- base-4.6 has modifyIORef'
- containers-0.5 has Data.Map.Strict
- hslua-0.4 and above changed API, so restrict to older versions

Tested and working with the latest stable versions of GHC from ghc-8.0.2 up; only ghc-9.4.2 needs `--allow-newer=text` because OpenGLRaw has an outdated dependency. Older versions of GHC may work, but I haven't tried. I have freeglut3-dev installed on Debian Bookworm (current testing distribution), but I don't know what other system dependencies are needed.

Source code is GPLv3+ licensed, Git repository at mandulia on code.mathr.co.uk, or you can download mandulia-0.8.0.1.tar.gz.

Zoom videos are a genre of 2D fractal animation. The rendering of the final video can be accelerated by computing exponentially spaced rings around the zoom center, before reprojecting to a sequence of flat images.

Some fractal software supports rendering EXR keyframes in exponential map form, which *zoomasm* can assemble into a zoom video. zoomasm works from EXR, including raw iteration data, and colouring algorithms can be written in OpenGL shader source code fragments.

Exciting things are planned for version 4, but *zoomasm 3.1 "d'acord"* is a maintenance release, once again compatible with the latest miniaudio at the time of writing, and with some small bonus bug fixes.

Full release notes and downloads at mathr.co.uk/zoomasm

Back in 2019 I experimented a bit with Haar wavelets for audio analysis and resynthesis. The main idea was splitting audio into octaves, calculating energy per octave, then doing the same for each octave over a longer time period, calculating energy per octave of rhythm frequencies for each octave of audible frequencies. I intended to use these rhythm fingerprints for discriminating (hence the name for the project, "Disco") between speech and music, but didn't get very far at the time (more recently I had success distinguishing between genres of music, using the rhythm fingerprints as input to an artificial neural network).

I made a small graphical interface for drawing rhythm fingerprints, called "disco/designer", with a web version, and played a gig or two using it as an instrument, but the main problem was that it was very high latency: it took so long between drawing a shape and the sound changing that it was neither easy nor fun to play.

Meanwhile, in 2021 I made Deep Disco, which ported something similar to the rhythm fingerprint noise resynthesis to the classic Amiga A500: with a 7 MHz CPU there aren't many cycles to spare, so the code has to be tight and optimized (I traded space for speed by using many lookup tables). This version didn't use Haar wavelets; instead it used a variant of the Voss-Clarke-Gardner-McCartney pink noise generation algorithm. The pink noise algorithm runs in realtime, overlaying many noise streams at different octaves, picking the octave to change by counting the number of trailing zeros of the audio sample index.
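The trailing-zeros idea can be sketched in a few lines of Haskell (an illustration of the algorithm, not the Amiga assembly; `whites` is assumed to be a pre-generated stream of white noise samples):

```haskell
import Data.Bits (countTrailingZeros)

-- Voss-McCartney-style pink noise: keep one white noise value per
-- octave; at sample index i, refresh only the row numbered by the
-- count of trailing zeros of i, then output the sum of all rows.
pink :: Int -> [Double] -> [Double]
pink octaves whites = go (replicate octaves 0) (zip [1 :: Int ..] whites)
  where
    go _    []             = []
    go rows ((i, w) : iws) = sum rows' : go rows' iws
      where
        k     = min (countTrailingZeros i) (octaves - 1)
        rows' = take k rows ++ [w] ++ drop (k + 1) rows
```

Row \(k\) changes every \(2^k\) samples, giving the roughly \(1/f\) power spectrum of pink noise.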

Today I ported parts of the Deep Disco implementation back to disco/designer, allowing much lower-latency response in the sound from drawing in the graphical user interface, which makes playing with it much more fun. The texture of the sound is different (due to the different synthesis method), but some aspects of the 2019 version were bad apart from the latency (weird volume fades and jumps), and the 2022 version is responsive and usable as an instrument. I kept both web versions online at different URLs so you can compare yourself, see the Disco homepage for links and source code.

15 years on from my initial prototype, **unarchive-1.0** is released. It's a bash script for downloading items and collections from the Internet Archive. I use it to maintain a local mirror of my favourite netlabels.

See the website at mathr.co.uk/unarchive for more details, including downloads, a link to the development repository, and an HTML version of the manual page.

There's a lot of buzz around things like DALL-E image generation, GitHub Copilot source code generation, and so on. I'm worried that they don't attribute their sources properly, which might lead to unintentional plagiarism (I call AI-assisted plagiarism plAIgiarism). I spent some time over the last month thinking about how to improve the situation, the main idea is that the machine learning model should keep track of what it is learning from in the training stage, so that it can attribute its influences properly in the generation stage.

The first machine learning model I tried was Markov Chains for text generation, where the probabilities of the next output (character or word) depend on the recent history (previous characters or words). A Markov Chain can be generated from a text corpus, by counting the occurrences of groups of characters or words. I used characters (in Haskell) or bytes (in Lua), which are not quite the same (some characters take up multiple bytes in the common UTF-8 encoding).

To add attribution to the basic Markov Chain implementation, I added an extra histogram of how often each source file in the corpus contained each group of characters. Then when emitting characters, I include this data. I wrote a bit of Javascript that collates the embedded histograms over the selected part of the output HTML, so you can explore the generated text's attribution in the browser.
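The idea could be sketched like this; an assumed representation for illustration (the `Model` type, `train`, and the file names are my invention, not the project's actual code):

```haskell
import qualified Data.Map.Strict as M
import Data.List (tails)

-- An order-n model maps each n-character context to a histogram of
-- next characters, plus a histogram of which source files that
-- group of characters was counted in, for attribution.
type Model = M.Map String (M.Map Char Int, M.Map FilePath Int)

-- Fold one source file's text into the model.
train :: Int -> FilePath -> String -> Model -> Model
train n src text model = foldl step model grams
  where
    grams = [ (ctx, next)
            | gram <- tails text
            , length gram > n
            , let ctx  = take n gram
            , let next = gram !! n ]
    step m (ctx, next) =
      M.insertWith merge ctx (M.singleton next 1, M.singleton src 1) m
    merge (c1, s1) (c2, s2) =
      (M.unionWith (+) c1 c2, M.unionWith (+) s1 s2)
```

At generation time the next-character histogram drives sampling as usual, and the source histogram for the chosen context is what gets embedded in the output.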

Then I tried an artificial neural network for a classification task. After some false starts, I settled on classifying music by genre, using the rhythm fingerprints from my disco project. As attribution I decided to use releasing label, mainly because attributing individual artists might use too many resources (mainly memory), let alone individual tracks.

Adding attribution to the neural network turned out to be quite easy: in the weight update step in training where `w += dw`, I now also do `dwda += outer(dw, a)`, where `a` is a one-hot attribution vector. Then in the classification (feed forward) step where `O = w @ I`, I do `dOda = w @ dIda + I @ dwda`. At the end I process the final `dOda` to give an indication of whether a particular source was more significant than expected in generating the output classification.

The attributive Markov Chain was much more satisfying, because of the way it worked as a generator - you could see snippets taken verbatim from particular sources, annotated with their attribution. The genre classifier is much less immediate - most of the attributions differ by tiny amounts, the fun part (is the classification more or less accurate than the genre in the track metadata?) is unrelated to attribution.

Project page with source code repository and example output: mathr.co.uk/attributive-machine-learning

I wrote a short poem about learning physics:

multiverse poetry

now don't have a go at me

I know what you're thinking

I haven't been drinking

but now I'm inspired

by many world theory

and quantum mechanics:

did Schrödinger's cat die

and live, both somehow mixed?

I don't know, probably

I don't understand it

yet so I'll study up

and report back later.

Written a couple of weeks ago while travelling back from Live Performers Meeting in Rome (where mathr&netz played Ommatidia). The other passengers may have been bemused by me counting on my fingers and scribbling in a notebook.

In my previous post I had a problem with some simple Haskell code that exploded in memory. This morning I worked out why, and how to fix it, using the technique of difference lists. The problematic code was:

```haskell
import Control.Monad (replicateM)

shapes bias p =
  [ s
  | m <- [0 .. p]
  , s <- replicateM (fromInteger m + 2) [1 .. p]
  , p == sum (zipWith (*) (map (bias +) s) (tail s))
  ]

main = mapM_ (print . length . shapes 0) [1..10]
```

To see how memory explodes, here's my session:

```
$ ghc -O2 replicateM.hs -rtsopts
...
$ timeout 60 ./replicateM +RTS -h
1
3
5
10
14
27
37
$ hp2pretty replicateM.hp
```

and here is the SVG graph output by hp2pretty:

The memory is going up dramatically, and while the program ran for a whole minute wall-clock, subtracting the garbage collection time gives less than 17 seconds of useful work.

I had a hunch that the problem is `replicateM` (the rest of the code is so simple that it almost obviously can't be anything else). I'm using it for lists, where it computes the Cartesian product (all possible combinations of one item from each of cnt0 copies of a list f). Let's look at the definition in the source code:

```haskell
-- base-4.16.1.0:Control.Monad.replicateM
replicateM cnt0 f = loop cnt0
  where
    loop cnt
      | cnt <= 0  = pure []
      | otherwise = liftA2 (:) f (loop (cnt - 1))
```

To intuitively understand liftA2, one can desugar it to do-notation, which changes the type unless `{-# LANGUAGE ApplicativeDo #-}` is enabled:

```haskell
replicateM cnt0 f = loop cnt0
  where
    loop cnt
      | cnt <= 0  = pure []
      | otherwise = do
          x <- f
          xs <- loop (cnt - 1)
          pure (x : xs)
```

but to see where the leak is coming from we can evaluate the critical part of the original version, specifically for the `[String]` type:
```
> result = words "hello world how are you"
> mapM_ putStrLn $ liftA2 (:) "¿¡" result
¿hello
¿world
¿how
¿are
¿you
¡hello
¡world
¡how
¡are
¡you
```

Immediately one can see that it needs all of `result` in order to print it once with a prefix, but then it needs it all again to print it with the second prefix. So because `loop (cnt - 1)` is passed as an argument to `liftA2`, its value will be shared in the same way as the `result` list of strings was. And sharing in this way forces the value to be kept in memory, and it can only start to be freed when the last prefix has started being emitted.

The problem in `replicateM` is compounded, because the same happens in each recursive call to `loop`, although the retained values get smaller deeper down so it's not such a huge deal. The problem is at the top of the recursion, where it has to store a list of length `(length f)^(n-1)`, which can get pretty large.

So that's the issue; how to fix it? Difference lists traditionally work by turning appends like `((x ++ y) ++ z) ++ w`, which is expensive because the cost of (++) is the length of its left hand side, and left nesting means the left hand side gets longer and longer, into something like `(((prepend x . prepend y) . prepend z) . prepend w) empty`, which is much more efficient because the cost of function composition (.) is constant and prepend costs the length of its argument. In this example the cost of the first is X + X + Y + X + Y + Z = 3X + 2Y + Z, while the second is X + Y + Z + W (and the + W could be avoided by using w at the end instead of prepending w to empty).
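The representation itself is tiny; a minimal sketch of the idea:

```haskell
-- A difference list represents a list as a function that prepends
-- it, so append becomes function composition.
type DList a = [a] -> [a]

prepend :: [a] -> DList a
prepend = (++)

toList :: DList a -> [a]
toList dl = dl []
```

With this, `toList (prepend x . prepend y . prepend z)` builds `x ++ y ++ z` touching each element only once, however the compositions are nested.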

Here's what I came up with. I'm not 100% sure why it works better, and I'm not sure if it is even correct for anything apart from m = []; my first attempt output items reversed, and maybe it has similarities to foldl' vs foldr (in terms of an accumulator) as well as using difference lists, but anyway:

```haskell
{-# LANGUAGE ApplicativeDo #-}

replicateM' :: Applicative m => Int -> m a -> m [a]
replicateM' n ls = go n (pure id)
  where
    go n fs
      | n <= 0 = do
          f <- fs
          pure (f [])
      | otherwise = go (n - 1) gs
      where
        gs = do
          f <- fs
          l <- ls
          pure (f . (l:))
```

Here's the heap profile graph for the same program, just using this `replicateM'` to replace `replicateM`:

What a difference! Tiny constant memory, as it should be, and almost none of the time is spent garbage collecting.

On math.stackexchange.com, user Vincent asked a question that caught my attention:

I wanted to test different Networks with the same number of parameters but with different depths and widths.

A good introduction is in Neural Networks and Deep Learning, but in short: a neural network is essentially a chain of matrix multiplies with non-linear "activation functions" in between (which don't change the length of the vector of data passing through). Matrices are 2D grids of numbers. Matrices can only be multiplied together if the number of columns of the first matrix is equal to the number of rows of the second matrix, so you can express the "shape" of the neural network by a vector of positive integers, where each pair of neighbouring values corresponds to the dimensions of one of the matrices. The total number of parameters of the neural network is the sum of the products of each such pair; for example the shape [3,7,5,1] has size 3×7 + 7×5 + 5×1 = 61, which in the programming language Haskell can be written `size = sum (zipWith (*) shape (tail shape))`.

The question is essentially asking that, given the size, to construct some different shapes with that size, in particular shapes of different length (corresponding to different depths of network). But to see how the problem looks, I decided to generate all possible shapes for a given size, starting with some hopelessly naive Haskell code:

```haskell
import Control.Monad (replicateM)

shapes p =
  [ s
  | m <- [0..p]
  , s <- replicateM (m + 2) [1..p]
  , p == sum (zipWith (*) s (tail s))
  ]

main = mapM_ (print . length . shapes) [1..]
```

This code does "work", but as soon as the size p gets large, it takes forever and runs out of memory. On my desktop which has 32GB of RAM, I can only print 8 terms before OOM, which takes about 6m37s. These terms are:

1, 3, 5, 10, 14, 27, 37, 65

So a smarter solution is needed. I decided to implement it in C, because it's easier to do mutation there than in Haskell. The core of the algorithm is the same as the Haskell above, but with one important addition: pruning. If the sum exceeds p before the end of the shape vector is reached, it doesn't make any difference what the suffix is: because all the dimensions are positive, the sum can never get smaller again.

I loop through depths (length of shape) up to the maximum depth, as in the Haskell, and for each depth I start with a shape of all 1s, with the last element 0 as an exception (it will be incremented before use). Each iteration of the loop adds 1 to the last element of the shape; if it gets bigger than the target p, I set it back to 1 and propagate a carry of 1 to the previous element (and so on). If the carry propagates beyond the first element, that means we've searched the whole shape space and we exit the loop.

Pruning is implemented by accumulating the sum from the left. If the sum of the first k products exceeds the target, I set the whole shape vector from the (k+1)th index onwards to the target, so that at the next iteration of the loop the last element is incremented, they all wrap around to 1, and the kth item is incremented by 1. If the sum of all the products equals the target p, I increment a counter (which is output at the end). I verified that the first 8 terms output with pruning match the 8 terms output by my Haskell (without pruning), which is not a rigorous proof that the pruning is valid, but does increase confidence somewhat.

Because the C algorithm uses much less space than the Haskell (I do not know why the Haskell is so bad), and is much more efficient (due to pruning), it's possible to calculate many more terms. So much so that issues of numeric overflow come into play. Using unsigned 8-bit types for numbers allows only 11 terms to be calculated, because the counter overflows (term 12 is 384 > 2^8-1 = 255). The terms increase rapidly, so I decided to use a 64-bit unsigned counter, which should be enough for the foreseeable future (and just in case, I do check for overflow, where the counter would wrap above 2^64-1, and report the error).

For the other values like the shape dimensions I used the C preprocessor, with a macro passed in at compile time to choose the number of bits used, and check each numeric operation for overflow. For example, with 8 bits, trying to calculate the count for p = 128 fails almost immediately, because the product 128 × 2 = 256 > 2^8-1. Overflow checking is coming in the C23 standard library, but for older compilers there are `__builtin_add_overflow(a,b,resultp)` and `__builtin_mul_overflow(a,b,resultp)`, which do the job in the version of gcc that I have.

However, even with all these optimisations it's still really slow, because it takes time at least on the order of the output count (the counter is only incremented by 1 each time), and the output count grows rapidly. It took around 2 hours to calculate the first 45 terms. Just by counting lines as the number of digits increases, I could see that increasing the size by 5 multiplies the count by about 10, so the asymptotics are about O(10^{p/5}). Here's a plot:

Two hours to calculate 45 terms is terrible, and I don't really need to calculate the actual shapes if I'm only interested in how many there are. So I started from scratch: how does the count change when you combine shapes? To work this out I scribbled some diagrams, at first trying to combine two arbitrary shapes end-to-end, but that ended in failure. Success came when I considered the basic shape of length 2 (a single matrix) and what happens when appending an item; then I made this work in reverse. Switching back to Haskell, because it has good support for (unbounded) Integer and good memoization libraries, I came up with this:

```haskell
import Data.MemoTrie -- package MemoTrie on Hackage

-- count shapes of length n with p parameters
-- that start with a and end with y
count = mup memo3 $ \n p a y -> case () of
  _ | n <= 1 || p <= 0 || a <= 0 || y <= 0 -> 0 -- too small
    | n == 2 && p /= a * y -> 0 -- singleton matrix mismatch
    | n == 2 && p == a * y -> 1 -- singleton matrix matches
    | otherwise -> -- take element y off the end leaving new end x
        sum [ count (n - 1) q a x
            | x <- [1 .. p]
            , let q = p - y * x
            , q > 0
            ]

total p = sum
  [ count n p a y
  | n <- [2 .. p + 2], a <- [1 .. p], y <- [1 .. p] ]
```

This works so much faster that 45 terms take about 10 seconds using about 0.85 GB of RAM (and the results output are the same). Calculating just the 100th term (which is 28457095794860418935) took about 220 seconds using about 8.8 GB of RAM, but if you calculate terms in sequence the memoization means values calculated earlier can be reused, speeding the whole thing up: calculating the 101st term (which is 44259654087259419852) as well as the 100th in one run of the program took about 260 seconds using about 16.4 GB of RAM. Calculating the first 100 terms in one run took about 390 seconds using about 20 GB of RAM.

A long-winded digression about fitting and residuals follows, without the images that would make it comprehensible: gnuplot crashed before I could save them, losing its history in the process...

Using gnuplot I fit a straight line to the (natural) logarithm of the data points, which matched up pretty well, provided I skip the first few terms (I'm only really interested in the asymptotics for large x, so I think that's a perfectly reasonable thing to do):

```
gnuplot> fit [20:100] a*x+b "data.txt" using 0:(log($1)) via a, b
...
Final set of parameters            Asymptotic Standard Error
=======================            ==========================
a               = 0.441677         +/- 1.781e-06    (0.0004033%)
b               = 1.06893          +/- 0.0001137    (0.01064%)
...
gnuplot> print a
0.441677054485047
gnuplot> print b
1.0689289118356
```

Investigating the residuals showed an interesting pattern: they oscillate around zero, getting smaller rapidly, until about x = 35, after which they're all smaller than 0.000045. But then they are positive for a while, gradually decreasing to a minimum at about x = 73, where the sign changes and they start increasing again. I thought that with a better fitting curve the oscillations would continue getting smaller, but I'm not sure how much more data I would need. With the upper fit range limit set to 100, the asymptotic standard errors decrease by about a factor of 10 each time I raise the lower fit range limit by 10. With the lower fit range limit set to 90, the earlier residuals are similar to those with the wider fit range, and the later residuals continue to oscillate, getting smaller exponentially (their magnitudes form a straight line on a graph with a logarithmic scale).

```
gnuplot> fit [90:100] a*x+b "data.txt" using 0:(log($1)) via a, b
...
Final set of parameters            Asymptotic Standard Error
=======================            ==========================
a               = 0.441676         +/- 4.801e-12    (1.087e-09%)
b               = 1.06901          +/- 4.539e-10    (4.246e-08%)
...
gnuplot> print a
0.441675976841257
gnuplot> print b
1.0690075089073
```

Fitting a curve to the residuals seemed to improve things a bit, the final form of the function I came up with was something like:

\[ \exp(a p + b + (-1)^p \exp(c p + d)) \]

with \(a\) and \(b\) as above and

```
c = -0.246716071616843
d = -1.18363693943899
```

The residuals of that against the data show no simple pattern: perhaps some low frequency oscillation at the left end? Not sure. I still don't really know enough statistics to analyze this kind of thing.

Moving back to the original problem: neural networks often have a bias term, an extra element that is always 1, appended to the input vector of each stage but not the output. This means that, for example, the matrices in the shape [a,b,c] would have dimensions (a+1)×b and (b+1)×c, instead of a×b and b×c without bias.

In Haskell one would calculate `sum (zipWith (*) (map (+ 1) s) (tail s))`. Even though the naive Haskell is slow, I ran it to get the first few terms for validation purposes. I added bias to the C version and got some more terms, and finally updated the fast Haskell version. Here are the first few terms as calculated by the naive Haskell before it ran out of memory:

0, 1, 1, 3, 2, 6, 5, 9

which match those calculated by the C. The C isn't terribly slow here, at least to start with, because the output grows much less rapidly, more like O(10^{p/9}):

The fast (memoizing) Haskell took about 420 seconds to calculate the first 100 terms using about 19.7 GB of RAM. It took about 17 seconds to calculate the first 50 terms using about 0.82 GB of RAM. The C took about 15 seconds (slightly faster) to calculate the first 50 terms, using less than about 1.4 MB of RAM (almost nothing in comparison). The 50th term is 538759 (both programs match), the 100th term is 240740950572. The C's runtime grows at least as fast as the output count, which is exponential in the input, so I didn't wait to see how long it would take to calculate 100 terms.

Here you can download the code for these experiments, and the output data tables:

- Makefile
- exhaustive.hs slow Haskell that exhausts RAM
- pruning.c slow C that needs almost no RAM
- memoizing.hs fast Haskell that needs lots of RAM
- unbiased.txt 100 terms starting 1, 3, 5, 10, ... (bias = 0)
- biased.txt 100 terms starting 0, 1, 1, 3, ... (bias = 1)
- plots.gnuplot source code for plots
- unbiased.png logarithmic plot (100 terms, bias = 0)
- biased.png logarithmic plot (100 terms, bias = 1)