## Fixed-point numerics

Computerized representations of numbers come in a few different flavours.
The most common is an `int`, which is typically 4 bytes long, or 32 bits.
A bit is either 0 or 1, and a byte, the smallest unit addressable by the
CPU, consists of 8 bits, able to store 2^8 or 256 distinct values. An
`unsigned int` stores numbers from `0` to `2^32 - 1`, and a `signed int`
uses two's complement to represent negative numbers: the most significant
bit is `1` for negative numbers, and *sign extension* means it really
represents an infinite stream of `1`s extending to the left. You can find
the two's complement of a number by inverting all the bits and then adding 1.
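For a single machine word, the rule looks like this (a standalone C sketch; `negate32` is an illustrative helper, not part of my fixed-point code):

```c
#include <stdint.h>

/* Two's complement negation of a single 32-bit word:
   invert all the bits, then add 1. */
uint32_t negate32(uint32_t x)
{
    return ~x + 1u;  /* same bit pattern as -x */
}
```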

Computers also have `float` numbers, which have a floating point, like
scientific notation. The number of significant digits is fixed, but you can
represent very large and very small numbers by combining the mantissa digits
with an exponent, which moves the point separating the fractional part. I'll
talk more about this in a future post. Floating point typically uses a
sign-and-magnitude representation instead of two's complement.

Fixed-point numbers are somewhere in between: they can represent numbers with fractional parts, but the number of digits after the point is fixed, so very small numbers cannot be represented as accurately as numbers with magnitude near 1. For some purposes this is just fine, as the numbers you are concerned with always have a magnitude near 1. For example, when calculating the Mandelbrot set, each pixel's c value is between -2-2i and +2+2i, and the interesting parts have |c| > 0.25 too. The escape time algorithm iterates z → z² + c starting from 0 until |z| > 2 (or you give up after an arbitrary iteration count limit), which means most of the z values have magnitude near 1. This makes high precision fixed-point a good fit: you need more digits when zooming in, otherwise you can't represent neighbouring pixels' c values, but you don't need the more-complex algorithms of floating-point numerics because you don't need to represent very small or very large numbers.

So how do you implement fixed-point numerics on a computer? I used
`unsigned int` limbs, which are essentially digits in base 2^32, and
stored numbers as an array of limbs plus two counts: the number of limbs
before the point and the number of limbs after the point. My code is quite
general, with mixed precision operations, but this aspect is not so well
tested as the Mandelbrot set calculations only need 1 limb before the point.
The most primitive operations are reading a limb from the vector, which does
two's complement sign extension for limbs above the stored array and pads
with 0 for limbs below it; and writing a limb to a vector, which has some
assertions to crash fast instead of corrupting memory with an out-of-bounds
write.
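A sketch of the reading primitive, assuming little-endian limb order (`limbs[0]` least significant); the struct and names here are illustrative, not the actual code:

```c
#include <stdint.h>

/* A number as a limb array; count is the number of stored limbs. */
typedef struct { uint32_t *limbs; int count; } fixed;

/* Read limb i: pad with 0 below the stored range, sign-extend
   (two's complement) above it. */
uint32_t get_limb(const fixed *x, int i)
{
    if (i < 0)
        return 0;  /* less significant than stored: pad with 0 */
    if (i >= x->count)  /* more significant than stored: sign-extend */
        return (x->limbs[x->count - 1] >> 31) ? 0xFFFFFFFFu : 0u;
    return x->limbs[i];
}
```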

**-a** is implemented by complementing all the limbs and adding 1
in the least significant place, propagating carries towards the more
significant limbs. Carries can be calculated by comparing the result of an
addition with its addends: if the result is smaller then it overflowed
(`unsigned int` arithmetic is modulo 2^32) and you need to add 1 to the
next limb.
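A sketch of negating a whole limb array in place, using the comparison trick for the carry (names are illustrative, little-endian limb order assumed):

```c
#include <stdint.h>

/* Negate in place: complement each limb, add 1 at the least
   significant limb, and propagate the carry upward.  With
   modulo-2^32 arithmetic, a sum smaller than an addend means
   it wrapped, so the carry is a simple comparison. */
void negate(uint32_t *limbs, int count)  /* limbs[0] least significant */
{
    uint32_t carry = 1;
    for (int i = 0; i < count; ++i)
    {
        uint32_t t = ~limbs[i] + carry;
        carry = t < carry;  /* 1 iff the addition overflowed */
        limbs[i] = t;
    }
}
```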

**a<<n** and **a>>n** can be calculated
by reading two adjacent limbs into a double-wide `unsigned long long int`
(there is disagreement over whether `long` is 32-bit or 64-bit on
different systems; `long long` is 64-bit almost everywhere, I think) and
shifting the correct 32-bit part out of it while looping over the arrays,
being careful not to clobber the array you are reading from in case it is
aliased to the array you are writing to (it is nice to be able to shift
in-place to save memory).
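A right-shift sketch for 0 < n < 32, reading two adjacent limbs into a 64-bit temporary. Looping from the little end keeps in-place use safe, because each limb is read before any limb below it is overwritten (illustrative names, little-endian limb order assumed):

```c
#include <stdint.h>

/* Arithmetic right shift of a limb array by n bits (0 < n < 32),
   sign-extending at the top in two's complement. */
void shr(uint32_t *limbs, int count, int n)  /* limbs[0] least significant */
{
    uint32_t sign = (limbs[count - 1] >> 31) ? 0xFFFFFFFFu : 0u;
    for (int i = 0; i < count; ++i)
    {
        /* read the limb above (or the sign extension) and this limb
           into one 64-bit word, then shift the window we need out */
        uint64_t hi = (i + 1 < count) ? limbs[i + 1] : sign;
        uint64_t wide = (hi << 32) | limbs[i];
        limbs[i] = (uint32_t)(wide >> n);
    }
}
```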

**a+b** is implemented by zipping from the least significant limb to
the most significant, propagating carries along and adding them to the next
pair of limbs. **a-b** is implemented similarly, only propagating borrows
to subtract from the next pair. A borrow occurs when the limb being
subtracted is larger than the limb it is subtracted from. OpenGL 4 / GLSL 400
and above have `uaddCarry()` and `usubBorrow()` functions which make
this easier to implement, but I wrote my implementation in OpenCL / C / C++,
which afaik don't have these.
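An addition sketch using the comparison trick in place of `uaddCarry()` (illustrative; same-precision operands, little-endian limb order):

```c
#include <stdint.h>

/* a += b, limb by limb from the little end.  A modulo-2^32 sum
   smaller than an addend means it wrapped, which gives the carry. */
void add(uint32_t *a, const uint32_t *b, int count)
{
    uint32_t carry = 0;
    for (int i = 0; i < count; ++i)
    {
        uint32_t s = a[i] + b[i];
        uint32_t c = s < a[i];  /* carry out of the limb addition */
        s += carry;
        c |= s < carry;         /* carry out of adding the old carry */
        a[i] = s;
        carry = c;
    }
}
```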

**a²** is implemented as an optimization of multiplication,
taking advantage of symmetry to optimize duplicated work like
`x*y + y*x` as `(x*y)<<1`, which may make a small difference. Multiplication
**a*b** is the hardest algorithm to implement (of these so far; I didn't try
division yet). The naive way of doing it is to multiply every limb in `a`
by every limb in `b` and collect up the results, so that is what I did as
a first pass for simplicity (more efficient algorithms like Karatsuba
multiplication may be interesting to investigate later). It differs from
integer multiplication as found in libgmp (for example) because the least
significant limbs of the result can often be discarded or not computed, at
least if we don't care about correct rounding (all the operations I
implemented simply truncate towards negative infinity).

I made a diagram (click for bigger) showing how I implemented it (actually
a version of this diagram came before I started writing any code at all).
The input numbers are fed in from the right hand side and the bottom, into
limb-by-limb multiplication (32×32→64-bit) which outputs the most significant
and least significant limbs of the result to two wide accumulator busses that
flow diagonally down and left. At the end the accumulators are combined into
limbs by propagating carries upwards. The accumulators don't need to be super
wide, I think they need O(log(number of limbs)) extra bits beyond the width
of the limb, but 64-bit is certainly plenty and is usually available.
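A much-simplified version of the idea, for unsigned numbers with 1 limb before the point and a compile-time fractional width: each 32×32→64-bit partial product is split into its two halves and accumulated into 64-bit columns, then the columns are combined by propagating carries and only the middle window of the full product is kept (a sketch, not the diagram's exact dataflow):

```c
#include <stdint.h>

#define LIMBS 3          /* 1 integer limb + FRAC fractional limbs */
#define FRAC  (LIMBS - 1)

/* r = a * b for unsigned fixed-point numbers, little-endian limb
   order, value = limbs / 2^(32*FRAC).  The bottom FRAC limbs of
   the full 2*LIMBS-limb product are dropped (truncation). */
void mul(uint32_t *r, const uint32_t *a, const uint32_t *b)
{
    uint64_t col[2 * LIMBS] = {0};  /* column accumulators */
    for (int i = 0; i < LIMBS; ++i)
        for (int j = 0; j < LIMBS; ++j)
        {
            uint64_t p = (uint64_t)a[i] * b[j];
            col[i + j]     += p & 0xFFFFFFFFu;  /* least significant half */
            col[i + j + 1] += p >> 32;          /* most significant half */
        }
    uint64_t carry = 0;  /* combine columns into limbs, little end first */
    for (int k = 0; k < 2 * LIMBS; ++k)
    {
        carry += col[k];
        if (k >= FRAC && k < FRAC + LIMBS)
            r[k - FRAC] = (uint32_t)carry;  /* keep the middle window */
        carry >>= 32;
    }
}
```

With only a handful of limbs the 64-bit columns cannot overflow; for very high precision the O(log(number of limbs)) extra bits mentioned above become the binding constraint.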

One final thing to note: I tried benchmarking both big-endian and little-endian limb orderings, to see how that affected things. To my surprise, big-endian came out a little faster, even though most operations process limbs from the little end first; I think that was because I was only testing with 1 limb before the point, which potentially let an addition (statically known to be +0) be completely omitted by the optimizer. At zoom 1e90, which corresponds to about 10 limbs per number, simple escape time fixed-point computation of a Mandelbrot set location on my AMD RX 580 GPU was only about 12-13 times slower (total elapsed wall-clock time) than KF using advanced perturbation and series approximation algorithms on my AMD Ryzen 7 2700X CPU. I want to find a way to compare total power consumption; maybe I need to get a metering hardware plug or some such.

You can find my implementation here:
`git clone https://code.mathr.co.uk/fixed-point.git`