mathr / blog / #

Fixed-point numerics

Computerized representations of numbers come in a few different flavours. The most common is an int, which is typically 4 bytes long, or 32 bits. A bit is either 0 or 1, and a byte is the smallest unit addressable by the CPU and consists of 8 bits, able to store 28 or 256 distinct values. An unsigned int stores numbers from 0 to 232-1, and a signed int uses twos-complement to represent negative numbers: the most significant bit is 1 for negative numbers, and sign extension means it really represents an infinite stream of 1s extending to the left. You can find the twos-complement of a number by inverting all the bits and then add 1 to the whole number.

Computers also have float numbers, which have a floating point, like scientific notation for numbers. The number of significant digits is fixed, but you can represent very large numbers and very small numbers by combining the mantissa digits with an exponent, which moves the point separating the fractional part. I'll talk more about this in a future post. Floating point typically uses a magnitude-and-sign representation instead of twos-complement.

Fixed-point numbers are somewhere in between, they can represent numbers with fractional parts, but the number of digits after the point is fixed, so very small numbers cannot be represented as accurately as numbers with magnitude near 1. For some purposes this is just fine, as the numbers you are concerned with always have a magnitude near 1. For example, when calculating the Mandelbrot set, each pixel's c value is between -2-2i and +2+2i, and the interesting parts have |c|>0.25 too. The escape time algorithm iterates z→z²+c starting from 0 until |z|>2 (or you give up after an arbitrary iteration count limit) which means most the z values have magnitude near 1. This makes high precision fixed-point a good fit: you need more digits when zooming in, otherwise you can't represent neighbouring pixels' c values, but you don't need the more-complex algorithms of floating-point numerics because you don't need to represent very small or very large numbers.

So how do you implement fixed-point numerics on a computer? I used unsigned int limbs, which are essentially digits in base 232, and stored numbers as an array of limbs and two numbers counting the number of limbs before the point and the number of limbs after the point. My code is quite general, with mixed precision operations, but this aspect is not so well tested as the Mandelbrot set calculations only need 1 limb before the point. The most primitive operations are reading a limb from the vector, which does sign extension in twos-complement for limbs larger than the stored array, and pads with 0 for smaller limbs; and writing a limb to a vector, which has some assertions to crash fast instead of corrupting memory with an out of bounds write.

-a is implemented by complementing all the limbs and adding 1 in the least significant place, propagating carries along towards the more significant limbs. Carries can be calculated by comparing the result of addition with the addends, if the result is smaller then it overflowed (unsigned int arithmetic is modulo 232) and you need to add 1 to the next limb.

a<<n and a>>n can be calculated by reading two adjacent limbs into a double-wide unsigned long long int (there is disagreement over whether long is 32bit or 64bit on different systems, long long is 64bit almost everywhere I think) and shift the correct 32bit part out of it while looping over the arrays, being careful not to clobber the array you are reading from in case it is aliased to the array you are writing to (it is nice to be able to shift in-place to save memory).

a+b is implemented by zipping from least significant limb to most significant, propagating carries along and adding them to the next pair of limbs. a-b is implemented similarly, only propagating borrows to subtract from the next pair. Borrow is calculated by the subtracted limb being larger than the limb from which it is subtracted. OpenGL 4 / GLSL 400 and above have uaddCarry() and usubBorrow() functions which make this easier to implement, but I wrote my implementation in OpenCL / C / C++ which afaik doesn't have these.

a2 is implemented as an optimization of multiplication taking advantage of symmetry to optimize duplicated work like x*y+y*x as (x*y)<<1 which may make a small difference. Multiplication a*b is the hardest algorithm to implement (of these so far, I didn't try division yet). The naive way of doing it is to multiply every limb in a by every limb in b and collect up the results, so that is what I did as a first pass for simplicity (more efficient algorithms like Karatsuba multiplication may be interesting to investigate later). It differs from integer multiplication as found in libgmp (for example) because the least significant limbs of the result can often be discarded or not computed, at least if we don't care about correct rounding (all the operations I implemented simply truncate towards negative infinity). I made a diagram (click for bigger) showing how I implemented it (actually a version of this diagram came before I started writing any code at all):

fixed-point multiplication circuit

The input numbers are fed in from the right hand side and the bottom, into limb-by-limb multiplication (32×32→64bit) which outputs the most significant and least significant limbs of the result to two wide accumulator busses that flow diagonally down and left. At the end the accumulators are combined into limbs by propagating carries upwards. The accumulators don't need to be super wide, I think they need O(log(number of limbs)) extra bits beyond the width of the limb, but 64bit is certainly plenty and is usually available.

One final thing to note, I tried benchmarking both big-endian and little-endian limb orderings, to see how that affected things. To my surprise big-endian came out a little faster, surprising because most operations operate over limbs from the little end first, but I think that was because I was only testing with 1 limb before the point, which potentially made an addition (statically known to be +0) be completely omitted by the optimizer. At zoom 1e90, which corresponds to about 10 limbs per number, simple escape time fixed-point computation of a Mandelbrot set location on my AMD RX 580 GPU was only about 12-13 times slower (total elapsed wall-clock time) than KF using advanced perturbation and series approximation algorithms on my AMD Ryzen 7 2700X CPU. I want to find a way to compare total power consumption, maybe I need to get a metering hardware plug or some such.

You can find my implementation here: git clone