Fixed-point numerics revisited
A couple of years ago I wrote about fixed-point numerics. The multiplication presented there uses \(O(n^2)\) individual limb-by-limb multiplications to multiply numbers with \(n\) limbs, which isn't very efficient.
Multiplying two \(n\)-limb numbers gives a number with \(2n\) limbs. For simplicity, assume \(n\) is a power of \(2\), say \(n = 2^k\). For \(x\) with \(n\) limbs, write \(\operatorname{hi}(x)\) to mean the top half of \(x\) with \(n/2\) limbs, and \(\operatorname{lo}(x)\) to mean the bottom half of \(x\) with \(n/2\) limbs. Suppose the numbers represent fractions in \([0..1)\), that is, there are no limbs before the point and \(n\) limbs after. Full multiplication would give a fraction with \(2n\) limbs after the point, but maybe we don't need the extra precision and truncating to the first \(n\) limbs would be perfectly fine. Maybe we don't even care about errors in the last few bits.
Writing \(x \times y\) for the full multiplication that gives \(2n\) limbs, we want to compute \(\operatorname{hi}(x \times y)\) as efficiently as possible. To be precise, we want to minimize the number of limb-by-limb multiplications, because they are relatively expensive compared to additions.
As a baseline, here's a simple recursive implementation of full multiplication:
\[
\begin{array}{rcccc}
x \times y & = & \operatorname{hi}(x)\times\operatorname{hi}(y) & \cdots & \cdots \\
 & + & \cdots & \operatorname{hi}(x)\times\operatorname{lo}(y) & \cdots \\
 & + & \cdots & \operatorname{lo}(x)\times\operatorname{hi}(y) & \cdots \\
 & + & \cdots & \cdots & \operatorname{lo}(x)\times\operatorname{lo}(y)
\end{array}
\]
You can see that the cost of an \(n\)-limb multiply is \(4\) times the cost of an \(n/2\)-limb multiply, which means that the overall cost works out at \(n^2\) limb-by-limb multiplies.
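Here's what that recursion might look like in C (a minimal sketch of my own, not code from the original article), using little-endian arrays of 16-bit limbs, C99 variable-length arrays for scratch space, and assuming \(n\) is a power of two; the helper names full_multiply and add_into are mine:

```c
#include <stdint.h>
#include <stdio.h>

/* Add the m-limb number src into dst, propagating the final carry
   into higher limbs of dst.  The caller must guarantee the result
   fits, which holds everywhere below because the mathematical
   product fits in 2n limbs. */
static void add_into(uint16_t *dst, const uint16_t *src, int m)
{
    uint32_t carry = 0;
    for (int i = 0; i < m; ++i)
    {
        uint32_t t = (uint32_t) dst[i] + src[i] + carry;
        dst[i] = (uint16_t) t;
        carry = t >> 16;
    }
    for (int i = m; carry; ++i)
    {
        uint32_t t = (uint32_t) dst[i] + carry;
        dst[i] = (uint16_t) t;
        carry = t >> 16;
    }
}

/* out[0 .. 2n-1] = x[0 .. n-1] * y[0 .. n-1], limbs little-endian */
static void full_multiply(uint16_t *out, const uint16_t *x,
                          const uint16_t *y, int n)
{
    if (n == 1)
    {
        uint32_t t = (uint32_t) x[0] * y[0]; /* one limb-by-limb multiply */
        out[0] = (uint16_t) t;
        out[1] = (uint16_t) (t >> 16);
        return;
    }
    int h = n / 2;
    uint16_t tmp[n];                          /* scratch for cross products */
    full_multiply(out,     x,     y,     h);  /* lo(x) * lo(y) */
    full_multiply(out + n, x + h, y + h, h);  /* hi(x) * hi(y) */
    full_multiply(tmp,     x,     y + h, h);  /* lo(x) * hi(y) */
    add_into(out + h, tmp, n);                /* added half-shifted */
    full_multiply(tmp,     x + h, y,     h);  /* hi(x) * lo(y) */
    add_into(out + h, tmp, n);
}

int main(void)
{
    /* 4-limb example; interpret as fractions by placing the point
       before the most significant limb */
    uint16_t x[4] = { 0x0000, 0x0000, 0x0000, 0x8000 };  /* 0.5  */
    uint16_t y[4] = { 0x0000, 0x0000, 0x0000, 0x4000 };  /* 0.25 */
    uint16_t out[8];
    full_multiply(out, x, y, 4);
    for (int i = 7; i >= 0; --i)
        printf("%04x%s", out[i], i ? " " : "\n");  /* expect 2000 0000 ... = 0.125 */
    return 0;
}
```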
Now assuming truncation and errors in the last few bits are fine, the recursive implementation becomes:
\[
\begin{array}{rccc}
\operatorname{hi}(x \times y) & = & \operatorname{hi}(x)\times\operatorname{hi}(y) & \cdots \\
 & + & \cdots & \operatorname{hi}(\operatorname{hi}(x)\times\operatorname{lo}(y)) \\
 & + & \cdots & \operatorname{hi}(\operatorname{lo}(x)\times\operatorname{hi}(y))
\end{array}
\]
The cost of an \(n\)-limb truncated multiply is \(1\) times the cost of an \(n/2\)-limb full multiply plus \(2\) times the cost of an \(n/2\)-limb truncated multiply. Working out the first few terms \[1,3,10,36,136,528,\ldots\] shows that it has about half the cost of the full multiply's \[1,4,16,64,256,1024,\ldots\] and in fact the overall cost works out at \(n(n+1)/2\) limb-by-limb multiplies, a significant improvement! This truly is a clever trick!
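A sketch of the truncated version in the same style, reusing add_into and full_multiply from above; note how the recursion matches the cost recurrence, one full multiply and two truncated multiplies at half width:

```c
/* out[0 .. n-1] ~= hi(x * y): the top n limbs of the 2n-limb product,
   with errors of a few units in the last place (carries in from the
   discarded low limbs are dropped) */
static void trunc_multiply(uint16_t *out, const uint16_t *x,
                           const uint16_t *y, int n)
{
    if (n == 1)
    {
        /* one limb-by-limb multiply, keep only the high limb */
        out[0] = (uint16_t) (((uint32_t) x[0] * y[0]) >> 16);
        return;
    }
    int h = n / 2;
    uint16_t tmp[h];                      /* scratch for cross products */
    full_multiply(out, x + h, y + h, h);  /* hi(x) * hi(y), n limbs */
    trunc_multiply(tmp, x + h, y, h);     /* ~ hi(hi(x) * lo(y)) */
    add_into(out, tmp, h);                /* aligned at the low end */
    trunc_multiply(tmp, x, y + h, h);     /* ~ hi(lo(x) * hi(y)) */
    add_into(out, tmp, h);
}
```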
However, for full multiplication there is a way to reduce multiplications even more dramatically. The trick is called Karatsuba multiplication and is based on the algebraic identity \[ (a + b)\times(u + v) = a \times u + b \times u + a \times v + b \times v \] Now set \(a = \operatorname{hi}(x), b = \operatorname{lo}(x), u = \operatorname{hi}(y), v = \operatorname{lo}(y) \) and notice that the four terms on the right are what we are computing with the first recursive implementation above.
Now rewrite the algebra to move terms around: \[ b \times u + a \times v = (a + b)\times(u + v) - (a \times u + b \times v) \] so we can replace two multiplications on the left by one multiplication on the right, with the two other multiplications on the right being things we need to compute anyway.
(There are two extra \(n/2\)-limb additions for \(a+b\) and \(u+v\), and one extra \(n\)-limb subtraction, which is guaranteed to give a non-negative result if all of \(a,b,u,v\) are non-negative; but limb-by-limb addition and subtraction usually cost much less than limb-by-limb multiplication, and an \(n\)-limb addition needs only \(n\) limb-by-limb additions, which is small.)
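Here's a sketch of the Karatsuba recursion in the same style (again mine; real implementations differ in the details), reusing add_into from the first sketch. One wrinkle the algebra glosses over: \(a+b\) and \(u+v\) don't quite fit in \(n/2\) limbs, each can overflow into a carry bit, which the code fixes up with cheap additions so that the count of limb-by-limb multiplications stays at \(3\) per level:

```c
/* sum[0 .. h-1] = a + b mod 2^(16h); returns the carry bit */
static uint16_t add_limbs(uint16_t *sum, const uint16_t *a,
                          const uint16_t *b, int h)
{
    uint32_t carry = 0;
    for (int i = 0; i < h; ++i)
    {
        uint32_t t = (uint32_t) a[i] + b[i] + carry;
        sum[i] = (uint16_t) t;
        carry = t >> 16;
    }
    return (uint16_t) carry;
}

/* dst -= src (m limbs), borrow propagating into higher limbs of dst;
   the caller must guarantee dst >= src overall */
static void sub_from(uint16_t *dst, const uint16_t *src, int m)
{
    uint32_t borrow = 0;
    for (int i = 0; i < m; ++i)
    {
        uint32_t t = (uint32_t) dst[i] - src[i] - borrow;
        dst[i] = (uint16_t) t;
        borrow = (t >> 16) & 1;
    }
    for (int i = m; borrow; ++i)
    {
        uint32_t t = (uint32_t) dst[i] - borrow;
        dst[i] = (uint16_t) t;
        borrow = (t >> 16) & 1;
    }
}

/* out[0 .. 2n-1] = x * y with 3 half-size multiplies instead of 4 */
static void karatsuba(uint16_t *out, const uint16_t *x,
                      const uint16_t *y, int n)
{
    if (n == 1)
    {
        uint32_t t = (uint32_t) x[0] * y[0];
        out[0] = (uint16_t) t;
        out[1] = (uint16_t) (t >> 16);
        return;
    }
    int h = n / 2;
    uint16_t s[h], t[h], m[n + 2];
    karatsuba(out,     x,     y,     h);      /* b*v -> low n limbs  */
    karatsuba(out + n, x + h, y + h, h);      /* a*u -> high n limbs */
    uint16_t ca = add_limbs(s, x, x + h, h);  /* a + b = ca*2^(16h) + s */
    uint16_t cu = add_limbs(t, y, y + h, h);  /* u + v = cu*2^(16h) + t */
    karatsuba(m, s, t, h);                    /* s*t, the third multiply */
    m[n] = 0; m[n + 1] = 0;
    /* fix up the dropped carry bits with additions only:
       (a+b)*(u+v) = s*t + ca*t*2^(16h) + cu*s*2^(16h) + ca*cu*2^(32h) */
    if (ca) add_into(m + h, t, h);
    if (cu) add_into(m + h, s, h);
    if (ca && cu) { uint16_t one = 1; add_into(m + n, &one, 1); }
    sub_from(m, out,     n);                  /* - b*v */
    sub_from(m, out + n, n);                  /* - a*u */
    add_into(out + h, m, n + 1);              /* + (a*v + b*u), half-shifted */
}
```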
This means the cost of \(n\)-limb full multiplication is \(3\) times the cost of \(n/2\)-limb full multiplication. The first terms are \[ 1,3,9,27,81,243,\ldots \] and the cost works out to \(n^{\log_2{3}}\), which is approximately \(n^{1.585}\). This reduced power (vs \(n^2\)) is an asymptotic improvement, while the truncation trick only improves the constant factor. For example, suppose \(n = 256\): the original method needs \(65536\) limb-by-limb multiplications (giving \(512\) limbs), the truncated method needs \(32896\) limb-by-limb multiplications (giving \(256\) limbs), and Karatsuba needs just \(6561\) limb-by-limb multiplications (giving \(512\) limbs). So an order of magnitude improvement!
But! This assumes the truncated multiplication is using the original full multiplication. Replacing the full multiplication with the Karatsuba multiplication, we get the cost of \(n\)-limb truncated multiplication is \(1\) times the cost of \(n/2\)-limb Karatsuba multiplication, plus \(2\) times the cost of \(n/2\)-limb truncated multiplication. This works out as... exactly the same cost as \(n\)-limb Karatsuba multiplication, only with a less precise answer!
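These recurrences are easy to check mechanically; here's a throwaway C program (mine) that counts the limb-by-limb multiplies implied by each scheme:

```c
#include <stdio.h>

/* count limb-by-limb multiplies implied by each recurrence in the text */
static long full_cost(int n)       { return n == 1 ? 1 : 4 * full_cost(n / 2); }
static long kara_cost(int n)       { return n == 1 ? 1 : 3 * kara_cost(n / 2); }
static long trunc_full_cost(int n) { return n == 1 ? 1 : full_cost(n / 2) + 2 * trunc_full_cost(n / 2); }
static long trunc_kara_cost(int n) { return n == 1 ? 1 : kara_cost(n / 2) + 2 * trunc_kara_cost(n / 2); }

int main(void)
{
    printf("%5s %10s %12s %10s %12s\n",
           "n", "full", "trunc(full)", "karatsuba", "trunc(kara)");
    for (int n = 1; n <= 256; n *= 2)
        printf("%5d %10ld %12ld %10ld %12ld\n", n,
               full_cost(n), trunc_full_cost(n), kara_cost(n), trunc_kara_cost(n));
    return 0;
}
```

For \(n = 256\) the last row reads \(65536\), \(32896\), \(6561\), \(6561\): the truncated-over-Karatsuba column matches the plain Karatsuba column exactly, all the way down.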
So the Karatsuba trick is very much cleverer than truncated multiply. A truncated multiply using Karatsuba might still be worth it: it needs half the space for output, and the number of limb-by-limb additions will be smaller. There are multiplication algorithms out there even more complicated than Karatsuba that get even better asymptotic efficiency (Toom–Cook and FFT-based methods, for example); go see how the GMP library does it.
Addendum: with 16-bit limbs on a Motorola 68000 CPU (as present in the Amiga A500(+)), the cost of a limb-by-limb multiply is 54 cycles in the average case (70 cycles in the worst case; add 12 cycles to load the inputs into registers from memory). The cost of a limb-by-limb addition with carry is 4 cycles (18 cycles if operating directly on memory, or 30 cycles to operate on pairs of limbs (32 bits) in memory at a time). Reference: 68000 instruction timing.
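For a rough sense of scale (my back-of-envelope, counting only the multiplies at the average 54 cycles and ignoring loads, additions, and loop overhead), at \(n = 256\) limbs: \[ 65536 \times 54 \approx 3.5 \times 10^6 \text{ cycles (schoolbook)} \qquad 6561 \times 54 \approx 3.5 \times 10^5 \text{ cycles (Karatsuba)} \] so even a generous allowance for Karatsuba's extra additions at 4 cycles per limb leaves it far ahead.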