# # Square Root

## # 1 Householder Reciprocal Square Root

See Householder’s method for a family of iterative reciprocal square root calculations that cost $$O(M(N) \log(N))$$ where $$M(N)$$ is the cost of multiplying numbers with precision $$N$$. You need one high precision reciprocal afterwards.

## # 2 Magic Number Reciprocal Square Root

Popularized by Quake 3, an approximate reciprocal square root accurate to about 5 bits (2% relative error) can be calculated cheaply by

#define MAGIC 0x5f37642f
float rsqrt_approx(float x)
{
int i = *(int*)& x;
i = MAGIC - (i >> 1);
return *(float*)& i;

}


This assumes the details of IEEE binary floating point format.

Similar magic numbers exist for double and quadruple precision (float128) but are largely useless if the intent is to refine to an accurate value with Newton’s method or similar: it’s faster to range reduce to single precision, use the single precision approximation, followed by Newton’s method with increasing precision.

Each Newton step roughly doubles the number of accurate bits, so 3 steps are enough for single, 4 for double, 5 for quadruple. The Newton step is:

x *= fma((-y/2) * x, x, 1.5);


For details of derivation of MAGIC see Chris Lomont’s paper on Fast Inverse Square Root.

Using Lomont’s $$r_0$$ I calculated for half, single, double, quadruple:

59bb
5f37642f
5fe6ec85e7de30db
5ffe6ec85e7de30dab0ff77452ab6769


where the biased exponent with $$e$$ bits in the format is $3 \times 2^{e-2} - 2$ and the mantissa including implicit leading 1 with $$m$$ explicit bits is $\lfloor 2^m \times (1 + r_0) + \frac{1}{2} \rfloor$ with $r_0 = 0.432744889959443195468521587014\ldots.$

This section was inspired by a Math.StackExchange question about fast inverse square root for quadruple precision.

## # 3 Vedic Square Root

I found a “Vedic” square root algorithm described at math.stackexchange.com.

The key step in the algorithm, which works digit by digit, is decomposing $(10 a + d)^2 = 10^2 a^2 + (10 (2 a) + d) d$

Split the input into pairs of adjacent digits (not overlapping the decimal point).

Set the accumulator ($$2 a$$ in the equation above) to 0.

Start calculating with the first pair as the number (which may be a single digit).

Repeat until desired accuracy:

1. Compute the largest $$d$$ such that $$(10 (2 a) + d) d$$ is less than or equal to the number.

2. Compute the remainder (the number minus $$(10 (2 a) + d) d$$).

3. Append the next pair of input digits to the remainder as the new number.

4. Multiply the accumulator $$2 a$$ by 10 and add $$2 d$$ to it.

5. Output the digit $$d$$.

This works for any integer base greater than or equal to 2 (just replace all 10 by the base).

Needs only shifts, multiplication by single digits, addition, subtraction (each $$O(N)$$). The accumulator grows, having the same number of digits as the output after each stage. This makes the total cost $$O(N^2)$$ for $$N$$ output digits.

Partial implementation without a nice API (requires an infinite stream of input digits). Laziness means you can take as much output as you need.

b :: Integer
b = 10

sqrt :: Integer -> Integer -> [Integer] -> [Integer]
sqrt acc s (x:y:zs) = d :
sqrt (b * acc + 2 * d) (b^2 * r + b * x + y) zs
where
d = maximum [ d | d <- [0..b-1], (b * acc + d) * d <= s ]
r = s - (b * acc + d) * d

-- sqrt 0 x (y:zs)         -- first pair is one digit
-- sqrt 0 (b * x + y) zs   -- first pair is two digits


### # 3.2 C Implementation

Finds 32.32 fixed point square root of 64.0 integer input.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

uint64_t isqrt(uint64_t x)
{
if (x == 0)
{
return 0;
}
int shift;
for (shift = 63; shift >= 0; --shift)
{
if (x & (1 << shift))
{
break;
}
}
// highest bit is 1 << shift
// group in pairs of digits
if (shift & 1)
{
--shift;
}
uint64_t s = 0;
uint64_t acc = 0;
uint64_t out = 0;
// stop at shift >= 0 for integer result
// we have space for 32 fractional digits in the output type
for (; shift >= -64; shift -= 2)
{
// get next pair of digits
uint64_t in = shift >= 0 ? ((x >> shift) & 3) : 0;
s <<= 2;
s += in;
uint64_t d = ((acc << 1) + 1) * 1 <= s;
uint64_t q = ((acc << 1) + d) * d;
s -= q;
out <<= 1;
out += d;
acc += d;
acc <<= 1;
}
return out;
}

int main(int argc, char **argv)
{
if (argc != 2)
{
return 1;
}
printf("%.10Lf\n", isqrt(atoll(argv)) / (long double) ((uint64_t) 1 << 32));
return 0;
}


## # 4 Perturbation

Sometimes it’s necessary to compute things like

$\sqrt{X + x} - \sqrt{X}$

which is inaccurate when $$|x| << |X|$$: if $$|x|$$ is small enough compared to $$|X|$$ and the precision available, then evaluating $$X + x$$ will just give X, and the final answer will be $$0$$ instead of a tiny number.

There is a neat trick using difference of squares factorization and symbolic algebraic manipulation:

\begin{aligned} \sqrt{X + x} - \sqrt{X} &= \left(\sqrt{X + x} - \sqrt{X}\right) \times \frac{\sqrt{X + x} + \sqrt{X}}{\sqrt{X + x} + \sqrt{X}} \\ &= \frac{\sqrt{X + x}^2 - \sqrt{X}^2}{\sqrt{X + x} + \sqrt{X}} \\ &= \frac{(X + x) - (X)}{\sqrt{X + x} + \sqrt{X}} \\ &= \frac{x}{\sqrt{X + x} + \sqrt{X}} \end{aligned}

When $$|x|$$ is tiny, this evaluates to $$\frac{x}{2\sqrt{X}}$$, no catastrophic cancellation problem any more.