# Floating Point

Floating point represents small and large values with similar relative accuracies in a fixed amount of space.

It is much like scientific notation: a fixed number of significant digits, multiplied by a power of the number base.
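As a rough illustration of "similar relative accuracies", the sketch below compares the gap to the next representable value with the value itself. It assumes Python 3.9 or later (for `math.ulp`) and that Python's `float` is an IEEE 754 double, which holds on mainstream platforms; `relative_spacing` is a name invented for this example.

```python
import math

def relative_spacing(x: float) -> float:
    """Gap between x and the next representable double, relative to x."""
    return math.ulp(x) / abs(x)

# Values spanning 200 orders of magnitude.
for x in (1e-100, 1.0, 12345.678, 1e100):
    print(f"{x:>12g}  ulp = {math.ulp(x):.3e}  relative = {relative_spacing(x):.3e}")
```

The absolute gap varies enormously, but for normal doubles the relative gap always stays between 2^-53 and 2^-52.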

# 1 Representable Values

Special values: $$+0, -0, +\infty, -\infty$$, and many distinct NaN (not-a-number) bit patterns.

Normal numbers of the form $\pm M \times 2^{E - B - (P-1)}, \quad M \in [2^{P-1}, 2^P) \subset \mathbb{N}, \quad E \in[1, 2^R - 1) \subset \mathbb{N}.$

Denormalized numbers of the form $\pm M \times 2^{1-B - (P-1)}, \quad M \in [1, 2^{P-1})\subset \mathbb{N}.$

Parameters:

• $$2$$: base (the formulas above assume $$2$$; $$10$$ is sometimes used, other bases rarely)

• $$P$$: precision (number of mantissa bits including implicit leading $$1$$)

• $$R$$: range (number of bits for exponent)

• $$B$$: bias (offset for exponent, usually midway: $$B = 2^{R-1}-1$$).
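The two formulas can be checked against real doubles. The following is a minimal sketch, assuming Python's `float` is an IEEE 754 double ($$P = 53$$, $$R = 11$$); the helper name `decode` and its return convention are choices made for this example only.

```python
import math
import struct

P, R = 53, 11            # precision and exponent bits of an IEEE 754 double
B = 2 ** (R - 1) - 1     # bias, 1023

def decode(x: float):
    """Return (sign, M, power) such that x == sign * M * 2**power."""
    (bits,) = struct.unpack("<Q", struct.pack("<d", x))
    sign = -1 if bits >> (P - 1 + R) else 1
    E = (bits >> (P - 1)) & (2 ** R - 1)       # stored exponent field
    fraction = bits & (2 ** (P - 1) - 1)       # stored mantissa bits
    if E == 2 ** R - 1:
        raise ValueError("infinity or NaN has no (M, E) form")
    if E == 0:
        # zero (M == 0) or denormalized: no implicit leading 1
        return sign, fraction, 1 - B - (P - 1)
    # normal: restore the implicit leading 1
    return sign, fraction + 2 ** (P - 1), E - B - (P - 1)

for x in (1.0, -0.1, 5e-324):
    sign, M, power = decode(x)
    assert sign * math.ldexp(M, power) == x    # exact reconstruction
    print(x, "=", sign, "*", M, "* 2 **", power)
```

Because $$M$$ is an integer below $$2^P$$ and the scale factor is a power of two, `math.ldexp` reconstructs the value exactly, so the equality check involves no rounding.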

# 2 Standard Types

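| Name | Size (bits) | Precision $$P$$ (bits) | Range $$R$$ (bits) | Smallest denormal | Smallest normal | Largest |
|---|---|---|---|---|---|---|
| half | 16 | 11 | 5 | 6.0e-8 | 6.10e-5 | 65504.0 |
| single | 32 | 24 | 8 | 1.0e-45 | 1.18e-38 | 3.40e38 |
| double | 64 | 53 | 11 | 5.0e-324 | 2.23e-308 | 1.80e308 |
| x87 long double | 80 | 64 | 15 | 4.0e-4951 | 3.36e-4932 | 1.19e4932 |
| quad double | 128 | 113 | 15 | 6.0e-4966 | 3.36e-4932 | 1.19e4932 |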
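The three right-hand columns of the table follow from the precision and range parameters alone. The sketch below derives them for half, single, and double; it assumes Python's `float` is an IEEE 754 double, which is why the two wider formats are left out (their limits overflow a double). The function name `limits` is hypothetical.

```python
import math

def limits(P: int, R: int):
    """Smallest denormal, smallest normal, and largest finite value for a
    binary format with P precision bits and R exponent bits."""
    B = 2 ** (R - 1) - 1           # bias
    shift = B + (P - 1)            # common part of the exponent in both formulas
    smallest_denormal = math.ldexp(1, 1 - shift)             # denormal, M = 1
    smallest_normal = math.ldexp(2 ** (P - 1), 1 - shift)    # M = 2^(P-1), E = 1
    largest = math.ldexp(2 ** P - 1, 2 ** R - 2 - shift)     # M = 2^P - 1, E = 2^R - 2
    return smallest_denormal, smallest_normal, largest

for name, P, R in (("half", 11, 5), ("single", 24, 8), ("double", 53, 11)):
    d, s, l = limits(P, R)
    print(f"{name:<7} {d:.2e} {s:.2e} {l:.2e}")
```

The printed values should agree with the first three rows of the table up to the table's rounding.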

# 3 Data Format

Sign bit.

• clear: positive
• set: negative

Exponent bits.

• all clear: zero (mantissa zero) or denormalized (mantissa nonzero)
• all set: infinity (mantissa zero) or NaN (mantissa nonzero)
• otherwise: normalized number

Mantissa bits.

• leading 1 is implicit for normalized numbers
• leading 1 is explicit for denormalized numbers
• negative numbers are not complemented (just change sign bit)

Note: x87 long double is different: it stores the leading mantissa bit explicitly, even for normalized numbers.
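The field rules above can be written as a small classifier. This sketch uses the 16-bit half format (1 sign bit, 5 exponent bits, 10 stored mantissa bits) because its bit patterns are short enough to read; `classify` is a hypothetical helper, and the `"e"` format code of Python's `struct` module is used only to show the decoded value next to each pattern.

```python
import struct

def classify(bits: int) -> str:
    """Classify a 16-bit half-precision pattern by its fields."""
    sign = "-" if bits & 0x8000 else "+"
    exponent = (bits >> 10) & 0x1F     # 5 exponent bits
    mantissa = bits & 0x3FF            # 10 stored mantissa bits
    if exponent == 0:
        kind = "zero" if mantissa == 0 else "denormalized"
    elif exponent == 0x1F:
        kind = "infinity" if mantissa == 0 else "NaN"
    else:
        kind = "normalized"
    return sign + kind

for bits in (0x0000, 0x8000, 0x0001, 0x3C00, 0xFBFF, 0x7C00, 0x7E00):
    (value,) = struct.unpack("<e", bits.to_bytes(2, "little"))
    print(f"0x{bits:04X}  {classify(bits):<14} {value!r}")
```

Flipping only the top bit turns `0x0000` into `0x8000`, which is negative zero: the sign is a separate bit, not a complement of the other fields.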

# 4 long double

In the C programming language, the format of long double is implementation-defined.

The language standard says only that it has at least as much precision as double. Some platforms make long double an alias for double (MSVC, armv7). Others use the x87 format (i386, x86_64/amd64), which may be padded to 96 bits or 128 bits, depending on compiler options and architecture. Still others (aarch64) give long double the same format as _Float128, which is another name for quad double.
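One way to see which of these cases applies on a given machine is to look at how the platform's C long double stores the value 1.0. The sketch below does this through Python's `ctypes` module, and is only a heuristic: it assumes a little-endian host and recognises just the three layouts described here, and the function name `describe_long_double` is invented for this example.

```python
import ctypes
import sys

def describe_long_double() -> str:
    """Guess the format of the platform's C long double from the bytes of 1.0."""
    size = ctypes.sizeof(ctypes.c_longdouble)
    raw = bytes(ctypes.c_longdouble(1.0))
    if sys.byteorder != "little":
        return f"{size} bytes: format not identified (big-endian host)"
    if size == 8:
        return "8 bytes: same format as double"
    # x87: explicit leading 1 (0x80) in byte 7, then exponent field 0x3FFF.
    if size in (12, 16) and raw[7:10] == b"\x80\xff\x3f":
        return f"{size} bytes: x87 format, padded from 10 bytes"
    # quad: exponent field 0x3FFF in the top two bytes, all mantissa bits zero.
    if size == 16 and raw[14:16] == b"\xff\x3f" and raw[:14] == bytes(14):
        return "16 bytes: quad double"
    return f"{size} bytes: format not identified"

print(describe_long_double())
```

Only the first 10 bytes are inspected in the x87 case, since the padding bytes carry no defined value.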