# Floating Point
Floating point represents small and large values with similar relative accuracies in a fixed amount of space.
It’s much like scientific notation, with a fixed number of significant digits, multiplied by a power of the number base.
# 1 Representable Values
\(+0, -0, +\infty, -\infty\), and lots of NaN.
Normal numbers of the form \[\pm M \times 2^{E - B - (P-1)}, \quad M \in [2^{P-1}, 2^P) \subset \mathbb{N}, \quad E \in[1, 2^R - 1) \subset \mathbb{N}.\]
Denormalized numbers of the form \[\pm M \times 2^{1-B - (P-1)}, \quad M \in [1, 2^{P-1})\subset \mathbb{N}.\]
Parameters:
- \(2\): base (usually \(2\), sometimes \(10\), rarely anything else)
- \(P\): precision (number of mantissa bits including implicit leading \(1\))
- \(R\): range (number of bits for exponent)
- \(B\): bias (offset for exponent, usually midway: \(B = 2^{R-1}-1\)).
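
As a rough illustration, the two formulas above can be evaluated exactly with rational arithmetic. This is only a sketch: the function names `bias`, `normal_value` and `denormal_value` are made up for this example, and the bias is assumed to be the usual midway value.

```python
from fractions import Fraction

def bias(R):
    # Usual midway bias, B = 2^(R-1) - 1 (an assumption; other choices are possible).
    return 2 ** (R - 1) - 1

def normal_value(sign, M, E, P, R):
    """Evaluate sign * M * 2^(E - B - (P-1)) for a normal number."""
    if not 2 ** (P - 1) <= M < 2 ** P:
        raise ValueError("M out of range for a normal number")
    if not 1 <= E < 2 ** R - 1:
        raise ValueError("E out of range for a normal number")
    return sign * M * Fraction(2) ** (E - bias(R) - (P - 1))

def denormal_value(sign, M, P, R):
    """Evaluate sign * M * 2^(1 - B - (P-1)) for a denormalized number."""
    if not 1 <= M < 2 ** (P - 1):
        raise ValueError("M out of range for a denormalized number")
    return sign * M * Fraction(2) ** (1 - bias(R) - (P - 1))

# Double precision has P = 53 and R = 11.
one = normal_value(+1, 2 ** 52, 1023, 53, 11)                 # exactly 1
tenth = normal_value(+1, 0x1_9999_9999_9999_A, 1019, 53, 11)  # the double nearest 0.1
tiny = denormal_value(+1, 1, 53, 11)                          # smallest positive double
```

Comparing `tenth` with `Fraction(0.1)` shows that the double written as `0.1` is exactly this rational number, not one tenth.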
# 2 Standard Types
| Name | Size (bits) | Precision \(P\) | Range \(R\) | Smallest denormal | Smallest normal | Largest |
|---|---|---|---|---|---|---|
| half | 16 | 11 | 5 | 6.0e-8 | 6.10e-5 | 65504.0 |
| single | 32 | 24 | 8 | 1.4e-45 | 1.18e-38 | 3.40e38 |
| double | 64 | 53 | 11 | 4.9e-324 | 2.23e-308 | 1.80e308 |
| x87 long double | 80 | 64 | 15 | 3.6e-4951 | 3.36e-4932 | 1.19e4932 |
| quad double | 128 | 113 | 15 | 6.5e-4966 | 3.36e-4932 | 1.19e4932 |
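
The last three columns follow from \(P\) and \(R\) through the formulas in section 1. A small sketch that recomputes them exactly, assuming the midway bias (the function name `limits` is hypothetical):

```python
from fractions import Fraction
import sys

def limits(P, R):
    """Return (smallest denormal, smallest normal, largest finite) as exact Fractions."""
    B = 2 ** (R - 1) - 1                   # midway bias (assumed)
    two = Fraction(2)
    # Denormal formula with M = 1.
    smallest_denormal = two ** (1 - B - (P - 1))
    # Normal formula with M = 2^(P-1), E = 1.
    smallest_normal = two ** (1 - B)
    # Normal formula with M = 2^P - 1 and the largest normal exponent E = 2^R - 2.
    largest = (2 ** P - 1) * two ** ((2 ** R - 2) - B - (P - 1))
    return smallest_denormal, smallest_normal, largest

# Double precision fits in a Python float, so the results can be printed directly.
d, n, m = limits(53, 11)
print(float(d), float(n), float(m))
```

For the 80-bit and 128-bit rows the results are too small or too large for a Python float, so they have to stay as `Fraction` values (or be converted with the `decimal` module) for display.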
# 3 Data Format
Sign bit.
- clear: positive
- set: negative
Exponent bits.
- all clear: zero (mantissa zero) or denormalized (mantissa nonzero)
- all set: infinity (mantissa zero) or NaN (mantissa nonzero)
- otherwise: normalized number
Mantissa bits.
- leading 1 is implicit for normalized numbers
- denormalized numbers have no implicit leading 1 (every significant bit is stored explicitly)
- negative numbers are not complemented (just change sign bit)
Note: x87 long double is different: it stores the leading mantissa bit explicitly, even for normalized numbers.
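
The field rules above can be sketched for single precision as follows. The helper names `split`, `classify` and `float_bits` are made up for this example, and the code only covers formats with an implicit leading bit (so not x87 long double).

```python
import struct

P, R = 24, 8    # single precision: 1 sign bit, 8 exponent bits, 23 stored mantissa bits

def split(bits):
    """Split a 32-bit pattern into (sign, exponent, mantissa) fields."""
    sign = bits >> (R + P - 1)
    exponent = (bits >> (P - 1)) & (2 ** R - 1)
    mantissa = bits & (2 ** (P - 1) - 1)
    return sign, exponent, mantissa

def classify(bits):
    """Name the kind of value a 32-bit pattern encodes."""
    sign, exponent, mantissa = split(bits)
    if exponent == 0:                      # all exponent bits clear
        return "zero" if mantissa == 0 else "denormalized"
    if exponent == 2 ** R - 1:             # all exponent bits set
        return "infinity" if mantissa == 0 else "NaN"
    return "normalized"

def float_bits(x):
    """Bit pattern of x after rounding to single precision."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

for x in (1.0, -0.0, 1e-45, float("inf"), float("nan")):
    print(x, split(float_bits(x)), classify(float_bits(x)))
```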
# 4 long double
In the C programming language, long double is implementation dependent.
The language standard says only that it has
at least as much precision as double.
Some platforms have long double as an alias for double (MSVC, armv7).
Others use the x87 format (i386, x86_64/amd64).
The x87 format may be padded to 96 bits or 128 bits,
depending on compiler options and architecture.
Some other platforms alias long double to _Float128
which is another name for quad double (aarch64).
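
One rough way to see what a given platform does is to ask for the storage size of its C long double. The sketch below uses Python's `ctypes` for that; the interpretation strings are only guesses, since size alone cannot tell a padded x87 value from a true 128-bit format. In C itself, `LDBL_MANT_DIG` from `<float.h>` reports the precision directly.

```python
import ctypes

# Storage size of the platform's C long double, in bits, including any padding.
bits = 8 * ctypes.sizeof(ctypes.c_longdouble)

if bits == 64:
    guess = "same size as double, probably an alias for it"
elif bits == 96:
    guess = "probably the x87 format padded to 96 bits"
elif bits == 128:
    guess = "the x87 format padded to 128 bits, or a 128-bit format"
else:
    guess = "some other format"

print(bits, guess)
```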