# Floating Point
Floating point represents small and large values with similar relative accuracies in a fixed amount of space.
It’s much like scientific notation: a fixed number of significant digits, multiplied by a power of the number base.
# 1 Representable Values
Special values: \(+0, -0, +\infty, -\infty\), and many distinct NaN (not-a-number) encodings.
Normal numbers of the form \[\pm M \times 2^{E - B - (P-1)}, \quad M \in [2^{P-1}, 2^P) \subset \mathbb{N}, \quad E \in[1, 2^R - 1) \subset \mathbb{N}.\]
Denormalized numbers of the form \[\pm M \times 2^{1-B - (P-1)}, \quad M \in [1, 2^{P-1})\subset \mathbb{N}.\]
Parameters:
- \(2\): base (usually \(2\), sometimes \(10\), rarely anything else)
- \(P\): precision (number of mantissa bits including implicit leading \(1\))
- \(R\): range (number of bits for exponent)
- \(B\): bias (offset for exponent, usually midway: \(B = 2^{R-1}-1\)).
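The two formulas above can be evaluated exactly with rational arithmetic. The following is a minimal sketch, assuming the usual midway bias \(B = 2^{R-1}-1\); the helper names `normal_value` and `denormal_value` are made up for illustration and are not part of any library.

```python
from fractions import Fraction


def normal_value(sign, M, E, P, R):
    """Exact value of the normal number sign * M * 2**(E - B - (P-1))."""
    B = 2 ** (R - 1) - 1                 # assumed midway bias
    assert 2 ** (P - 1) <= M < 2 ** P    # leading bit of M is set
    assert 1 <= E < 2 ** R - 1           # all-clear and all-set are reserved
    return sign * M * Fraction(2) ** (E - B - (P - 1))


def denormal_value(sign, M, P, R):
    """Exact value of the denormalized number sign * M * 2**(1 - B - (P-1))."""
    B = 2 ** (R - 1) - 1                 # assumed midway bias
    assert 1 <= M < 2 ** (P - 1)         # leading bit of M is clear
    return sign * M * Fraction(2) ** (1 - B - (P - 1))


# Double precision has P = 53 and R = 11.
one = normal_value(+1, 2 ** 52, 1023, P=53, R=11)
largest = normal_value(+1, 2 ** 53 - 1, 2046, P=53, R=11)
tiniest = denormal_value(+1, 1, P=53, R=11)
print(one, float(largest), float(tiniest))
```

Using `Fraction` keeps every intermediate result exact, so the values can be compared directly against hardware doubles without rounding concerns.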
# 2 Standard Types
Name | Size (bits) | Precision \(P\) (bits) | Range \(R\) (bits) | Smallest denormal | Smallest normal | Largest |
---|---|---|---|---|---|---|
half | 16 | 11 | 5 | 6.0e-8 | 6.10e-5 | 65504.0 |
single | 32 | 24 | 8 | 1.4e-45 | 1.18e-38 | 3.40e38 |
double | 64 | 53 | 11 | 4.9e-324 | 2.23e-308 | 1.80e308 |
x87 long double | 80 | 64 | 15 | 3.6e-4951 | 3.36e-4932 | 1.19e4932 |
quad double | 128 | 113 | 15 | 6.5e-4966 | 3.36e-4932 | 1.19e4932 |
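The last three columns of the table follow from \(P\) and \(R\) alone. Here is a rough sketch that derives them, again assuming the midway bias \(B = 2^{R-1}-1\); `limits` and `sci` are hypothetical helper names, and `sci` formats through base-10 logarithms, so its output is approximate rather than exact.

```python
import math


def limits(P, R):
    """Return (smallest denormal, smallest normal, largest) as (M, e) pairs.

    Each pair stands for the exact value M * 2**e.
    """
    B = 2 ** (R - 1) - 1                      # assumed midway bias
    e_min = 1 - B - (P - 1)                   # exponent shared by E = 1 and denormals
    e_max = (2 ** R - 2) - B - (P - 1)        # largest non-reserved E
    return (1, e_min), (2 ** (P - 1), e_min), (2 ** P - 1, e_max)


def sci(M, e):
    """Approximate M * 2**e in scientific notation with three digits."""
    log = math.log10(M) + e * math.log10(2)
    exp10 = math.floor(log)
    return f"{10 ** (log - exp10):.2f}e{exp10}"


formats = {
    "half": (11, 5),
    "single": (24, 8),
    "double": (53, 11),
    "x87 long double": (64, 15),
    "quad double": (113, 15),
}
for name, (P, R) in formats.items():
    print(name, *(sci(M, e) for M, e in limits(P, R)))
```

Working with logarithms avoids building integers with thousands of digits for the 15-bit exponent formats, at the cost of exactness in the printed digits.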
# 3 Data Format
Sign bit.
- clear: positive
- set: negative
Exponent bits.
- all clear: zero (mantissa zero) or denormalized (mantissa nonzero)
- all set: infinity (mantissa zero) or NaN (mantissa nonzero)
- otherwise: normalized number
Mantissa bits.
- leading 1 is implicit for normalized numbers
- leading 1 is explicit for denormalized numbers (there is no implicit bit; the first set bit is stored in the mantissa field)
- negative numbers are not complemented (just change sign bit)
Note: x87 `long double` is different: it stores the leading mantissa bit explicitly, even for normalized numbers.
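The field rules above can be checked on real values by splitting a double into its bits. This is a small sketch for the 64-bit double format only (1 sign bit, 11 exponent bits, 52 stored mantissa bits); `decode_double` is an illustrative name, not a standard function.

```python
import struct


def decode_double(x):
    """Split a Python float (IEEE double) into sign, exponent, mantissa, kind."""
    # Reinterpret the 8 bytes of the double as an unsigned 64-bit integer.
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    sign = bits >> 63
    exponent = (bits >> 52) & 0x7FF
    mantissa = bits & ((1 << 52) - 1)
    if exponent == 0:
        kind = "zero" if mantissa == 0 else "denormal"
    elif exponent == 0x7FF:
        kind = "infinity" if mantissa == 0 else "nan"
    else:
        kind = "normal"
    return sign, exponent, mantissa, kind


for value in (1.0, -0.0, 5e-324, float("inf"), float("nan")):
    print(value, decode_double(value))
```

The stored exponent for `1.0` is the bias itself (1023), and the stored mantissa is zero because the leading 1 is implicit.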
# 4 long double
In the C programming language, `long double` is implementation dependent.
The language standard says only that it has at least as much precision as `double`.
Some platforms have `long double` as an alias for `double` (MSVC, armv7).
Others use the x87 format (i386, x86_64/amd64).
The x87 format may be padded to 96 bits or 128 bits, depending on compiler options and architecture.
Some other platforms alias `long double` to `_Float128`, which is another name for quad double (aarch64).
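One quick way to see what a given platform does is to ask for the storage size of the C types. The sketch below uses Python's `ctypes` for this. The size alone does not identify the format: 16 bytes may be a padded x87 value or a quad double, so treat it only as a hint. In C itself, `LDBL_MANT_DIG` from `<float.h>` reports the precision directly.

```python
import ctypes

# Storage sizes in bytes, as seen by the C compiler that built Python.
double_bytes = ctypes.sizeof(ctypes.c_double)
long_double_bytes = ctypes.sizeof(ctypes.c_longdouble)

print("double:     ", double_bytes * 8, "bits")
print("long double:", long_double_bytes * 8, "bits")

if long_double_bytes == double_bytes:
    print("long double is likely an alias for double")
else:
    print("long double is wider than double (padded x87 or quad double)")
```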