# Floating Point
Floating point represents small and large values with similar relative accuracies in a fixed amount of space.
It’s much like scientific notation, with a fixed number of significant digits, multiplied by a power of the number base.
# 1 Representable Values
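- $+0$, $-0$, $+\infty$, $-\infty$, and lots of NaNs.
- Normal numbers of the form $\pm M \times 2^{E-B-(P-1)}$, $M \in [2^{P-1}, 2^{P}) \subset \mathbb{N}$, $E \in [1, 2^{R}-1) \subset \mathbb{N}$.
- Denormalized numbers of the form $\pm M \times 2^{1-B-(P-1)}$, $M \in [1, 2^{P-1}) \subset \mathbb{N}$.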
Parameters:
- $2$: base (usually 2, sometimes 10, rarely anything else)
- $P$: precision (number of mantissa bits including implicit leading 1)
- $R$: range (number of bits for exponent)
- $B$: bias (offset for exponent, usually midway: $B = 2^{R-1} - 1$).
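As a rough check of the formulas above, the following sketch rebuilds a finite `double` from its bit fields. It assumes that `double` is the 64-bit binary format with $P = 53$ and $R = 11$; the function name `decode_finite` is made up for this illustration.

```c
#include <assert.h>
#include <math.h>
#include <stdint.h>
#include <string.h>

/* Parameters for a 64-bit binary double (assumed, not checked exhaustively). */
enum { P = 53, R = 11, B = (1 << (R - 1)) - 1 };   /* bias B = 2^(R-1) - 1 */

/* Rebuild a finite double as +/- M * 2^(E - B - (P - 1)).
   Denormals and zeros use E = 1 and have no implicit leading 1. */
double decode_finite(double x)
{
    uint64_t bits;
    assert(sizeof bits == sizeof x);
    memcpy(&bits, &x, sizeof bits);              /* copy out the raw bits */

    int      sign = (int)(bits >> 63);
    int      e    = (int)((bits >> (P - 1)) & ((1u << R) - 1));
    uint64_t m    = bits & ((UINT64_C(1) << (P - 1)) - 1);

    assert(e != (1 << R) - 1);                   /* infinities and NaNs are excluded */

    if (e != 0)
        m |= UINT64_C(1) << (P - 1);             /* normal: restore the implicit 1 */
    else
        e = 1;                                   /* denormal or zero */

    /* m < 2^P, so the conversion and the scaling are both exact. */
    double magnitude = ldexp((double)m, e - B - (P - 1));
    return sign ? -magnitude : magnitude;
}
```

For any finite input, `decode_finite(x) == x` should hold, since the function only takes the value apart and puts it back together.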
# 2 Standard Types
Name | Size (bits) | Precision (bits) | Range (bits) | Smallest denormal | Smallest normal | Largest |
---|---|---|---|---|---|---|
half | 16 | 11 | 5 | 6.0e-8 | 6.10e-5 | 65504.0 |
single | 32 | 24 | 8 | 1.0e-45 | 1.18e-38 | 3.40e38 |
double | 64 | 53 | 11 | 5.0e-324 | 2.23e-308 | 1.80e308 |
x87 long double | 80 | 64 | 15 | 4.0e-4951 | 3.36e-4932 | 1.19e4932 |
quad double | 128 | 113 | 15 | 6.0e-4966 | 3.36e-4932 | 1.19e4932 |
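Some of these rows can be compared against the constants in `<float.h>`. The sketch below assumes a C11 compiler (the `*_TRUE_MIN` macros were added in C11) and base 2; the function names `print_row` and `print_limits` are invented for the example. The `long double` row will vary by platform.

```c
#include <assert.h>
#include <float.h>
#include <stdio.h>

/* Print one row: name, precision in bits, smallest denormal,
   smallest normal, and largest finite value. */
static void print_row(const char *name, int precision,
                      long double denormal, long double smallest,
                      long double largest)
{
    printf("%-12s %3d  %.2Le  %.2Le  %.2Le\n",
           name, precision, denormal, smallest, largest);
}

/* Print the limits of the three standard C floating point types. */
void print_limits(void)
{
    assert(FLT_RADIX == 2);   /* the table assumes base 2 */
    print_row("single",      FLT_MANT_DIG,  FLT_TRUE_MIN,  FLT_MIN,  FLT_MAX);
    print_row("double",      DBL_MANT_DIG,  DBL_TRUE_MIN,  DBL_MIN,  DBL_MAX);
    print_row("long double", LDBL_MANT_DIG, LDBL_TRUE_MIN, LDBL_MIN, LDBL_MAX);
}
```

Calling `print_limits()` from `main` prints one line per type. The values are shown with three significant digits, so the denormal column may differ slightly from the one-digit rounding used in the table.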
# 3 Data Format
Sign bit.
- clear: positive
- set: negative
Exponent bits.
- all clear: zero (mantissa zero) or denormalized (mantissa nonzero)
- all set: infinity (mantissa zero) or NaN (mantissa nonzero)
- otherwise: normalized number
Mantissa bits.
- leading 1 is implicit for normalized numbers
- no implicit leading 1 for denormalized numbers (the leading bit is 0, and every 1 bit is stored explicitly)
- negative numbers are not complemented (just change sign bit)
Note: x87 `long double` is different: its leading mantissa bit is always stored explicitly.
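The layout described in this section can be sketched for a 32-bit `float` (1 sign bit, 8 exponent bits, 23 stored mantissa bits). The type and function names here (`float_fields`, `float_kind`, `split_float`, `classify`) are hypothetical, and the code assumes `float` is the 32-bit binary format.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef enum {
    KIND_ZERO, KIND_DENORMAL, KIND_NORMAL, KIND_INFINITY, KIND_NAN
} float_kind;

typedef struct {
    unsigned sign;      /* 1 bit: clear is positive, set is negative */
    unsigned exponent;  /* 8 bits, still biased */
    uint32_t mantissa;  /* 23 stored bits, without the implicit leading 1 */
} float_fields;

/* Split a float into its sign, exponent, and mantissa fields. */
float_fields split_float(float x)
{
    uint32_t bits;
    assert(sizeof bits == sizeof x);
    memcpy(&bits, &x, sizeof bits);   /* copy out the raw bits */

    float_fields f;
    f.sign     = bits >> 31;
    f.exponent = (bits >> 23) & 0xFFu;
    f.mantissa = bits & 0x7FFFFFu;
    return f;
}

/* Classify by exponent bits first, then by whether the mantissa is zero. */
float_kind classify(float_fields f)
{
    if (f.exponent == 0)        /* all clear */
        return f.mantissa == 0 ? KIND_ZERO : KIND_DENORMAL;
    if (f.exponent == 0xFFu)    /* all set */
        return f.mantissa == 0 ? KIND_INFINITY : KIND_NAN;
    return KIND_NORMAL;
}
```

For example, `split_float(1.0f)` yields a biased exponent of 127 and a zero mantissa field, because the leading 1 is implicit.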
# 4 long double
In the C programming language, `long double` is implementation dependent.
The language standard says only that it has at least as much precision as `double`.
Some platforms have `long double` as an alias for `double` (MSVC, armv7).
Others use the x87 format (i386, x86_64/amd64).
The x87 format may be padded to 96 bits or 128 bits, depending on compiler options and architecture.
Some other platforms alias `long double` to `_Float128`, which is another name for quad double (aarch64).
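One way to find out which of these cases applies on a given compiler is to look at `LDBL_MANT_DIG` and `sizeof(long double)`. This is only a sketch: the function names are invented here, and the mapping from precision to format name is an assumption based on the table of standard types.

```c
#include <assert.h>
#include <float.h>
#include <limits.h>

/* Guess the long double format from its precision in bits. */
const char *long_double_format(void)
{
    /* The standard guarantees at least the precision of double. */
    assert(LDBL_MANT_DIG >= DBL_MANT_DIG);

    switch (LDBL_MANT_DIG) {
    case 53:  return "double";
    case 64:  return "x87 long double";
    case 113: return "quad double";
    default:  return "other";   /* some format not listed in the table */
    }
}

/* Storage size in bits, including any padding (96 or 128 for x87). */
unsigned long_double_storage_bits(void)
{
    return (unsigned)(sizeof(long double) * CHAR_BIT);
}
```

On an x86_64 Linux system one would expect `"x87 long double"` with 128 storage bits, but the result depends on the compiler and its options.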