Floating Point
Floating point represents small and large values with similar relative accuracies in a fixed amount of space.
It is much like scientific notation: a fixed number of significant digits, multiplied by a power of the number base.
1 Representable Values
\(+0, -0, +\infty, -\infty\), and lots of NaN values.
Normal numbers of the form \[\pm M \times 2^{E - B - (P-1)}, \quad M \in [2^{P-1}, 2^P) \subset \mathbb{N}, \quad E \in [1, 2^R - 1) \subset \mathbb{N}.\]
Denormalized numbers of the form \[\pm M \times 2^{1 - B - (P-1)}, \quad M \in [1, 2^{P-1}) \subset \mathbb{N}.\]
Parameters:

\(2\): base (usually \(2\), sometimes \(10\), rarely anything else)

\(P\): precision (number of mantissa bits including implicit leading \(1\))

\(R\): range (number of bits for exponent)

\(B\): bias (offset for exponent, usually midway: \(B = 2^{R-1} - 1\)).
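As a worked instance of the two forms above, take single precision, assuming the standard parameters \(P = 24\) and \(R = 8\), so that \(B = 2^{7} - 1 = 127\). The value one, the largest finite value and the smallest denormalized value then come out as:

\[1 = 2^{23} \times 2^{127 - 127 - 23} \quad (M = 2^{23},\ E = 127),\]
\[(2^{24} - 1) \times 2^{254 - 127 - 23} = (2^{24} - 1) \times 2^{104} \approx 3.40 \times 10^{38} \quad (M = 2^{24} - 1,\ E = 254),\]
\[1 \times 2^{1 - 127 - 23} = 2^{-149} \approx 1.4 \times 10^{-45} \quad (M = 1).\]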
2 Standard Types
Name             Size (bits)  Precision (bits)  Range (bits)  Smallest denormal  Smallest normal  Largest finite

half             16           11                5             6.0e-8             6.10e-5          65504.0
single           32           24                8             1.4e-45            1.18e-38         3.40e38
double           64           53                11            4.9e-324           2.23e-308        1.80e308
x87 long double  80           64                15            3.6e-4951          3.36e-4932       1.19e4932
quad double      128          113               15            6.5e-4966          3.36e-4932       1.19e4932
3 Data Format
Sign bit.
 clear: positive
 set: negative
Exponent bits.
 all clear: zero (mantissa zero) or denormalized (mantissa nonzero)
 all set: infinity (mantissa zero) or NaN (mantissa nonzero)
 otherwise: normalized number
Mantissa bits.
 leading 1 is implicit for normalized numbers
 leading 1 is explicit for denormalized numbers
 negative numbers are not complemented (just change sign bit)
Note: x87 long double is different: it stores the leading mantissa bit explicitly, even for normalized numbers.
4 long double
In the C programming language, long double is implementation dependent.
The language standard says only that it has at least as much precision as double.
Some platforms have long double as an alias for double (MSVC, armv7).
Others use the x87 format (i386, x86_64/amd64).
The x87 format may be padded to 96 bits or 128 bits, depending on compiler options and architecture.
Some other platforms alias long double to _Float128, which is another name for quad double (aarch64).