# Floating Point

Floating point represents small and large values with similar relative accuracies in a fixed amount of space.

It is much like scientific notation: a fixed number of significant digits, multiplied by a power of the number base.
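As a rough illustration of "similar relative accuracies", the sketch below measures the gap between a double and its next larger neighbour, relative to the value itself. It assumes Python 3.9 or newer (for `math.ulp`) and that Python's `float` is an IEEE double; `relative_spacing` is a name made up for this example.

```python
import math  # math.ulp requires Python 3.9 or newer


def relative_spacing(x: float) -> float:
    """Gap from x to the next larger double, divided by x (x > 0, normal)."""
    return math.ulp(x) / x


# Values 600 orders of magnitude apart have nearly the same relative spacing.
for x in (1e-300, 1.0, 1e300):
    print(f"x = {x:g}: relative spacing = {relative_spacing(x):.3g}")
```

For normal doubles the ratio always lies between $2^{-53}$ and $2^{-52}$, whatever the magnitude of the value.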

# 1 Representable Values

Special values: $+0, -0, +\infty, -\infty$, and many distinct NaN (not a number) encodings.

Normal numbers of the form $\pm M \times 2^{E - B - (P-1)}, \quad M \in [2^{P-1}, 2^P) \subset \mathbb{N}, \quad E \in[1, 2^R - 1) \subset \mathbb{N}.$

Denormalized numbers of the form $\pm M \times 2^{1-B - (P-1)}, \quad M \in [1, 2^{P-1})\subset \mathbb{N}.$

Parameters:

- $2$ : base (usually $2$, sometimes $10$, rarely anything else)
- $P$ : precision (number of mantissa bits including implicit leading $1$)
- $R$ : range (number of bits for exponent)
- $B$ : bias (offset for exponent, usually midway: $B = 2^{R-1}-1$).
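The formulas above can be checked against real values. The following is a minimal sketch, assuming Python's `float` is an IEEE double ($P = 53$, $R = 11$); `decompose` and `recompose` are hypothetical helper names, and only positive finite inputs are handled.

```python
import math

P, R = 53, 11              # double precision parameters (assumed)
B = 2 ** (R - 1) - 1       # bias: 1023


def decompose(x: float) -> tuple[int, int]:
    """Return (M, E) for a positive finite double x.

    Normal numbers satisfy x == M * 2**(E - B - (P - 1)) with E >= 1.
    Denormalized numbers are reported with E == 0 and use the scale of E == 1.
    """
    if not (x > 0.0 and math.isfinite(x)):
        raise ValueError("expected a positive finite value")
    frac, exp = math.frexp(x)            # x == frac * 2**exp, 0.5 <= frac < 1
    E = exp + B - 1
    if E >= 1:                           # normal number
        return int(math.ldexp(frac, P)), E
    # Denormalized: divide by the fixed scale 2**(1 - B - (P - 1)).
    return int(math.ldexp(x, B - 1 + (P - 1))), 0


def recompose(M: int, E: int) -> float:
    """Evaluate M * 2**(E - B - (P - 1)), treating E == 0 as denormalized."""
    return math.ldexp(M, max(E, 1) - B - (P - 1))


print(decompose(1.0))      # M = 2**52, E = 1023
print(decompose(5e-324))   # smallest denormal: M = 1, E = 0
```

Here `1.0` comes out as $M = 2^{52}$, $E = 1023$, and the smallest denormal as $M = 1$ with the exponent field at zero.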

# 2 Standard Types

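| Name | Size (bits) | Precision $P$ (bits) | Range $R$ (bits) | Smallest denormal | Smallest normal | Largest |
|---|---|---|---|---|---|---|
| half | 16 | 11 | 5 | 6.0e-8 | 6.10e-5 | 65504.0 |
| single | 32 | 24 | 8 | 1.4e-45 | 1.18e-38 | 3.40e38 |
| double | 64 | 53 | 11 | 4.9e-324 | 2.23e-308 | 1.80e308 |
| x87 long double | 80 | 64 | 15 | 3.6e-4951 | 3.36e-4932 | 1.19e4932 |
| quad double | 128 | 113 | 15 | 6.5e-4966 | 3.36e-4932 | 1.19e4932 |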
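The three limit columns follow from $P$ and $R$ alone. The sketch below derives them with exact rational arithmetic, assuming the usual midway bias $B = 2^{R-1}-1$; `limits` and `FORMATS` are names introduced here for illustration.

```python
from fractions import Fraction


def limits(P: int, R: int):
    """Return (smallest denormal, smallest normal, largest) as exact Fractions."""
    B = 2 ** (R - 1) - 1                       # midway bias (assumed)
    two = Fraction(2)
    smallest_denormal = two ** (1 - B - (P - 1))    # M = 1, denormal scale
    smallest_normal = two ** (1 - B)                # M = 2**(P-1), E = 1
    largest = (2 ** P - 1) * two ** ((2 ** R - 2) - B - (P - 1))  # max M, max E
    return smallest_denormal, smallest_normal, largest


# Only formats whose limits fit in a Python float are printed here.
FORMATS = {"half": (11, 5), "single": (24, 8), "double": (53, 11)}
for name, (P, R) in FORMATS.items():
    d, n, m = (float(v) for v in limits(P, R))
    print(f"{name:8s} {d:.2e} {n:.2e} {m:.2e}")
```

The same function accepts the x87 and quad parameters; their results exceed the range of a double, so they would have to stay as `Fraction` values or be formatted with a wider type.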

# 3 Data Format

Sign bit.

- clear: positive
- set: negative

Exponent bits.

- all clear: zero (mantissa zero) or denormalized (mantissa nonzero)
- all set: infinity (mantissa zero) or NaN (mantissa nonzero)
- otherwise: normalized number

Mantissa bits.

- leading 1 is implicit (not stored) for normalized numbers
- denormalized numbers have no implicit leading 1: their highest set bit is stored explicitly in the mantissa field
- negative numbers are not complemented (just change sign bit)
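The field layout described above can be inspected directly. This is a minimal sketch for the 64-bit double format only (1 sign bit, 11 exponent bits, 52 stored mantissa bits), assuming Python's `float` is an IEEE double; `fields` is a hypothetical helper name.

```python
import struct


def fields(x: float):
    """Split a double into (sign, exponent, mantissa, kind) from its raw bits."""
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]   # reinterpret as uint64
    sign = bits >> 63
    exponent = (bits >> 52) & 0x7FF           # 11 exponent bits
    mantissa = bits & ((1 << 52) - 1)         # 52 stored mantissa bits
    if exponent == 0:                         # all clear
        kind = "zero" if mantissa == 0 else "denormal"
    elif exponent == 0x7FF:                   # all set
        kind = "infinity" if mantissa == 0 else "nan"
    else:
        kind = "normal"
    return sign, exponent, mantissa, kind


for value in (1.0, -0.0, 5e-324, float("inf"), float("nan")):
    print(value, fields(value))
```

Negating a value only flips the sign field: `fields(2.0)` and `fields(-2.0)` differ in the first element and nowhere else.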

Note: x87 long double is different: it always stores the leading mantissa bit explicitly.

# 4 `long double`

In the C programming language, `long double` is implementation-defined.

The language standard says only that it has at least as much range and precision as `double`. On some platforms `long double` has the same format as `double` (MSVC, armv7). Others use the x87 format (i386, x86_64/amd64), which may be padded to 96 bits or 128 bits, depending on compiler options and architecture. Still other platforms give `long double` the same format as `_Float128`, which is another name for quad double (aarch64).
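One quick way to see what the local platform does is to ask for the storage size of the C type. The sketch below uses Python's standard `ctypes` module rather than C itself; it reports what the C compiler that built the Python interpreter uses, which is assumed (not guaranteed) to match the system's default C ABI.

```python
import ctypes

# Storage sizes in bytes, including any padding.
double_bytes = ctypes.sizeof(ctypes.c_double)
long_double_bytes = ctypes.sizeof(ctypes.c_longdouble)

print(f"double:      {double_bytes * 8} bits")
print(f"long double: {long_double_bytes * 8} bits")
```

The size alone does not identify the format: 128 bits may be a padded x87 value or a true quad double, so a precision check (for example `LDBL_MANT_DIG` from `<float.h>` in C) is still needed to tell them apart.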
