source: IEEE Standard 754 Floating-Point

problems of fixed-point representation:

has a fixed window of representation → limit representing very large and very small numbers
prone to loss of precision during division on large numbers

floating-point representation:

uses scientific notation, with a base number and an exponent:

man t i ss a \cdot ba s e^{e x p o n e n t}

employs a sort of “dynamic window” of precision appropriate to the scale of the number: the smaller the the exponent, the higher the precision

storage layout

[sign (1)][exponent (5/8/11)][fraction (11/23/52)]

Helf (16bit):   S|EEEEE|FF FFFFFFFF
Single (32bit): S|EEEEEEE E|FFFFFFF FFFFFFFF FFFFFFFF
Double (64bit): S|EEEEEEE EEEE|FFFF FFFFFFFF FFFFFFFF FFFFFFFF FFFFFFFF FFFFFFFF FFFFFFFF

sign: 0 = positive, 1 = negative
exponent:
- use offset binary with bias = 127 (2^7 - 1) for 32bit and 1023 (2^10 - 1) for 64bit
- all 0s (-127/-1023) and all 1s (+128/+1024) are reserved for special numbers.
mantissa:
- aka significand
- composed of the fraction and an implicit leading digit
- usually stored in normalized form:
  - puts the radix point after the 1st non-zero digit
  - in binary, there is only 1 possible non-zero digit → assume the leading digit of 1 for optimization → 32bit number effectively has 24 bits of mantissa (53 for 64bit).
- There is also denormalized formed

Unrepresentable ranges

negatives number less than −(2−2^−23) × 2^127 (negative overflow)
negative numbers greater than −2^−149 (negative underflow)
zero
positive numbers less than 2^−149 (positive underflow)
positive numbers greater than (2−2^−23) × 2^127 (positive overflow)

special numbers

denormalized

aka subnormal (denormal numbers are renamed subnormal numbers in IEEE 754-2008)
When the exponent is all 0s
Now has an assumed leading 0
the exponent is now -126 for 32bit (instead of -127)
- Why ? Because normalized form closest to 0 at +-2^-126
loses precision as it gets smaller (as left bits of the fraction become 0s)
Operations on denormalized floating-point can be tens to hundreds of times slower than on normalized floating-point (source in 2012)

zero

a denormalized number (exponent is all 0s) with fraction is all 0s (including implicit leading 0)
-0 and +0 are distinct values but compare as equal.

infinity

denoted with exponent of all 1s
fraction of all 0s
Being able to denote infinity as a specific value is useful because it allows operations to continue past overflow situations.
Operations with infinite values are well defined in IEEE floating point.

NaN

not a number
exponents of all 1s
non-zero fraction
2 type of NaN:
- QNaN (quiet NaN): denotes indeterminate operations
  - most significant fraction bit = 1
  - propagate freely through most arithmetic operations (the program won’t complain)
  - generated from an operation when the result is not mathematically defined
- SNaN (signalling NaN): denotes invalid operations
  - most significant fraction bit = 0
  - used to signal an exception when used in operations
  - can be handy to assign to uninitialized variables to trap premature usage.

sample questions

1. What is the smallest positive non-zero value that can be represented as a normalized double-precision number? How much larger is the next smallest value?

[sign (1)][exponent (8/11)][fraction (23/52)]

	smallest: 2^-1022
	- fraction: all 0s -> 1
	- exponent: only the least significant bit is 1 -> -1022
	
	next smallest: (1 + 2^-52) * 2^-1022
	- fraction: only least significant bit is 1 -> 1 + 2^-52
	- exponent: only the least significant bit is 1 -> -1022

2. What is the smallest positive non-zero value that can be represented as a denormalized double-precision number? How much greater than this is the next smallest value?

	smallest: 2^-52 * 2^-1022 = 2^-1074
	- fraction: only least significant bit is 1 -> 2^-52
	- exponent: all 0s -> -1022
	
	next smallest:2^-1073
	- fraction: only second least significant bit is 1 -> 2^-51
	- exponent: all 0s -> -1022

Lone's notes

Recently Updated

Diffusion language modeling with maximum semantic likelihood

distill from AR LM to diffusion LM

LLM generation is path finding in activation space, each decoder block's processing is taking a step in said space

AI resources

Controlling reasoning duration with activation steering

All notes

IEEE 754 floating-point

problems of fixed-point representation:

floating-point representation:

storage layout

Unrepresentable ranges

special numbers

denormalized

zero

infinity

NaN

sample questions

Graph View

Table of Contents

Backlinks