Quantization

The quantization Q of a real-world value V is represented by a weighted sum of bits. Within the context of the general slope and bias encoding scheme, the approximate real-world value $\tilde{V}$ of an unsigned fixed-point quantity is given by

$\tilde{V} = S \cdot [\sum_{i = 0}^{w s - 1} b_{i} 2^{i}] + B,$

while the approximate real-world value of a signed fixed-point quantity is given by

$\tilde{V} = S \cdot [- b_{w s - 1} 2^{w s - 1} + \sum_{i = 0}^{w s - 2} b_{i} 2^{i}] + B,$

where

$b_{i}$ are binary digits, with $b_{i} \in \{0, 1\}$, for $i = 0, 1, ..., ws - 1$.
ws is the word size in bits, with ws = 1, 2, 3, ..., 65535.
S is the slope, given by $S = F \cdot 2^{E}$, where the scaling is unrestricted because the binary point does not have to be contiguous with the word. F is the slope adjustment factor and is a value in the range [1.0, 2.0). E is the fixed power-of-two exponent.

$b_{i}$ are called bit multipliers and $2^{i}$ are called the weights.
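
As a minimal sketch of the two formulas above, the following Python function evaluates the weighted sum for a list of bits. The name `real_world_value` and its parameter names are illustrative assumptions, not part of any library. The bits are assumed to be supplied least significant bit first, so that `bits[i]` corresponds to $b_{i}$.

```python
def real_world_value(bits, signed, F=1.0, E=0, B=0.0):
    """Approximate real-world value of a slope-and-bias fixed-point word.

    bits   -- sequence of 0/1 values, bits[i] = b_i (index 0 is the LSB)
    signed -- True for two's complement, False for unsigned
    F, E   -- slope adjustment factor and power-of-two exponent (S = F * 2**E)
    B      -- bias
    """
    ws = len(bits)  # word size in bits
    if ws < 1 or any(b not in (0, 1) for b in bits):
        raise ValueError("bits must be a non-empty sequence of 0s and 1s")
    if not 1.0 <= F < 2.0:
        raise ValueError("F must lie in the range [1.0, 2.0)")

    if signed:
        # The most significant bit carries the negative weight -2**(ws - 1).
        q = -bits[ws - 1] * 2 ** (ws - 1)
        q += sum(b * 2 ** i for i, b in enumerate(bits[: ws - 1]))
    else:
        q = sum(b * 2 ** i for i, b in enumerate(bits))

    S = F * 2.0 ** E  # slope
    return S * q + B


# A 2-bit word with both bits set, slope 1.5 and bias 1.
print(real_world_value([1, 1], signed=False, F=1.5, E=0, B=1.0))
print(real_world_value([1, 1], signed=True, F=1.5, E=0, B=1.0))
```

With both bits set, the unsigned stored integer is 3 and the signed stored integer is -1, so the two calls differ only in the weight given to the most significant bit.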

Fixed-Point Format

Formats for 8-bit signed and unsigned fixed-point values are shown in the following figure.

Note that you cannot discern whether these numbers are signed or unsigned data types merely by inspection since this information is not explicitly encoded within the word.
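
To illustrate that the word itself carries no sign information, the sketch below reads the same 8-bit patterns with Python's built-in `int.from_bytes`, once as an unsigned stored integer and once as a two's complement stored integer. The bit patterns are arbitrary examples chosen for this illustration.

```python
# One 8-bit word with the MSB set, and one with the MSB clear.
msb_set = bytes([0b11110000])
msb_clear = bytes([0b01110000])

# The same bits, interpreted under the two conventions.
unsigned_q = int.from_bytes(msb_set, byteorder="big", signed=False)
signed_q = int.from_bytes(msb_set, byteorder="big", signed=True)
print(unsigned_q, signed_q)  # 240 -16

# When the MSB is 0, both interpretations agree.
same_unsigned = int.from_bytes(msb_clear, byteorder="big", signed=False)
same_signed = int.from_bytes(msb_clear, byteorder="big", signed=True)
print(same_unsigned, same_signed)  # 112 112
```

The interpretation is supplied by the caller through the `signed` argument, not by the bytes, which is the point made above.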

The binary number 0011.0101 yields the same value for the unsigned and two's complement representations because the MSB = 0. Setting B = 0, F = 1, and E = -4 (four bits lie to the right of the binary point), and using the appropriate weights and bit multipliers, the value is

$\begin{aligned} \tilde{V} &= (F 2^{E}) Q = 2^{E} \left[\sum_{i = 0}^{ws - 1} b_{i} 2^{i}\right] \\ &= 2^{-4} (0 \times 2^{7} + 0 \times 2^{6} + 1 \times 2^{5} + 1 \times 2^{4} + 0 \times 2^{3} + 1 \times 2^{2} + 0 \times 2^{1} + 1 \times 2^{0}) \\ &= 3.3125. \end{aligned}$

Conversely, the binary number 1011.0101 yields different values for the unsigned and two's complement representations because the MSB = 1.

Again setting B = 0, F = 1, and E = -4, and using the appropriate weights and bit multipliers, the unsigned value is

$\begin{aligned} \tilde{V} &= (F 2^{E}) Q = 2^{E} \left[\sum_{i = 0}^{ws - 1} b_{i} 2^{i}\right] \\ &= 2^{-4} (1 \times 2^{7} + 0 \times 2^{6} + 1 \times 2^{5} + 1 \times 2^{4} + 0 \times 2^{3} + 1 \times 2^{2} + 0 \times 2^{1} + 1 \times 2^{0}) \\ &= 11.3125, \end{aligned}$

while the two's complement value is

$\begin{aligned} \tilde{V} &= (F 2^{E}) Q = 2^{E} \left[- b_{ws - 1} 2^{ws - 1} + \sum_{i = 0}^{ws - 2} b_{i} 2^{i}\right] \\ &= 2^{-4} (- 1 \times 2^{7} + 0 \times 2^{6} + 1 \times 2^{5} + 1 \times 2^{4} + 0 \times 2^{3} + 1 \times 2^{2} + 0 \times 2^{1} + 1 \times 2^{0}) \\ &= - 4.6875. \end{aligned}$
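
The three results above can be checked with a short script. This is a sketch under the same assumptions as the worked examples (B = 0 and F = 1). The helper `value_of` is a hypothetical name introduced here, and it infers E from the number of digits written after the binary point.

```python
def value_of(text, signed):
    """Evaluate a binary string such as '1011.0101' with B = 0 and F = 1."""
    int_part, _, frac_part = text.partition(".")
    digits = int_part + frac_part
    ws = len(digits)       # word size in bits
    E = -len(frac_part)    # binary point sits len(frac_part) bits from the right
    q = int(digits, 2)     # unsigned stored integer
    if signed and digits[0] == "1":
        # Two's complement: an MSB of 1 contributes -2**(ws - 1) instead of
        # +2**(ws - 1), which is the unsigned value minus 2**ws.
        q -= 2 ** ws
    return q * 2.0 ** E


for word in ("0011.0101", "1011.0101"):
    print(word, value_of(word, signed=False), value_of(word, signed=True))
```

The first word prints the same value twice, while the second prints 11.3125 for the unsigned reading and -4.6875 for the two's complement reading.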
