## 5DV008

## Computer Architecture

Umeå University
Department of Computing Science
Stephen J. Hegner

## Topic 3: Arithmetic

These slides are mostly taken verbatim, or with minor changes, from those prepared by
Mary Jane Irwin (www.cse.psu.edu/~mji) of The Pennsylvania State University
[Adapted from Computer Organization and Design, 4th Edition, Patterson & Hennessy, © 2008, MK]
11/16/10
5DV008 20101611 t3 sl:1

## Key to the Slides

- The source of each slide is coded in the footer on the right side:
  - Irwin CSE331 = slide by Mary Jane Irwin from the course CSE331 (Computer Organization and Design) at Pennsylvania State University.
  - Irwin CSE431 = slide by Mary Jane Irwin from the course CSE431 (Computer Architecture) at Pennsylvania State University.
  - Hegner UU = slide by Stephen J. Hegner at Umeå University.


## Review: MIPS (RISC) Design Principles

- Simplicity favors regularity
  - fixed size instructions
  - small number of instruction formats
  - opcode always the first 6 bits
- Smaller is faster
  - limited instruction set
  - limited number of registers in register file
  - limited number of addressing modes
- Make the common case fast
  - arithmetic operands from the register file (load-store machine)
  - allow instructions to contain immediate operands
- Good design demands good compromises
  - three instruction formats

$3 \quad$ Irwin CSE431 PSU


## Number Representations

- 32-bit signed numbers (2's complement)

- Converting < 32-bit values into 32-bit values
  - copy the most significant bit (the sign bit) into the "empty" bits

$$
\begin{array}{lcl}
0010 & \rightarrow & 0000\,0010 \\
1010 & \rightarrow & 1111\,1010
\end{array}
$$

- sign extend versus zero extend (lb vs. lbu)
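A minimal Python sketch of the rule above, working on bit patterns held in Python ints. The helpers `sign_extend` and `zero_extend` are hypothetical names; they mimic what `lb` and `lbu` do to the upper bits of a loaded byte, and the 4-bit to 8-bit calls reproduce the two examples.

```python
def sign_extend(value: int, bits: int, width: int = 32) -> int:
    """Widen a `bits`-wide pattern to `width` bits by copying its sign bit into the empty upper bits."""
    value &= (1 << bits) - 1                      # keep only the original `bits` bits
    if value & (1 << (bits - 1)):                 # sign bit is 1: fill the upper bits with 1s
        value |= ((1 << width) - 1) ^ ((1 << bits) - 1)
    return value


def zero_extend(value: int, bits: int) -> int:
    """Widen a `bits`-wide pattern by filling the upper bits with 0s (as lbu does)."""
    return value & ((1 << bits) - 1)


print(format(sign_extend(0b0010, 4, 8), "08b"))               # 00000010
print(format(sign_extend(0b1010, 4, 8), "08b"))               # 11111010
print(hex(sign_extend(0x80, 8)), hex(zero_extend(0x80, 8)))   # 0xffffff80 0x80
```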


## MIPS Arithmetic Logic Unit (ALU)

- Must support the Arithmetic/Logic operations of the ISA
  - add, addi, addiu, addu
  - sub, subu
  - mult, multu, div, divu
  - sqrt
  - and, andi, nor, or, ori, xor, xori
  - beq, bne, slt, slti, sltiu, sltu
- With special handling for
  - sign extend: addi, addiu, slti, sltiu
  - zero extend: andi, ori, xori
  - overflow detection: add, addi, sub


## Dealing with Overflow

- Overflow occurs when the result of an operation cannot be represented in 32 bits, i.e., when the sign bit holds a value bit of the result rather than the proper sign bit

- When adding operands with different signs or when subtracting operands with the same sign, overflow can never occur

| Operation | Operand A | Operand B | Result indicating overflow |
| :---: | :---: | :---: | :---: |
| $A+B$ | $\geq 0$ | $\geq 0$ | $<0$ |
| $A+B$ | $<0$ | $<0$ | $\geq 0$ |
| $A-B$ | $\geq 0$ | $<0$ | $<0$ |
| $A-B$ | $<0$ | $\geq 0$ | $\geq 0$ |

- MIPS signals overflow with an exception (also called an interrupt), an unscheduled procedure call; the EPC (exception program counter) contains the address of the instruction that caused the exception
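The four rows of the table translate directly into a check on the signs of the operands and of the wrapped 32-bit result. This is a software sketch with hypothetical helper names, assuming the operands are already in signed 32-bit range; adder hardware usually detects the same condition by comparing the carry into and the carry out of the sign bit.

```python
def to_signed32(x: int) -> int:
    """Interpret the low 32 bits of x as a two's complement value."""
    x &= 0xFFFFFFFF
    return x - (1 << 32) if x & 0x80000000 else x


def add_overflows(a: int, b: int) -> bool:
    """Rows 1-2: same-sign operands whose wrapped 32-bit sum has the other sign."""
    s = to_signed32(a + b)                # the wrapped result the adder actually produces
    return (a >= 0 and b >= 0 and s < 0) or (a < 0 and b < 0 and s >= 0)


def sub_overflows(a: int, b: int) -> bool:
    """Rows 3-4: different-sign operands whose wrapped 32-bit difference has B's sign."""
    d = to_signed32(a - b)
    return (a >= 0 and b < 0 and d < 0) or (a < 0 and b >= 0 and d >= 0)


print(add_overflows(2**31 - 1, 1))   # True: 0x7FFFFFFF + 1 wraps to -2**31
print(sub_overflows(-2**31, 1))      # True
print(add_overflows(5, -7))          # False: different signs never overflow on add
```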



## But What about Performance?

- Critical path of an $n$-bit ripple-carry adder is $n \times \mathrm{CP}$, where CP is the carry-propagation delay through a single 1-bit adder stage

- Design trick: throw hardware at it (carry lookahead)
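To make the $n \times \mathrm{CP}$ critical path concrete, the sketch below contrasts a ripple-carry adder, where each carry needs the previous one, with carry lookahead, where every carry of a block is a sum of products of generate signals $g_i = a_i b_i$, propagate signals $p_i = a_i + b_i$, and the carry-in. The function names and the 4-bit block size are assumptions for illustration. Python evaluates the carries one after another, but no carry is computed from another carry, which is what lets hardware form them all in parallel.

```python
from math import prod


def ripple_carry_add(a: int, b: int, n: int = 32, cin: int = 0):
    """n chained 1-bit full adders: carry i+1 needs carry i, so the carry chain is n stages long."""
    total, carry = 0, cin
    for i in range(n):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        total |= (ai ^ bi ^ carry) << i
        carry = (ai & bi) | (ai & carry) | (bi & carry)
    return total, carry


def lookahead_add(a: int, b: int, n: int = 4, cin: int = 0):
    """Each carry comes straight from g, p and cin (here + means OR):
    c_i = g_{i-1} + p_{i-1} g_{i-2} + ... + p_{i-1}...p_0 cin."""
    g = [((a >> i) & (b >> i)) & 1 for i in range(n)]   # bit position i creates a carry
    p = [((a >> i) | (b >> i)) & 1 for i in range(n)]   # bit position i passes a carry along
    carries = []
    for i in range(n + 1):
        c = cin & prod(p[:i])                           # cin propagated through all lower bits
        for k in range(i):
            c |= g[k] & prod(p[k + 1:i])                # carry generated at k, propagated to i
        carries.append(c)
    total = sum(((((a ^ b) >> i) & 1) ^ carries[i]) << i for i in range(n))
    return total, carries[n]


print(ripple_carry_add(0b1011, 0b0110, 4))   # (1, 1): 11 + 6 = 17 = 0b1_0001
print(lookahead_add(0b1011, 0b0110, 4))      # (1, 1)
```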



## Multiply

- Binary multiplication is just a bunch of right shifts and adds
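The shift-and-add idea as a short Python sketch (the function name is hypothetical; operands are assumed to be unsigned $n$-bit values): each step looks at the low bit of the multiplier, adds the multiplicand if that bit is 1, then shifts the multiplicand left and the multiplier right.

```python
def shift_add_multiply(multiplicand: int, multiplier: int, n: int = 32) -> int:
    """Unsigned n-bit multiply, one multiplier bit per step (paper-and-pencil method)."""
    product = 0
    for _ in range(n):
        if multiplier & 1:            # current multiplier bit is 1: add the shifted multiplicand
            product += multiplicand
        multiplicand <<= 1            # the next partial product is worth twice as much
        multiplier >>= 1              # right shift exposes the next multiplier bit
    return product


print(shift_add_multiply(0b1000, 0b1001, 4))   # 72 (8 * 9)
```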



## Add and Right Shift Multiplier Hardware
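A sketch of this hardware, assuming the refined organization described by Patterson & Hennessy: the multiplier starts in the right half of a $2n$-bit product register and the multiplicand is added into its left half. One iteration tests the rightmost product bit, conditionally adds, then shifts the whole register right. The function name is hypothetical; Python's unbounded ints stand in for the extra carry-out bit of the adder.

```python
def add_right_shift_multiply(multiplicand: int, multiplier: int, n: int = 32) -> int:
    """Unsigned multiply with a 2n-bit product register that starts with the multiplier in its right half."""
    mask = (1 << n) - 1
    product = multiplier & mask                      # left half 0, right half = multiplier
    for _ in range(n):
        if product & 1:                              # test the current multiplier bit
            product += (multiplicand & mask) << n    # add into the left half; a carry-out lands in bit 2n
        product >>= 1                                # shift the register (and any carry-out) right
    return product


print(add_right_shift_multiply(0b0010, 0b0011, 4))   # 6
```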






## MIPS Multiply Instruction

- Multiply (mult and multu) produces a double precision (64-bit) product

  `mult $s0, $s1   # hi||lo = $s0 * $s1`

| op | rs | rt | rd | shamt | funct |
| :--- | :--- | :--- | :--- | :--- | :--- |
| 0 | 16 | 17 | 0 | 0 | 0x18 |

- Low-order word of the product is left in processor register lo and the high-order word is left in register hi
- Instructions mfhi rd and mflo rd are provided to move the product to (user accessible) registers in the register file

Multiplies are usually done by fast, dedicated hardware and are much more complex (and slower) than adders
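How the 64-bit product ends up split across hi and lo can be modeled in a few lines of Python. Register contents are represented as unsigned 32-bit patterns, and `mult`/`multu` here are hypothetical model functions, not an existing API; reading the returned `hi` and `lo` corresponds to mfhi and mflo.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def to_signed32(x: int) -> int:
    """Interpret the low 32 bits of x as a two's complement value."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x


def mult(rs: int, rt: int):
    """Model of mult: signed 32 x 32 -> 64-bit product, returned as (hi, lo)."""
    product = (to_signed32(rs) * to_signed32(rt)) & MASK64   # 64-bit two's complement pattern
    return product >> 32, product & MASK32


def multu(rs: int, rt: int):
    """Model of multu: both operands treated as unsigned."""
    product = (rs & MASK32) * (rt & MASK32)
    return product >> 32, product & MASK32


hi, lo = mult(0xFFFFFFFF, 2)    # -1 * 2 = -2
print(hex(hi), hex(lo))         # 0xffffffff 0xfffffffe
hi, lo = multu(0xFFFFFFFF, 2)   # 4294967295 * 2
print(hex(hi), hex(lo))         # 0x1 0xfffffffe
```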

## Fast Multiplication Hardware

- Can build a faster multiplier by using a parallel tree of adders, with one 32-bit adder for each bit of the multiplier
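A sketch of the adder tree's data flow (hypothetical function name; unsigned operands assumed): one partial product per multiplier bit, then pairs summed level by level. For 32 partial products this takes $\log_2 32 = 5$ adder levels instead of 32 sequential add steps. It models only which additions can happen in parallel, not gate-level timing.

```python
def tree_multiply(multiplicand: int, multiplier: int, n: int = 32):
    """Unsigned multiply by summing the n partial products in a balanced tree of adders.
    Returns (product, number of adder levels on the critical path)."""
    # One partial product per multiplier bit: the multiplicand shifted into place, or 0
    level = [(multiplicand << i) if (multiplier >> i) & 1 else 0 for i in range(n)]
    depth = 0
    while len(level) > 1:
        # Every addition in a level is independent of the others, so hardware can do them in parallel
        level = [level[i] + level[i + 1] if i + 1 < len(level) else level[i]
                 for i in range(0, len(level), 2)]
        depth += 1
    return level[0], depth


print(tree_multiply(0xFFFFFFFF, 0xFFFFFFFF)[1])   # 5 adder levels for 32 partial products
```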



## Division

- Division is just a bunch of quotient digit guesses and left shifts and subtracts:

$$
\text{dividend} = \text{quotient} \times \text{divisor} + \text{remainder}
$$
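The quotient-digit-guess loop as a Python sketch for unsigned operands (function name hypothetical). It compares before subtracting; restoring hardware instead subtracts, and adds the divisor back when the result goes negative, which yields the same quotient and remainder.

```python
def shift_subtract_divide(dividend: int, divisor: int, n: int = 32):
    """Unsigned n-bit division: bring down one dividend bit at a time, subtract when the divisor fits.
    Returns (quotient, remainder)."""
    if divisor == 0:
        raise ZeroDivisionError("divisor is 0")
    quotient, remainder = 0, 0
    for i in range(n - 1, -1, -1):
        remainder = (remainder << 1) | ((dividend >> i) & 1)   # left shift in the next dividend bit
        quotient <<= 1
        if remainder >= divisor:          # quotient digit guess: 1 if the subtraction stays >= 0
            remainder -= divisor
            quotient |= 1
    return quotient, remainder


q, r = shift_subtract_divide(7, 2, 4)
print(q, r)                               # 3 1, and 7 == 3 * 2 + 1
```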



## Left Shift and Subtract Division Hardware



## MIPS Divide Instruction

- Divide (div and divu) generates the remainder in hi and the quotient in lo

  `div $s0, $s1   # lo = $s0 / $s1;  hi = $s0 mod $s1`

| op | rs | rt | rd | shamt | funct |
| :--- | :--- | :--- | :--- | :--- | :--- |
| 0 | 16 | 17 | 0 | 0 | 0x1A |

- Instructions mfhi rd and mflo rd are provided to move the quotient and remainder to (user accessible) registers in the register file
- As with multiply, divide ignores overflow, so software must determine whether the quotient is too large. Software must also check the divisor to avoid division by 0.
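The checks that the slide leaves to software can be written next to a model of div. This sketch assumes the quotient is truncated toward zero and the remainder takes the sign of the dividend (C-style semantics); `div` and `to_signed32` are hypothetical helpers. The only signed quotient too large for 32 bits is $-2^{31} / -1$.

```python
MASK32 = 0xFFFFFFFF


def to_signed32(x: int) -> int:
    """Interpret the low 32 bits of x as a two's complement value."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x


def div(rs: int, rt: int):
    """Model of div returning (hi, lo) = (remainder, quotient) as 32-bit patterns.
    The two checks at the top are the ones left to software."""
    a, b = to_signed32(rs), to_signed32(rt)
    if b == 0:
        raise ZeroDivisionError("divisor is 0")
    if a == -2**31 and b == -1:
        raise OverflowError("quotient 2**31 does not fit in 32 bits")
    q = abs(a) // abs(b)                  # magnitude of the quotient
    if (a < 0) != (b < 0):
        q = -q                            # truncate toward zero, not toward -infinity like Python's //
    r = a - q * b                         # remainder takes the sign of the dividend
    return r & MASK32, q & MASK32


hi, lo = div(-7 & MASK32, 2)
print(to_signed32(lo), to_signed32(hi))   # -3 -1
```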



- What if we want to encode the approx. age of the earth? 4,600,000,000 or $4.6 \times 10^{9}$
- Or the weight in kg of one a.m.u. (atomic mass unit)? $0.\underbrace{0\cdots0}_{26}166$ or $1.66 \times 10^{-27}$

There is no way we can encode either of the above in a 32-bit integer.

- Floating point representation: $(-1)^{\text{sign}} \times F \times 2^{E}$
- Still have to fit everything in 32 bits (single precision)

| S | $E$ (exponent) | $F$ (fraction) |
| :---: | :---: | :---: |
| 1 bit | 8 bits | 23 bits |

- The base (2, not 10) is hardwired in the design of the FPALU
- Allocating more bits to the fraction ($F$) or to the exponent ($E$) trades precision (accuracy of the number) against range (size of the number)
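Python's standard `struct` module can pack a value in IEEE single precision (`'f'` format), which makes the three fields of the layout above easy to inspect. The helper name is hypothetical; how the stored $E$ and $F$ map back to a value (bias, hidden bit) is defined by the IEEE 754 standard.

```python
import struct


def float32_fields(x: float):
    """Return the (S, E, F) fields of x after rounding it to IEEE single precision."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))   # the 32-bit pattern as an unsigned int
    sign = bits >> 31                  # 1 bit
    exponent = (bits >> 23) & 0xFF     # 8 bits
    fraction = bits & 0x7FFFFF         # 23 bits
    return sign, exponent, fraction


print(float32_fields(4.6e9))      # approx. age of the earth in years
print(float32_fields(1.66e-27))   # mass of one a.m.u. in kg
```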


## Exception Events in Floating Point

- Overflow (floating point) happens when a positive exponent becomes too large to fit in the exponent field
- Underflow (floating point) happens when a negative exponent becomes too large in magnitude to fit in the exponent field
- One way to reduce the chance of underflow or overflow is to offer another format that has a larger exponent field
- Double precision: takes two MIPS words

| S | $E$ (exponent) | $F$ (fraction) |
| :---: | :---: | :---: |
| 1 bit | 11 bits | 20 bits |

| $F$ (fraction continued) |
| :---: |
| 32 bits |
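The effect of the wider exponent field can be observed from Python, assuming (as on common platforms) that Python floats are IEEE double precision. Packing with the `struct` `'f'` format forces single precision: a value too large for the 8-bit exponent raises `OverflowError`, while one too small rounds to zero. The helper name is hypothetical.

```python
import struct


def to_float32(x: float) -> float:
    """Round a Python float (an IEEE double) to single precision and back."""
    return struct.unpack(">f", struct.pack(">f", x))[0]


print(to_float32(1e-50))     # 0.0: too small for the 8-bit exponent field
try:
    to_float32(1e39)         # too large for the 8-bit exponent field
except OverflowError as err:
    print("single precision overflow:", err)
print(1e39, 1e-50)           # both representable in double precision (11-bit exponent)
```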

## IEEE 754 FP Standard

- Most (all?) computers these days conform to the IEEE 754 floating point standard: $(-1)^{\text{sign}} \times (1+F) \times 2^{E-\text{bias}}$
- Formats for both single and double precision
- F is stored in normalized format, where the most significant bit of the significand $1+F$ is always 1, so there is no need to store it! This implicit bit is called the hidden bit
- To simplify sorting FP numbers, E comes before F in the word, and E is represented in excess (biased) notation where the bias is 127 (1023 for double precision), so the most negative exponent is $00000001 = 2^{1-127} = 2^{-126}$ and the most positive is $11111110 = 2^{254-127} = 2^{+127}$
- Examples (in normalized format, with the hidden bit written before the binary point)
  - Smallest+: 0 00000001 1.00000000000000000000000 $= 1 \times 2^{1-127}$
  - Zero: 0 00000000 00000000000000000000000 = true 0
  - Largest+: 0 11111110 1.11111111111111111111111 $= (2-2^{-23}) \times 2^{254-127}$
  - $1.0_{2} \times 2^{-1} =$
  - $0.75_{10} \times 2^{4} =$
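The formula $(-1)^{\text{sign}} \times (1+F) \times 2^{E-\text{bias}}$ can be evaluated directly from the bit fields, which also answers the two exercises above. A Python sketch with hypothetical helper names; it handles normalized numbers and true zero only, and cross-checks against the `struct` module.

```python
import struct

BIAS = 127   # single precision


def pack_fields(sign: int, exponent: int, fraction: int) -> int:
    """Assemble S | E | F into one 32-bit pattern."""
    return (sign << 31) | (exponent << 23) | fraction


def decode_single(bits: int) -> float:
    """Evaluate (-1)^sign * (1 + F) * 2^(E - bias) for a normalized single-precision pattern."""
    sign, e, f = bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF
    if e == 0 and f == 0:
        return -0.0 if sign else 0.0             # true zero
    if e == 0 or e == 0xFF:
        raise ValueError("denormals, infinities and NaNs use special encodings")
    return (-1) ** sign * (1 + f / 2**23) * 2.0 ** (e - BIAS)   # "1 +" restores the hidden bit


def matches_struct(bits: int) -> bool:
    """Cross-check against the platform's own IEEE single-precision decoding."""
    return decode_single(bits) == struct.unpack(">f", bits.to_bytes(4, "big"))[0]


print(decode_single(pack_fields(0, 0b01111110, 0)))        # 0.5  = 1.0_2 x 2^-1
print(decode_single(pack_fields(0, 0b10000010, 1 << 22)))  # 12.0 = 0.75_10 x 2^4
```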

```
\(\square\) Most (all?) computers these days conform to the IEEE 754 floating point standard \((-1)^{\text{sign}} \times (1+\mathrm{F}) \times 2^{\mathrm{E}-\text{bias}}\)
- Formats for both single and double precision
- F is stored in normalized format where the msb of the significand is 1 (so there is no need to store it!) - called the hidden bit
- To simplify sorting FP numbers, E comes before F in the word and E is represented in excess (biased) notation where the bias is 127 (1023 for double precision), so the most negative is \(00000001=2^{1-127}=2^{-126}\) and the most positive is \(11111110=2^{254-127}=2^{+127}\)
- Examples (in normalized format)
- Smallest+: 0 00000001 1.00000000000000000000000 \(=1 \times 2^{1-127}\)
- Zero: 0 00000000 00000000000000000000000 \(=\) true 0
- Largest+: 0 11111110 1.11111111111111111111111 \(=(2-2^{-23}) \times 2^{254-127}\)
- \(1.0_{2} \times 2^{-1}=\) 0 01111110 1.00000000000000000000000
- \(0.75_{10} \times 2^{4}=\) 0 10000010 1.10000000000000000000000

\section*{IEEE 754 FP Standard Encoding}
\(\square\) Special encodings are used to represent unusual events
- \(\pm\) infinity for division by zero
- NaN (not a number) for the results of invalid operations such as \(0 / 0\)
- True zero is the bit string all zero
\begin{tabular}{|c|c|c|c|l|}
\hline \multicolumn{2}{|c|}{Single Precision} & \multicolumn{2}{c|}{Double Precision} & Object Represented \\
\hline \(\mathrm{E}(8)\) & \(\mathrm{F}(23)\) & \(\mathrm{E}(11)\) & \(\mathrm{F}(52)\) & \\
\hline 00000000 & 0 & \(0000 \ldots 0000\) & 0 & true zero (0) \\
\hline 00000000 & nonzero & \(0000 \ldots 0000\) & nonzero & \(\pm\) denormalized number \\
\hline \begin{tabular}{c}
00000001 to 11111110 \\
(exponent \(-126\) to \(+127\))
\end{tabular} & anything & \begin{tabular}{c}
\(0000 \ldots 0001\) to \(1111 \ldots 1110\) \\
(exponent \(-1022\) to \(+1023\))
\end{tabular} & anything & \(\pm\) floating point number \\
\hline 11111111 & 0 & \(1111 \ldots 1111\) & 0 & \(\pm\) infinity \\
\hline 11111111 & nonzero & \(1111 \ldots 1111\) & nonzero & not a number (NaN) \\
\hline
\end{tabular}


\section*{Support for Accurate Arithmetic}
- IEEE 754 FP rounding modes
- Always round up (toward \(+\infty\) )
- Always round down (toward \(-\infty\) )
- Truncate
- Round to nearest even (when the Guard || Round || Sticky are 100) - always creates a 0 in the least significant (kept) bit of \(F\)
\(\square\) Rounding (except for truncation) requires the hardware to include extra F bits during calculations
- Guard bit - used to provide one F bit when shifting left to normalize a result (e.g., when normalizing \(F\) after division or subtraction)
- Round bit - used to improve rounding accuracy
- Sticky bit - used to support Round to nearest even; is set to a 1 whenever a 1 bit shifts (right) through it (e.g., when aligning F during addition/subtraction)

\[
F = 1.\underbrace{xx\cdots x}_{23\ \text{bits}}\; G\; R\; S
\]

\section*{Floating Point Addition}
\(\square\) Addition (and subtraction)
\[
\left( \pm F 1 \times 2^{E 1}\right)+\left( \pm F 2 \times 2^{E 2}\right)= \pm F 3 \times 2^{E 3}
\]
- Step 0: Restore the hidden bit in F1 and in F2
- Step 1: Align fractions by right shifting F2 by E1 - E2 positions (assuming E1 \(\geq\) E2), keeping track of (three of) the bits shifted out in G, R, and S
- Step 2: Add the resulting F2 to F1 to form F3
- Step 3: Normalize F3 (so it is in the form 1.XXXXX ...)

If \(F1\) and \(F2\) have the same sign \(\rightarrow F3 \in [1,4) \rightarrow\) if \(F3 \geq 2\), 1 bit right shift \(F3\) and increment E3 (check for overflow)
If \(F1\) and \(F2\) have different signs \(\rightarrow\) F3 may require many left shifts, each time decrementing E3 (check for underflow)
- Step 4: Round F3 and possibly normalize F3 again
- Step 5: Rehide the most significant bit of F3 before storing the result


\section*{Floating Point Addition Example}
\(\square\) Add
\(\left(0.5=1.0000 \times 2^{-1}\right)+\left(-0.4375=-1.1100 \times 2^{-2}\right)\)
- Step 0: Hidden bits restored in the representation above
- Step 1: Shift significand with the smaller exponent (1.1100) right until its exponent matches the larger exponent (so once)
- Step 2: Add significands
\[
1.0000+(-0.111)=1.0000-0.111=0.001
\]
- Step 3: Normalize the sum, checking for exponent over/underflow: \(0.001 \times 2^{-1}=0.010 \times 2^{-2}=\ldots=1.000 \times 2^{-4}\)
- Step 4: The sum is already rounded, so we're done
- Step 5: Rehide the hidden bit before storing

\section*{Floating Point Multiplication}
- Multiplication
\[
\left( \pm \mathrm{F} 1 \times 2^{\mathrm{E} 1}\right) \times\left( \pm \mathrm{F} 2 \times 2^{\mathrm{E} 2}\right)= \pm \mathrm{F} 3 \times 2^{\mathrm{E} 3}
\]
- Step 0: Restore the hidden bit in F1 and in F2
- Step 1: Add the two (biased) exponents and subtract the bias from the sum, so E1 + E2 - 127 = E3;
also determine the sign of the product (which depends on the signs of the operands, i.e., their most significant bits)
- Step 2: Multiply F1 by F2 to form a double precision F3
- Step 3: Normalize F3 (so it is in the form 1.XXXXX ...)

Since F1 and F2 come in normalized \(\rightarrow F3 \in [1,4) \rightarrow\) if \(F3 \geq 2\), 1 bit right shift F3 and increment E3
Check for overflow/underflow
- Step 4: Round F3 and possibly normalize F3 again
- Step 5: Rehide the most significant bit of F3 before storing the result



\section*{Floating Point Multiplication Example}
- Multiply
\(\left(0.5=1.0000 \times 2^{-1}\right) \times\left(-0.4375=-1.1100 \times 2^{-2}\right)\)
- Step 0: Hidden bits restored in the representation above
- Step 1: Add the exponents (without bias: \(-1+(-2)=-3\); with bias: \((-1+127)+(-2+127)-127=(-1-2)+(127+127-127)=-3+127=124\)); the operands have different signs, so the product is negative
- Step 2: Multiply the significands
\(1.0000 \times 1.110=1.110000\)
- Step 3: Normalize the product, checking for exponent over/underflow: \(1.110000 \times 2^{-3}\) is already normalized
- Step 4: The product is already rounded, so we're done
- Step 5: Rehide the hidden bit before storing

\section*{MIPS Floating Point Instructions}
\(\square\) MIPS has a separate Floating Point Register File
( \(\$ \mathrm{f} 0, \$ \mathrm{f} 1, \ldots, \$ \mathrm{f} 31\) ) (whose registers are used in pairs for double precision values) with special instructions to load to and store from them
lwc1 \$f1,54(\$s2) \#\$f1 = Memory[\$s2+54]
swc1 \$f1,58(\$s4) \#Memory[\$s4+58] = \$f1
\(\square\) And supports IEEE 754 single
add.s \$f2,\$f4,\$f6 \#\$f2 = \$f4 + \$f6
and double precision operations
add.d \$f2,\$f4,\$f6 \#\$f2||\$f3 = \$f4||\$f5 + \$f6||\$f7
similarly for sub.s, sub.d, mul.s, mul.d, div.s, div.d

\section*{MIPS Floating Point Instructions, Cont'd}
\(\square\) And floating point single precision comparison operations
\[
\begin{aligned}
& \text{c.x.s \$f2,\$f4 \#if(\$f2} < \text{\$f4) cond=1;} \\
& \text{else cond=0}
\end{aligned}
\]
where \(x\) may be eq, neq, lt, le, gt, ge
and double precision comparison operations
\[
\begin{aligned}
& \text{c.x.d \$f2,\$f4 \#if(\$f2||\$f3} < \text{\$f4||\$f5)} \\
& \text{cond=1; else cond=0}
\end{aligned}
\]

\(\square\) And floating point branch operations
\begin{tabular}{ll}
bc1t 25 & \#if (cond==1) go to PC+4+25\(\times\)4 \\
bc1f 25 & \#if (cond==0) go to PC+4+25\(\times\)4
\end{tabular}
\begin{tabular}{|l|c|c|l|c|c|}
\hline \multicolumn{6}{|l|}{Frequency of Common MIPS Instructions} \\
\hline \multicolumn{6}{|l|}{\(\square\) Only included those with \(>3 \%\) and \(>1 \%\)} \\
\hline & SPECint & SPECfp & & SPECint & SPECfp \\
\hline addu & 5.2\% & 3.5\% & add.d & 0.0\% & 10.6\% \\
\hline addiu & 9.0\% & 7.2\% & sub.d & 0.0\% & 4.9\% \\
\hline or & 4.0\% & 1.2\% & mul.d & 0.0\% & 15.0\% \\
\hline sll & 4.4\% & 1.9\% & add.s & 0.0\% & 1.5\% \\
\hline lui & 3.3\% & 0.5\% & sub.s & 0.0\% & 1.8\% \\
\hline lw & 18.6\% & 5.8\% & mul.s & 0.0\% & 2.4\% \\
\hline sw & 7.6\% & 2.0\% & l.d & 0.0\% & 17.5\% \\
\hline lbu & 3.7\% & 0.1\% & s.d & 0.0\% & 4.9\% \\
\hline beq & 8.6\% & 2.2\% & l.s & 0.0\% & 4.2\% \\
\hline bne & 8.4\% & 1.4\% & s.s & 0.0\% & 1.1\% \\
\hline slt & 9.9\% & 2.3\% & lhu & 1.3\% & 0.0\% \\
\hline slti & 3.1\% & 0.3\% & & & \\
\hline sltu & 3.4\% & 0.8\% & & & \\
\hline
\end{tabular}```
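The floating-point addition and multiplication steps, together with the Guard/Round/Sticky rounding rule, can be traced with a small Python model. It is a sketch, not an FPU: all names are hypothetical, operands are assumed nonzero, finite and exactly representable in single precision, exponents are kept unbiased, denormals and the exponent overflow/underflow checks of Step 3 are omitted, and round to nearest even is the only rounding mode. It reproduces both worked examples, $0.5 + (-0.4375)$ and $0.5 \times (-0.4375)$.

```python
import math

F_BITS = 23   # fraction bits in single precision; the hidden bit is bit 23 of a significand
EXTRA = 3     # guard, round and sticky bits kept below the fraction during addition


def unpack(x):
    """Step 0: split a nonzero float into (sign, unbiased exponent, significand 1.F as an int),
    with the hidden bit restored. Assumes x is exactly representable with F_BITS fraction bits."""
    m, e = math.frexp(abs(x))                    # abs(x) = m * 2**e with 0.5 <= m < 1
    return (1 if x < 0 else 0), e - 1, int(m * 2 ** (F_BITS + 1))


def to_float(sign, exp, sig):
    """Rebuild a Python float from (sign, unbiased exponent, significand 1.F)."""
    return (-1.0 if sign else 1.0) * sig * 2.0 ** (exp - F_BITS)


def round_to_nearest_even(value, extra):
    """Drop the low `extra` (>= 2) bits. G is the first dropped bit, R the second, S the OR of the rest."""
    kept = value >> extra
    g = (value >> (extra - 1)) & 1
    r = (value >> (extra - 2)) & 1
    s = int(value & ((1 << (extra - 2)) - 1) != 0)
    if g and (r or s or kept & 1):               # above halfway, or a tie (GRS = 100) with an odd kept bit
        kept += 1
    return kept


def finish(sign, exp, sig, extra):
    """Steps 3-4 for a significand carrying `extra` low-order bits: normalize, round, renormalize."""
    if sig == 0:
        return 0.0
    while sig >= 2 << (F_BITS + extra):          # F3 >= 2: shift right, increment exponent
        sig = (sig >> 1) | (sig & 1)             # a 1 shifted out is kept as a sticky bit
        exp += 1
    while sig < 1 << (F_BITS + extra):           # F3 < 1: shift left, decrement exponent
        sig <<= 1
        exp -= 1
    sig = round_to_nearest_even(sig, extra)      # step 4
    if sig == 2 << F_BITS:                       # rounding carried out of 1.11...1: normalize again
        sig >>= 1
        exp += 1
    return to_float(sign, exp, sig)              # step 5 (rehiding the bit) has no Python counterpart


def fp_add(x, y):
    (s1, e1, f1), (s2, e2, f2) = unpack(x), unpack(y)
    if e1 < e2:                                  # let operand 1 have the larger exponent
        (s1, e1, f1), (s2, e2, f2) = (s2, e2, f2), (s1, e1, f1)
    f1, f2 = f1 << EXTRA, f2 << EXTRA            # make room for G, R and S
    shift = e1 - e2                              # step 1: align F2
    sticky = int(f2 & ((1 << shift) - 1) != 0)   # any 1 shifted past S is remembered
    f2 = (f2 >> shift) | sticky
    total = (-f1 if s1 else f1) + (-f2 if s2 else f2)    # step 2
    return finish(1 if total < 0 else 0, e1, abs(total), EXTRA)


def fp_mul(x, y):
    (s1, e1, f1), (s2, e2, f2) = unpack(x), unpack(y)
    # step 1: add exponents (unbiased here; with biased fields E3 = E1 + E2 - 127) and XOR the signs
    # step 2: the double-width product has 2*F_BITS fraction bits; the low F_BITS act as extra bits
    return finish(s1 ^ s2, e1 + e2, f1 * f2, F_BITS)


print(fp_add(0.5, -0.4375))   # 0.0625 = 1.000 x 2^-4
print(fp_mul(0.5, -0.4375))   # -0.21875 = -1.110 x 2^-3
```

For example, `fp_add(1.0, 2.0 ** -24)` lands exactly halfway between two single-precision values (GRS = 100) and rounds to the even neighbor, 1.0.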

