### 5DV008 Computer Architecture

Umeå University, Department of Computing Science

Stephen J. Hegner

Topic 4: The Processor

Part A: Basic control

These slides are mostly taken verbatim, or with minor changes, from those prepared by Mary Jane Irwin (www.cse.psu.edu/~mji) of The Pennsylvania State University

[Adapted from Computer Organization and Design, 4<sup>th</sup> Edition, Patterson & Hennessy, © 2008, MK]

5DV008 20092411 t:04A sl:1

Irwin CSE431 PSU

### Key to the Slides

- The source of each slide is coded in the footer on the right side:
  - Irwin CSE331 = slide by Mary Jane Irwin from the course CSE331 (Computer Organization and Design) at Pennsylvania State University.
  - Irwin CSE431 = slide by Mary Jane Irwin from the course CSE431 (Computer Architecture) at Pennsylvania State University.
  - Hegner UU = slide by Stephen J. Hegner at Umeå University.


Hegner UU

### Review: MIPS (RISC) Design Principles

- Simplicity favors regularity
  - fixed size instructions
  - small number of instruction formats
  - opcode always the first 6 bits
- Smaller is faster
  - limited instruction set
  - limited number of registers in register file
  - limited number of addressing modes
- Make the common case fast
  - arithmetic operands from the register file (load-store machine)
  - allow instructions to contain immediate operands
- Good design demands good compromises
  - three instruction formats


### The Processor: Datapath & Control

- Our implementation of the MIPS is simplified
  - memory-reference instructions: lw, sw
  - arithmetic-logical instructions: add, sub, and, or, slt
  - control flow instructions: beq, j
- Generic implementation
  - use the program counter (PC) to supply the instruction address and fetch the instruction from memory (and update the PC)
  - decode the instruction (and read registers)
  - execute the instruction
- All instructions (except j) use the ALU after reading the registers
  - How? memory-reference? arithmetic? control flow?










### Executing Load and Store Operations

- Load and store operations involve
  - computing the memory address by adding the base register (read from the Register File during decode) to the 16-bit sign-extended offset field in the instruction
  - store: the value (read from the Register File during decode) is written to the Data Memory
  - load: the value, read from the Data Memory, is written to the Register File
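The address calculation described above can be sketched in a few lines of Python. This is only an illustration, not part of the original slides: the function names `sign_extend_16` and `effective_address` are hypothetical, and the sketch assumes 32-bit addresses that wrap around on overflow.

```python
def sign_extend_16(value: int) -> int:
    """Interpret the low 16 bits of value as a two's-complement number."""
    value &= 0xFFFF                      # keep only the 16-bit offset field
    return value - 0x10000 if value & 0x8000 else value


def effective_address(base: int, offset_field: int) -> int:
    """Base register plus sign-extended offset, wrapped to 32 bits (assumed)."""
    return (base + sign_extend_16(offset_field)) & 0xFFFFFFFF


# Positive offset: lw $t0, 8($s0) with $s0 = 0x10000000
print(hex(effective_address(0x10000000, 0x0008)))
# Negative offset: the field 0xFFFC encodes -4
print(hex(effective_address(0x10000000, 0xFFFC)))
```

The same adder result is used by both lw and sw; only the direction of the Data Memory transfer differs.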




































### Instruction Critical Paths

- What is the clock cycle time, assuming negligible delays for muxes, control unit, sign extend, PC access, shift left 2, wires, and setup and hold times, so that only the following delays count?
  - Instruction and Data Memory (200 ps)
  - ALU and adders (200 ps)
  - Register File access (reads or writes) (100 ps)

| Instr. | I Mem | Reg Rd | ALU Op | D Mem | Reg Wr | Total (ps) |
|--------|-------|--------|--------|-------|--------|------------|
| R-type | 200   | 100    | 200    |       | 100    | 600        |
| load   | 200   | 100    | 200    | 200   | 100    | 800        |
| store  | 200   | 100    | 200    | 200   |        | 700        |
| beq    | 200   | 100    | 200    |       |        | 500        |
| jump   | 200   |        |        |       |        | 200        |

- The clock cycle must be long enough for the slowest instruction (load), so the single-cycle clock cycle time is 800 ps
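The totals in the table can be reproduced by summing the delays of the units each instruction class uses. The sketch below is an illustration only; the dictionary names `UNIT_DELAY_PS` and `UNITS_USED` and the function `instruction_time_ps` are hypothetical, and the delays are the ones assumed on this slide.

```python
# Delay of each functional unit in picoseconds (values from the slide).
UNIT_DELAY_PS = {
    "I Mem": 200, "Reg Rd": 100, "ALU Op": 200, "D Mem": 200, "Reg Wr": 100,
}

# Units on the critical path of each instruction class (from the table).
UNITS_USED = {
    "R-type": ["I Mem", "Reg Rd", "ALU Op", "Reg Wr"],
    "load":   ["I Mem", "Reg Rd", "ALU Op", "D Mem", "Reg Wr"],
    "store":  ["I Mem", "Reg Rd", "ALU Op", "D Mem"],
    "beq":    ["I Mem", "Reg Rd", "ALU Op"],
    "jump":   ["I Mem"],
}


def instruction_time_ps(instr: str) -> int:
    """Sum the delays of the units used by one instruction class."""
    return sum(UNIT_DELAY_PS[unit] for unit in UNITS_USED[instr])


totals = {instr: instruction_time_ps(instr) for instr in UNITS_USED}
# A single-cycle design must fit the slowest instruction in one cycle.
clock_cycle_ps = max(totals.values())

print(totals)
print("clock cycle:", clock_cycle_ps, "ps")
```

Taking the maximum rather than the average is the point of the exercise: every instruction pays for the load's critical path.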




### How Can We Make It Faster?

- Start fetching and executing the next instruction before the current one has completed
  - Pipelining: (all?) modern processors are pipelined for performance
  - Remember the performance equation: CPU time = CPI \* CC \* IC
- Under ideal conditions and with a large number of instructions, the speedup from pipelining is approximately equal to the number of pipe stages
  - A five-stage pipeline is nearly five times as fast because the CC is nearly five times shorter
- Fetch (and execute) more than one instruction at a time
  - Superscalar processing: stay tuned
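The claim that the ideal speedup approaches the number of pipe stages can be checked with a short derivation. The symbols are assumptions introduced here, not taken from the slides: $n$ is the number of instructions, $k$ the number of perfectly balanced stages, and $T$ the unpipelined time per instruction, so the pipelined clock cycle is $T/k$ and there are no stalls.

$$
\begin{aligned}
t_{\text{unpipelined}} &= n \cdot T \\
t_{\text{pipelined}} &= (k + n - 1) \cdot \frac{T}{k} \\
\text{speedup} &= \frac{t_{\text{unpipelined}}}{t_{\text{pipelined}}} = \frac{n \cdot k}{n + k - 1} \;\longrightarrow\; k \quad \text{as } n \to \infty
\end{aligned}
$$

The term $k - 1$ is the time to fill the pipeline before the first instruction completes; it becomes negligible only when $n$ is much larger than $k$.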








- for some instructions, some stages are wasted cycles (i.e., nothing is done during that cycle for that instruction)



To complete an entire instruction in the pipelined case takes 1000 ps (as compared to 800 ps for the single cycle case). Why?

How long does each take to complete 1,000,000 adds?
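One way to work through both questions is the sketch below. It assumes a five-stage pipeline whose clock is set by the slowest stage (200 ps, which gives the 5 × 200 ps = 1000 ps per instruction quoted above), no stalls, and an 800 ps single-cycle clock. The constant and function names are hypothetical.

```python
STAGES = 5                 # assumed five-stage pipeline
SINGLE_CYCLE_CC_PS = 800   # single-cycle clock, set by the slowest instruction
PIPELINED_CC_PS = 200      # pipelined clock, set by the slowest stage


def single_cycle_time_ps(n: int) -> int:
    """Each instruction occupies one long clock cycle."""
    return n * SINGLE_CYCLE_CC_PS


def pipelined_time_ps(n: int) -> int:
    """STAGES cycles for the first instruction, then one per cycle."""
    return (STAGES + n - 1) * PIPELINED_CC_PS


n = 1_000_000
single = single_cycle_time_ps(n)
piped = pipelined_time_ps(n)
print("single cycle:", single, "ps")
print("pipelined:   ", piped, "ps")
print("speedup:      %.5f" % (single / piped))
```

Under these assumptions the speedup is close to 4 rather than 5, because the stages are not balanced: five 200 ps stages take longer than the 800 ps single-cycle path.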

### Pipelining the MIPS ISA

- What makes it easy
  - all instructions are the same length (32 bits)
    - can fetch in the 1<sup>st</sup> stage and decode in the 2<sup>nd</sup> stage
  - few instruction formats (three) with symmetry across formats
    - can begin reading register file in 2<sup>nd</sup> stage
  - memory operations occur only in loads and stores
    - can use the execute stage to calculate memory addresses
  - each instruction writes at most one result (i.e., changes the machine state) and does it in the last few pipeline stages (MEM or WB)
  - operands must be aligned in memory so a single data transfer takes only one data memory access










- IF Stage: read Instr Memory (always asserted) and write PC (on System Clock)
- ID Stage: no optional control signals to set

| Instr. | RegDst (EX) | ALUOp1 (EX) | ALUOp0 (EX) | ALUSrc (EX) | Branch (MEM) | MemRead (MEM) | MemWrite (MEM) | RegWrite (WB) | MemtoReg (WB) |
|--------|-------------|-------------|-------------|-------------|--------------|---------------|----------------|---------------|---------------|
| R      | 1           | 1           | 0           | 0           | 0            | 0             | 0              | 1             | 0             |
| lw     | 0           | 0           | 0           | 1           | 0            | 1             | 0              | 1             | 1             |
| sw     | X           | 0           | 0           | 1           | 0            | 0             | 1              | 0             | X             |
| beq    | X           | 0           | 1           | 0           | 1            | 0             | 0              | 0             | X             |




Can help with answering questions like:

- How many cycles does it take to execute this code?
- What is the ALU doing during cycle 4?
- Is there a hazard, why does it occur, and how can it be fixed?




### Can Pipelining Get Us Into Trouble?

- Yes: Pipeline Hazards
  - structural hazards: attempt to use the same resource by two different instructions at the same time
  - data hazards: attempt to use data before it is ready
    - an instruction's source operand(s) are produced by a prior instruction still in the pipeline
  - control hazards: attempt to make a decision about program control flow before the condition has been evaluated and the new PC target address calculated
    - branch and jump instructions, exceptions
- Can usually resolve hazards by waiting
  - pipeline control must detect the hazard
  - and take action to resolve hazards
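Detecting a data hazard amounts to checking whether an instruction reads a register that a nearby earlier instruction has not yet written back. The sketch below is a simplified illustration, not the detection logic of any particular processor: the `Instr` record and `find_raw_hazards` are hypothetical, and the `window` of 2 is an assumption (a five-stage pipeline without forwarding, where the register file can be written and then read in the same cycle).

```python
from typing import List, NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    text: str               # assembly text, for display only
    dest: Optional[str]     # register written, or None
    srcs: Tuple[str, ...]   # registers read


def find_raw_hazards(program: List[Instr], window: int = 2):
    """Return (producer index, consumer index, register) for each
    read-after-write dependence within `window` instructions."""
    hazards = []
    for i, consumer in enumerate(program):
        for j in range(max(0, i - window), i):
            producer = program[j]
            if producer.dest is not None and producer.dest in consumer.srcs:
                hazards.append((j, i, producer.dest))
    return hazards


program = [
    Instr("add $1, $2, $3", "$1", ("$2", "$3")),
    Instr("sub $4, $1, $5", "$4", ("$1", "$5")),
    Instr("and $6, $1, $7", "$6", ("$1", "$7")),
    Instr("or  $8, $1, $9", "$8", ("$1", "$9")),
]
print(find_raw_hazards(program))
```

Here sub and and both read $1 too soon after add writes it, while or is far enough behind to read the updated value. Waiting (stalling) for the required number of cycles is the simplest resolution.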






























### Summary


- All modern-day processors use pipelining
- Pipelining doesn't help the latency of a single task; it helps the throughput of the entire workload
- Potential speedup: a CPI of 1 and a fast CC
- Pipeline rate is limited by the slowest pipeline stage
  - Unbalanced pipe stages make for inefficiencies
  - The time to "fill" the pipeline and the time to "drain" it can impact speedup for deep pipelines and short code runs
- Must detect and resolve hazards
  - Stalling negatively affects CPI (makes CPI greater than the ideal of 1)
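The effect of stalling on CPI can be plugged into the performance equation CPU time = CPI \* CC \* IC. The sketch below uses made-up numbers purely for illustration: the stall rate of 0.2 cycles per instruction, the 200 ps clock, and the function names are all assumptions, not figures from the slides.

```python
def effective_cpi(stall_cycles_per_instr: float, ideal_cpi: float = 1.0) -> float:
    """Ideal pipelined CPI plus the average stall cycles per instruction."""
    return ideal_cpi + stall_cycles_per_instr


def cpu_time_ps(ic: int, cpi: float, cc_ps: float) -> float:
    """Performance equation: CPU time = IC * CPI * CC."""
    return ic * cpi * cc_ps


# Hypothetical workload: 1,000,000 instructions, 200 ps clock,
# and on average 0.2 stall cycles per instruction.
cpi = effective_cpi(0.2)
print("CPI:", cpi)
print("CPU time (ps):", cpu_time_ps(1_000_000, cpi, 200))
print("ideal CPU time (ps):", cpu_time_ps(1_000_000, 1.0, 200))
```

Every stall cycle raises CPI above 1, so the CPU time grows in direct proportion even though the clock cycle is unchanged.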
