# 5DV118 Computer Organization and Architecture

Umeå University, Department of Computing Science

Stephen J. Hegner

**Topic 1: Introduction** 

These slides are mostly taken verbatim, or with minor changes, from those prepared by Mary Jane Irwin (www.cse.psu.edu/~mji) of The Pennsylvania State University.

[Adapted from Computer Organization and Design, 4th Edition, Patterson & Hennessy, © 2008, MK]

5DV118 20121106 ch:01 sl:1 Hegner UU

#### **Key to the Slides**

- The source of each slide is coded in the footer on the right side:
  - Irwin CSE331 = slide by Mary Jane Irwin from the course CSE331 (Computer Organization and Design) at Pennsylvania State University.
  - Irwin CSE431 = slide by Mary Jane Irwin from the course CSE431 (Computer Architecture) at Pennsylvania State University.
  - Hegner UU = slide by Stephen J. Hegner at Umeå University.


# **Quote for the Day**

"I got the idea for the mouse while attending a talk at a computer conference. The speaker was so boring that I started daydreaming and hit upon the idea."

Doug Engelbart

http://en.wikipedia.org/wiki/Douglas_Engelbart


# **Intel 4004 Microprocessor**



1971

0.2 MHz clock
3 mm² die
10,000 nm feature size
~2,300 transistors
2 mW power

# Moore's Law

In 1965, Gordon Moore, who later co-founded Intel, predicted that the number of transistors that can be integrated on a single chip would double every year; in 1975 he revised the rate to about every two years.

Figure: transistor count versus year of introduction, annotated with feature size and die size. Courtesy, Intel®

Note: Vertical scale of chart not proportional to actual transistor count.

# Intel Pentium (IV) Microprocessor



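2001, 30 (~2<sup>5</sup>) years after the 4004

| Pentium 4              | Change relative to the 4004         |
|------------------------|-------------------------------------|
| 1.7 GHz clock          | 8,500x faster                       |
| 271 mm<sup>2</sup> die | 90x bigger die                      |
| 180 nm feature size    | 55x smaller                         |
| ~42M transistors       | 18,000x more transistors            |
| 64 W power             | 32,000x (2<sup>15</sup>) more power |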
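As a rough cross-check of the doubling rate, the two transistor counts quoted on these slides can be turned into an average doubling period. This is a minimal sketch that assumes steady exponential growth between 1971 and 2001; the function name `doubling_time_years` is illustrative and not part of the original slides.

```python
import math


def doubling_time_years(start_count, end_count, years):
    """Average time per doubling, assuming steady exponential growth."""
    if start_count <= 0 or end_count <= start_count or years <= 0:
        raise ValueError("expected positive growth over a positive time span")
    doublings = math.log2(end_count / start_count)
    return years / doublings


# Figures quoted on the slides: ~2,300 transistors (4004, 1971)
# and ~42 million transistors (Pentium 4, 2001).
period = doubling_time_years(2_300, 42_000_000, 2001 - 1971)
print(f"average doubling time: {period:.2f} years")
```

With these inputs the result comes out a little over two years per doubling, which is in line with the two-year rule of thumb.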

# Technology scaling road map (ITRS)

| Year                | 2004 | 2006 | 2008 | 2010 | 2012 |
|---------------------|------|------|------|------|------|
| Feature size (nm)   | 90   | 65   | 45   | 32   | 22   |
| Integration capacity (billions of transistors) | 2    | 4    | 6    | 16   | 32   |

#### Fun facts about 45nm transistors

- 30 million can fit on the head of a pin
- You could fit more than 2,000 across the width of a human hair
- If car prices had fallen at the same rate as the price of a single transistor has since 1968, a new car today would cost about 1 cent

# **Another Example of Moore's Law Impact**

#### DRAM capacity growth over 3 decades



#### **But There Is Wirth's Law ...**

Niklaus Wirth, the famous software designer, once observed:

"Software is getting slower more rapidly than hardware becomes faster."

- There are a number of variants, attributed to people such as Larry Page and Bill Gates, among others.


# **But What Happened to Clock Rates and Why?**



# A Sea Change is at Hand

- The power challenge has forced a change in the design of microprocessors
  - Since 2002 the rate of improvement in the response time of programs on desktop computers has slowed from a factor of 1.5 per year to less than a factor of 1.2 per year
- As of 2006 all desktop and server companies are shipping microprocessors with multiple processors – cores – per chip

| Product        | AMD<br>Barcelona | Intel<br>Nehalem | IBM Power 6 | Sun Niagara<br>2 |
|----------------|------------------|------------------|-------------|------------------|
| Cores per chip | 4                | 4                | 2           | 8                |
| Clock rate     | 2.5 GHz          | ~2.5 GHz?        | 4.7 GHz     | 1.4 GHz          |
| Power          | 120 W            | ~100 W?          | ~100 W?     | 94 W             |

Plan of record is to double the number of cores per chip per generation (about every two years)

# AMD's Barcelona Multicore Chip (Sept. 2007)



- Four out-of-order cores on one chip
- 1.9 GHz clock rate
- 65 nm technology
- Three levels of caches (L1, L2, L3) on chip
- Integrated Northbridge

# The Oracle/(Sun) SPARC T4 (2011)

- Example of a modern high-end processor
  - 8 cores, up to 8 threads per core = 64 threads total
  - Scalability up to 4 sockets with no additional silicon necessary
  - Die size 403 mm²
  - Core size 15.4 mm<sup>2</sup>
  - 2.85-3.0 GHz
  - 40 nm
  - Caches:
    - L1: 16 KB instruction, 16 KB data, per core
    - L2: 128 KB, per core
    - L3: 4 MB, 8 banks, 16-way set-associative, unified
  - On-chip encryption hardware
  - On-chip PCIe
  - TDP (Thermal Design Power) 240 watts maximum


#### The AMD Fusion Series (2011 - )

- A series of APUs (Accelerated Processing Units)
- Processor(s) plus GPU (Graphics Processing Unit) on one chip
- Variants:

| Classification | Application | TDP      | Fab  | Cores | Clock GHz |
|----------------|-------------|----------|------|-------|-----------|
| Desktop        | workstation | 65-100 W | 32nm | 2-4   | 2.1-3.0   |
| Mobile         | laptop      | 35-45 W  | 32nm | 2-4   | 1.8-1.9   |
| Ultra-portable | netbook     | 9-18 W   | 40nm | 1-2   | 1.5-1.65  |

- Example: E-350 (codename Zacate), ultra-portable
  - January 2011
  - 2 cores, 1.6 GHz, 18W, 40 nm
  - 512KB 16-way L2 cache per core
  - HD 6310 GPU


#### **Performance Metrics**

- Purchasing perspective
  - given a collection of machines, which has the
    - best performance?
    - least cost?
    - best cost/performance?
- Design perspective
  - faced with design options, which has the
    - best performance improvement?
    - least cost?
    - best cost/performance?
- Both require
  - basis for comparison
  - metric for evaluation
- Our goal is to understand what factors in the architecture contribute to overall system performance and the relative importance (and cost) of these factors

# **Throughput versus Response Time**

- Response time (execution time): the time between the start and the completion of a task
  - Important to individual users
- Throughput (bandwidth): the total amount of work done in a given time
  - Important to data center managers
- Embedded and desktop computers, which are more focused on response time, need different performance metrics and a different set of benchmark applications than servers, which are more focused on throughput

# **Defining (Speed) Performance**

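- To maximize performance, we need to minimize execution time:

$$\text{performance}_X = \frac{1}{\text{execution time}_X}$$

- If X is n times as fast as Y, then

$$\frac{\text{performance}_X}{\text{performance}_Y} = \frac{\text{execution time}_Y}{\text{execution time}_X} = n$$

- Decreasing response time almost always improves throughput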


# **A Relative Performance Example**

If computer A runs a program in 10 seconds and computer B runs the same program in 15 seconds, how much faster is A than B?

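We know that A is n times as fast as B if

$$\frac{\text{performance}_A}{\text{performance}_B} = \frac{\text{execution time}_B}{\text{execution time}_A} = n$$

The performance ratio is

$$\frac{15}{10} = 1.5$$

So A is 1.5 times as fast as B (or A is 50% faster than B).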
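The ratio used in this example can be sketched in a few lines of Python. The function name `relative_performance` is an illustrative choice, not something defined in the slides; it simply divides the slower machine's execution time by the faster machine's.

```python
def relative_performance(time_x_s, time_y_s):
    """Return n such that X is n times as fast as Y.

    Performance is taken as the inverse of execution time,
    so n = execution_time_Y / execution_time_X.
    """
    if time_x_s <= 0 or time_y_s <= 0:
        raise ValueError("execution times must be positive")
    return time_y_s / time_x_s


# Computer A: 10 seconds, computer B: 15 seconds (values from the example).
n = relative_performance(10.0, 15.0)
print(f"A is {n:.2f} times as fast as B ({(n - 1) * 100:.0f}% faster)")
```

Note that swapping the arguments gives the reciprocal (about 0.67), which is why it matters which machine is named first.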

#### **Performance Factors**

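- CPU execution time (CPU time): the time the CPU spends working on a task
  - Does not include time waiting for I/O or running other programs

$$\text{CPU time} = \text{CPU clock cycles} \times \text{clock cycle time}$$

or

$$\text{CPU time} = \frac{\text{CPU clock cycles}}{\text{clock rate}}$$

- Performance can be improved by reducing either the length of the clock cycle or the number of clock cycles required for a program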

#### **Review: Machine Clock Rate**

Clock rate (clock cycles per second, in MHz or GHz) is the inverse of clock cycle time (clock period)

10 nsec clock cycle => 100 MHz clock rate

5 nsec clock cycle => 200 MHz clock rate

2 nsec clock cycle => 500 MHz clock rate

1 nsec ( $10^{-9}$ ) clock cycle => 1 GHz ( $10^{9}$ ) clock rate

500 psec clock cycle => 2 GHz clock rate

250 psec clock cycle => 4 GHz clock rate

200 psec clock cycle => 5 GHz clock rate
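The conversions listed above all follow from taking the reciprocal of the cycle time. A minimal Python sketch, with the helper name `clock_rate_hz` chosen here for illustration:

```python
def clock_rate_hz(cycle_time_s):
    """Clock rate in Hz for a given clock cycle time in seconds."""
    if cycle_time_s <= 0:
        raise ValueError("cycle time must be positive")
    return 1.0 / cycle_time_s


# Cycle times from the list above, expressed in seconds.
cycle_times_s = (10e-9, 5e-9, 2e-9, 1e-9, 500e-12, 250e-12, 200e-12)

for cycle_time_s in cycle_times_s:
    rate_mhz = clock_rate_hz(cycle_time_s) / 1e6
    print(f"{cycle_time_s * 1e12:>8,.0f} ps cycle -> {rate_mhz:>6,.0f} MHz")
```

Everything is printed in MHz to keep the formatting simple; 1,000 MHz corresponds to 1 GHz.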


# **Improving Performance Example**

- A program runs on computer A with a 2 GHz clock in 10 seconds. What clock rate must computer B run at to run this program in 6 seconds? Unfortunately, to accomplish this, computer B will require 1.2 times as many clock cycles as computer A to run the program.

$$\text{CPU clock cycles}_A = 10 \text{ sec} \times 2 \times 10^9 \text{ cycles/sec} = 20 \times 10^9 \text{ cycles}$$

$$\text{CPU time}_B = \frac{1.2 \times 20 \times 10^9 \text{ cycles}}{\text{clock rate}_B}$$

$$\text{clock rate}_B = \frac{1.2 \times 20 \times 10^9 \text{ cycles}}{6 \text{ seconds}} = 4 \text{ GHz}$$
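The same arithmetic can be checked with a short Python sketch. The helper names `cycles_executed` and `required_clock_rate_hz` are illustrative and not from the slides; the inputs are the values given in the example.

```python
def cycles_executed(cpu_time_s, clock_rate_hz):
    """Clock cycles consumed: CPU time multiplied by clock rate."""
    return cpu_time_s * clock_rate_hz


def required_clock_rate_hz(cycles, target_time_s):
    """Clock rate needed to finish `cycles` within `target_time_s`."""
    if target_time_s <= 0:
        raise ValueError("target time must be positive")
    return cycles / target_time_s


cycles_a = cycles_executed(10.0, 2e9)   # computer A: 10 s at 2 GHz
cycles_b = 1.2 * cycles_a               # B needs 1.2 times as many cycles
rate_b = required_clock_rate_hz(cycles_b, 6.0)
print(f"computer B needs about {rate_b / 1e9:.1f} GHz")
```

The point of the example is that the clock rate has to double even though the target run time only drops from 10 to 6 seconds, because the extra cycles eat part of the gain.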

# **Clock Cycles per Instruction**

- Not all instructions take the same amount of time to execute
  - One way to think about execution time is that it equals the number of instructions executed multiplied by the average time per instruction

- Clock cycles per instruction (CPI): the average number of clock cycles each instruction takes to execute
  - A way to compare two different implementations of the same ISA

| Instruction class              | A | B | C |
|--------------------------------|---|---|---|
| CPI for this instruction class | 1 | 2 | 3 |


# **Using the Performance Equation**

- Computers A and B implement the same ISA. Computer A has a clock cycle time of 250 ps and an effective CPI of 2.0 for some program and computer B has a clock cycle time of 500 ps and an effective CPI of 1.2 for the same program. Which computer is faster and by how much?

Each computer executes the same number of instructions, *I*, so

$$\text{CPU time}_A = I \times 2.0 \times 250 \text{ ps} = 500 \times I \text{ ps}$$

$$\text{CPU time}_B = I \times 1.2 \times 500 \text{ ps} = 600 \times I \text{ ps}$$

Clearly, A is faster, by the ratio of execution times:

$$\frac{\text{performance}_A}{\text{performance}_B} = \frac{\text{execution time}_B}{\text{execution time}_A} = \frac{600 \times I \text{ ps}}{500 \times I \text{ ps}} = 1.2$$
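A small Python sketch of this comparison follows. The instruction count cancels out of the ratio, so the value used here (one billion) is an arbitrary assumption made only so the code has a concrete number to work with; `cpu_time_s` is an illustrative name.

```python
def cpu_time_s(instruction_count, cpi, cycle_time_s):
    """CPU time = instruction count x CPI x clock cycle time."""
    return instruction_count * cpi * cycle_time_s


# Hypothetical instruction count; any positive value gives the same ratio.
INSTRUCTION_COUNT = 1_000_000_000

time_a = cpu_time_s(INSTRUCTION_COUNT, cpi=2.0, cycle_time_s=250e-12)
time_b = cpu_time_s(INSTRUCTION_COUNT, cpi=1.2, cycle_time_s=500e-12)
ratio = time_b / time_a

faster = "A" if time_a < time_b else "B"
print(f"{faster} is faster by a factor of {ratio:.2f}")
```

Computer B has the lower CPI, yet A wins because its cycle time is half as long; neither factor alone decides the outcome.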

# Effective (Average) CPI

Computing the overall effective CPI is done by looking at the different types of instructions and their individual cycle counts and averaging

$$\text{Overall effective CPI} = \sum_{i=1}^{n} (CPI_i \times IC_i)$$

- IC<sub>i</sub> is the fraction (percentage) of executed instructions that belong to class i
- CPI<sub>i</sub> is the (average) number of clock cycles per instruction for that instruction class
- n is the number of instruction classes
- The overall effective CPI varies by instruction mix: a measure of the dynamic frequency of instructions across one or many programs
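The weighted sum can be sketched as follows. The per-class CPI values 1, 2 and 3 match the three instruction classes A, B and C mentioned earlier in these slides, but the instruction mix (50%, 30%, 20%) is a made-up assumption for illustration only.

```python
import math


def effective_cpi(mix):
    """Weighted-average CPI.

    `mix` is a list of (fraction_of_instructions, cpi) pairs,
    one per instruction class; the fractions must sum to 1.
    """
    total_fraction = sum(fraction for fraction, _ in mix)
    if not math.isclose(total_fraction, 1.0, abs_tol=1e-9):
        raise ValueError("instruction fractions must sum to 1")
    return sum(fraction * cpi for fraction, cpi in mix)


# Hypothetical mix: class A 50% (CPI 1), class B 30% (CPI 2), class C 20% (CPI 3).
example_mix = [(0.5, 1.0), (0.3, 2.0), (0.2, 3.0)]
overall = effective_cpi(example_mix)
print(f"overall effective CPI: {overall:.2f}")
```

Changing only the fractions changes the result even though the hardware is the same, which is the sense in which effective CPI depends on the instruction mix.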

#### **THE Performance Equation**

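Our basic performance equation is then

$$\text{CPU time} = \text{Instruction\_count} \times \text{CPI} \times \text{clock\_cycle}$$

or

$$\text{CPU time} = \frac{\text{Instruction\_count} \times \text{CPI}}{\text{clock\_rate}}$$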

- These equations separate the three key factors that affect performance
  - Can measure the CPU execution time by running the program
  - The clock rate is usually given
  - Can measure overall instruction count by using profilers/simulators without knowing all of the implementation details
  - CPI varies by instruction type and ISA implementation, for which we must know the implementation details


# **Determinants of CPU Performance**

CPU time = Instruction\_count x CPI x clock\_cycle

|                      | Instruction_<br>count | CPI | clock_cycle |
|----------------------|-----------------------|-----|-------------|
| Algorithm            | X                     | X   |             |
| Programming language | X                     | X   |             |
| Compiler             | X                     | X   |             |
| ISA                  | X                     | X   | X           |
| Core organization    |                       | X   | X           |
| Technology           |                       |     | X           |


# A Simple Example

| Op     | Freq | CPI <sub>i</sub> | Freq x CPI <sub>i</sub> | Load in 2 cycles | Branch in 1 cycle | Two ALU ops at once |
|--------|------|------------------|-------------------------|------------------|-------------------|---------------------|
| ALU    | 50%  | 1                | .5                      | .5               | .5                | .25                 |
| Load   | 20%  | 5                | 1.0                     | .4               | 1.0               | 1.0                 |
| Store  | 10%  | 3                | .3                      | .3               | .3                | .3                  |
| Branch | 20%  | 2                | .4                      | .4               | .2                | .4                  |
|        |      | $\Sigma =$       | 2.2                     | 1.6              | 2.0               | 1.95                |

- How much faster would the machine be if a better data cache reduced the average load time to 2 cycles?
  - CPU time new = $1.6 \times IC \times CC$, so 2.2/1.6 means 37.5% faster
- How does this compare with using branch prediction to shave a cycle off the branch time?
  - CPU time new = $2.0 \times IC \times CC$, so 2.2/2.0 means 10% faster
- What if two ALU instructions could be executed at once?
  - CPU time new = $1.95 \times IC \times CC$, so 2.2/1.95 means 12.8% faster
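The three what-if scenarios can be reproduced with a short Python sketch. The frequencies and CPI values come from the table in this example; the dictionary layout and the keyword-override mechanism of `effective_cpi` are illustrative choices, not part of the slides.

```python
# Instruction mix from the example: operation -> (frequency, CPI).
BASE_MIX = {
    "ALU": (0.50, 1.0),
    "Load": (0.20, 5.0),
    "Store": (0.10, 3.0),
    "Branch": (0.20, 2.0),
}


def effective_cpi(mix, **cpi_overrides):
    """Sum of frequency x CPI, with optional per-operation CPI overrides."""
    return sum(
        freq * cpi_overrides.get(op, cpi)
        for op, (freq, cpi) in mix.items()
    )


base = effective_cpi(BASE_MIX)

scenarios = [
    ("better data cache (Load CPI 2)", {"Load": 2.0}),
    ("branch prediction (Branch CPI 1)", {"Branch": 1.0}),
    ("two ALU ops at once (ALU CPI 0.5)", {"ALU": 0.5}),
]

print(f"baseline CPI: {base:.2f}")
for label, overrides in scenarios:
    new_cpi = effective_cpi(BASE_MIX, **overrides)
    # Instruction count and clock cycle are unchanged, so the CPU time
    # ratio equals the CPI ratio.
    print(f"{label}: CPI {new_cpi:.2f}, {100 * (base / new_cpi - 1):.1f}% faster")
```

The load improvement pays off most because loads combine a sizeable frequency with the highest CPI, even though ALU operations are the most common class.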

#### **Workloads and Benchmarks**

- Benchmarks: a set of programs that form a "workload" specifically chosen to measure performance
- SPEC (System Performance Evaluation Cooperative) creates standard sets of benchmarks, starting with SPEC89. The latest is SPEC CPU2006, which consists of 12 integer benchmarks (CINT2006) and 17 floating-point benchmarks (CFP2006).

www.spec.org

- There are also benchmark collections for power workloads (SPECpower\_ssj2008), for mail workloads (SPECmail2008), for multimedia workloads (mediabench), and others.

# SPEC CINT2006 on Barcelona (CC = 0.4 x 10<sup>-9</sup> s)

| Name           | IC x 10<sup>9</sup> | CPI   | ExTime (s) | RefTime (s) | SPEC ratio |
|----------------|---------------------|-------|------------|-------------|------------|
| perl           | 2,118               | 0.75  | 637        | 9,770       | 15.3       |
| bzip2          | 2,389               | 0.85  | 817        | 9,650       | 11.8       |
| gcc            | 1,050               | 1.72  | 724        | 8,050       | 11.1       |
| mcf            | 336                 | 10.00 | 1,345      | 9,120       | 6.8        |
| go             | 1,658               | 1.09  | 721        | 10,490      | 14.6       |
| hmmer          | 2,783               | 0.80  | 890        | 9,330       | 10.5       |
| sjeng          | 2,176               | 0.96  | 837        | 12,100      | 14.5       |
| libquantum     | 1,623               | 1.61  | 1,047      | 20,720      | 19.8       |
| h264avc        | 3,102               | 0.80  | 993        | 22,130      | 22.3       |
| omnetpp        | 587                 | 2.94  | 690        | 6,250       | 9.1        |
| astar          | 1,082               | 1.79  | 773        | 7,020       | 9.1        |
| xalancbmk      | 1,058               | 2.70  | 1,143      | 6,900       | 6.0        |
| Geometric Mean |                     |       |            |             | 11.7       |
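The columns of this table are related through the performance equation, which a short Python sketch can spot-check on three rows. Two things here are assumptions rather than statements from the slide: the clock cycle time is read as 0.4 ns (0.4 x 10^-9 s, i.e. a 2.5 GHz clock), and the SPEC ratio is taken to be reference time divided by execution time. The function names are illustrative.

```python
# Assumption: the clock cycle time in the slide title is 0.4 ns.
CLOCK_CYCLE_S = 0.4e-9

# (name, instruction count, CPI, listed ExTime in s, RefTime in s, listed SPEC ratio)
ROWS = [
    ("gcc", 1050e9, 1.72, 724, 8050, 11.1),
    ("mcf", 336e9, 10.00, 1345, 9120, 6.8),
    ("xalancbmk", 1058e9, 2.70, 1143, 6900, 6.0),
]


def execution_time_s(instruction_count, cpi, clock_cycle_s=CLOCK_CYCLE_S):
    """CPU time = instruction count x CPI x clock cycle time."""
    return instruction_count * cpi * clock_cycle_s


def spec_ratio(ref_time_s, exec_time_s):
    """Assumed definition: reference time divided by measured time."""
    return ref_time_s / exec_time_s


for name, ic, cpi, listed_time, ref_time, listed_ratio in ROWS:
    computed_time = execution_time_s(ic, cpi)
    computed_ratio = spec_ratio(ref_time, listed_time)
    print(
        f"{name}: computed {computed_time:.0f} s (listed {listed_time} s), "
        f"ratio {computed_ratio:.1f} (listed {listed_ratio})"
    )
```

The computed times agree with the listed ones to within about one percent; the small differences are presumably due to rounding of the CPI and instruction-count columns.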

# Comparing and Summarizing Performance

- How do we summarize the performance for a benchmark set with a single number?
  - First, the execution times are normalized to a reference machine, giving the "SPEC ratio" (bigger is faster, i.e., the SPEC ratio is inversely proportional to execution time)
  - The SPEC ratios are then "averaged" using the geometric mean (GM)

$$GM = \sqrt[n]{\prod_{i=1}^{n} \text{SPEC ratio}_i}$$

- The guiding principle in reporting performance measurements is reproducibility: list everything another experimenter would need to duplicate the experiment (version of the operating system, compiler settings, input set used, and specific computer configuration such as clock rate, cache sizes and speed, memory size and speed)
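As an illustration of the geometric mean, the sketch below applies it to the twelve SPEC ratios listed for Barcelona earlier in these slides. Computing the mean through logarithms is an implementation choice made here to avoid multiplying many values together; `geometric_mean` is an illustrative name.

```python
import math


def geometric_mean(values):
    """n-th root of the product of n positive values, computed via logs."""
    values = list(values)
    if not values or any(v <= 0 for v in values):
        raise ValueError("geometric mean needs at least one positive value")
    return math.exp(sum(math.log(v) for v in values) / len(values))


# SPEC ratios from the CINT2006 table (perl through xalancbmk).
SPEC_RATIOS = [15.3, 11.8, 11.1, 6.8, 14.6, 10.5, 14.5, 19.8, 22.3, 9.1, 9.1, 6.0]

gm = geometric_mean(SPEC_RATIOS)
print(f"geometric mean of SPEC ratios: {gm:.1f}")
```

A useful property of the geometric mean is that the ratio of two machines' means equals the mean of their per-benchmark ratios, so the comparison does not depend on which reference machine was used for normalization.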

#### **Other Performance Metrics**

- Power consumption: especially important in the embedded market, where battery life matters
  - For power-limited applications, the most important metric is energy efficiency



# **Growth in Cell Phone Sales (Embedded)**

embedded growth >> desktop growth



Where else are embedded processors found?

# **Summary: Evaluating ISAs**

- Design-time metrics:
  - Can it be implemented, in how long, at what cost?
  - Can it be programmed? Ease of compilation?
- Static Metrics:
  - How many bytes does the program occupy in memory?
- Dynamic Metrics:
  - How many instructions are executed? How many bytes does the processor fetch to execute the program?
  - How many clocks are required per instruction?
  - How "lean" a clock is practical?

Best Metric: Time to execute the program!

This depends on the instruction set, the processor organization, and compilation techniques.

