## 5DV008

Computer Architecture
Umeå University
Department of Computing Science
Stephen J. Hegner
Topic 1: Introduction

These slides are mostly taken verbatim, or with minor changes, from those prepared by
Mary Jane Irwin (www.cse.psu.edu/~mji) of The Pennsylvania State University
[Adapted from Computer Organization and Design, 4th Edition, Patterson \& Hennessy, © 2008, MK]

5DV008 20091107 ch:01 sl:1

## Key to the Slides

$\square$ The source of each slide is coded in the footer on the right side:

- Irwin CSE331 = slide by Mary Jane Irwin from the course CSE331 (Computer Organization and Design) at Pennsylvania State University.
- Irwin CSE431 = slide by Mary Jane Irwin from the course CSE431 (Computer Architecture) at Pennsylvania State University.
- Hegner UU = slide by Stephen J. Hegner at Umeå University.


## Quote for the Day

"I got the idea for the mouse while attending a talk at a computer conference. The speaker was so boring that I started daydreaming and hit upon the idea."

Doug Engelbart
http://en.wikipedia.org/wiki/Douglas_Engelbart


| Year | 2004 | 2006 | 2008 | 2010 | 2012 |
| :--- | :---: | :---: | :---: | :---: | :---: |
| Feature size (nm) | 90 | 65 | 45 | 32 | 22 |
| Integration capacity (billions of transistors) | 2 | 4 | 6 | 16 | 32 |

- Fun facts about 45 nm transistors
  - 30 million can fit on the head of a pin
  - You could fit more than 2,000 across the width of a human hair
  - If car prices had fallen at the same rate as the price of a single transistor has since 1968, a new car today would cost about 1 cent


## But What Happened to Clock Rates and Why?



## A Sea Change is at Hand

$\square$ The power challenge has forced a change in the design of microprocessors

- Since 2002 the rate of improvement in the response time of programs on desktop computers has slowed from a factor of 1.5 per year to less than a factor of 1.2 per year
- As of 2006 all desktop and server companies are shipping microprocessors with multiple processor cores per chip

| Product | AMD Barcelona | Intel Nehalem | IBM Power 6 | Sun Niagara 2 |
| :--- | :---: | :---: | :---: | :---: |
| Cores per chip | 4 | 4 | 2 | 8 |
| Clock rate | 2.5 GHz | ~2.5 GHz? | 4.7 GHz | 1.4 GHz |
| Power | 120 W | ~100 W? | ~100 W? | 94 W |

- Plan of record is to double the number of cores per chip per generation (about every two years)

5DV008 20091107 ch:01 sl:10 Irwin CSE431 PSU


## AMD's Barcelona Multicore Chip



- Four out-of-order cores on one chip
- Three levels of caches (L1, L2, L3) on chip
- Integrated Northbridge

http://www.techwarelabs.com/reviews/processors/barcelona/

## Performance Metrics

- Purchasing perspective
  - given a collection of machines, which has the
    - best performance?
    - least cost?
    - best cost/performance?
- Design perspective
  - faced with design options, which has the
    - best performance improvement?
    - least cost?
    - best cost/performance?
- Both require
  - basis for comparison
  - metric for evaluation

$\square$ Our goal is to understand what factors in the architecture contribute to overall system performance and the relative importance (and cost) of these factors

## Throughput versus Response Time

$\square$ Response time (execution time) - the time between the start and the completion of a task

- Important to individual users

$\square$ Throughput (bandwidth) - the total amount of work done in a given time

- Important to data center managers

$\square$ Will need different performance metrics as well as a different set of applications to benchmark embedded and desktop computers, which are more focused on response time, versus servers, which are more focused on throughput


## Defining (Speed) Performance

- To maximize performance, need to minimize execution time

$$
\text{performance}_{X} = \frac{1}{\text{execution\_time}_{X}}
$$

If $X$ is $n$ times faster than $Y$, then

$$
\frac{\text{performance}_{X}}{\text{performance}_{Y}} = \frac{\text{execution\_time}_{Y}}{\text{execution\_time}_{X}} = n
$$

- Decreasing response time almost always improves throughput



## A Relative Performance Example

- If computer A runs a program in 10 seconds and computer $B$ runs the same program in 15 seconds, how much faster is $A$ than $B$ ?

We know that $A$ is $n$ times faster than $B$ if

$$
\frac{\text{performance}_{A}}{\text{performance}_{B}} = \frac{\text{execution\_time}_{B}}{\text{execution\_time}_{A}} = n
$$

The performance ratio is $\frac{15}{10}=1.5$
So $A$ is 1.5 times as fast as $B$.
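
The same ratio can be checked numerically. The following is a minimal Python sketch, not part of the original slides; the function name `relative_performance` is a hypothetical helper chosen for illustration.

```python
def relative_performance(exec_time_x: float, exec_time_y: float) -> float:
    """Return n such that X is n times as fast as Y.

    Uses performance = 1 / execution_time, so
    performance_X / performance_Y = execution_time_Y / execution_time_X.
    """
    if exec_time_x <= 0 or exec_time_y <= 0:
        raise ValueError("execution times must be positive")
    return exec_time_y / exec_time_x


# Computer A takes 10 seconds and computer B takes 15 seconds for the same program.
n = relative_performance(10.0, 15.0)
print(f"A is {n} times as fast as B")
```

Note that the argument order matters: swapping the two times gives the reciprocal, which answers how much slower B is than A.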

```

\section*{Performance Factors}
\(\square\) CPU execution time (CPU time) - time the CPU spends working on a task
- Does not include time waiting for I/O or running other programs
\[
\text{CPU execution time for a program} = \text{\# CPU clock cycles for a program} \times \text{clock cycle time}
\]
or
\[
\text{CPU execution time for a program} = \frac{\text{\# CPU clock cycles for a program}}{\text{clock rate}}
\]
\(\square\) Can improve performance by reducing either the length of the clock cycle or the number of clock cycles required for a program

\section*{Review: Machine Clock Rate}
\(\square\) Clock rate (clock cycles per second in MHz or GHz ) is inverse of clock cycle time (clock period)

\[
\begin{aligned}
10 \text { nsec clock cycle } & =>100 \mathrm{MHz} \text { clock rate } \\
5 \text { nsec clock cycle } & =>200 \mathrm{MHz} \text { clock rate } \\
2 \text { nsec clock cycle } & =>500 \mathrm{MHz} \text { clock rate } \\
1 \text { nsec }\left(10^{-9}\right) \text { clock cycle } & =>1 \mathrm{GHz}\left(10^{9}\right) \text { clock rate } \\
500 \text { psec clock cycle } & =>2 \mathrm{GHz} \text { clock rate } \\
250 \text { psec clock cycle } & =>4 \mathrm{GHz} \text { clock rate } \\
200 \text { psec clock cycle } & =>5 \mathrm{GHz} \text { clock rate }
\end{aligned}
\]

\section*{Improving Performance Example}
- A program runs on computer A with a 2 GHz clock in 10 seconds. What clock rate must computer B run at to run this program in 6 seconds? Unfortunately, to accomplish this, computer B will require 1.2 times as many clock cycles as computer A to run the program.
\[
\text{CPU time}_{A} = \frac{\text{CPU clock cycles}_{A}}{\text{clock rate}_{A}}
\]

CPU clock cycles \(_{A}=10 \sec \times 2 \times 10^{9}\) cycles \(/ \mathrm{sec}\) \(=20 \times 10^{9}\) cycles
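
% Completing the calculation: a worked sketch that uses only the numbers given
% in the problem statement. Computer B needs 1.2 times as many clock cycles as
% computer A and must finish in 6 seconds.
\[
\text{clock rate}_{B} = \frac{\text{CPU clock cycles}_{B}}{\text{CPU time}_{B}} = \frac{1.2 \times 20 \times 10^{9} \text{ cycles}}{6 \text{ sec}} = 4 \times 10^{9} \text{ cycles/sec} = 4 \text{ GHz}
\]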



\section*{Using the Performance Equation}
\(\square\) Computers \(A\) and \(B\) implement the same ISA. Computer A has a clock cycle time of 250 ps and an effective CPI of 2.0 for some program and computer B has a clock cycle time of 500 ps and an effective CPI of 1.2 for the same program. Which computer is faster and by how much?

Each computer executes the same number of instructions, \(I\), so
\[
\begin{aligned}
& \text{CPU time}_{A} = I \times 2.0 \times 250 \text{ ps} = 500 \times I \text{ ps} \\
& \text{CPU time}_{B} = I \times 1.2 \times 500 \text{ ps} = 600 \times I \text{ ps}
\end{aligned}
\]

Clearly, A is faster, by the ratio of execution times:
\[
\frac{\text{performance}_{A}}{\text{performance}_{B}} = \frac{\text{execution\_time}_{B}}{\text{execution\_time}_{A}} = \frac{600 \times I \text{ ps}}{500 \times I \text{ ps}} = 1.2
\]


\section*{Effective (Average) CPI}
\(\square\) Computing the overall effective CPI is done by looking at the different types of instructions and their individual cycle counts and averaging

Overall effective CPI \(=\sum_{i=1}^{n}\left(\right.\) CPI \(\left._{i} \times I C_{i}\right)\)
- Where \(I C_{i}\) is the fraction (percentage) of all executed instructions that belong to class \(i\)
- \(\mathrm{CPI}_{\mathrm{i}}\) is the (average) number of clock cycles per instruction for that instruction class
- \(n\) is the number of instruction classes
\(\square\) The overall effective CPI varies by instruction mix - a measure of the dynamic frequency of instructions across one or many programs

\section*{THE Performance Equation}
- Our basic performance equation is then
\[
\text{CPU time} = \text{Instruction\_count} \times \text{CPI} \times \text{clock\_cycle}
\]
or
\[
\text{CPU time} = \frac{\text{Instruction\_count} \times \text{CPI}}{\text{clock\_rate}}
\]
\(\square\) These equations separate the three key factors that affect performance
- Can measure the CPU execution time by running the program
- The clock rate is usually given
- Can measure overall instruction count by using profilers/simulators without knowing all of the implementation details
- CPI varies by instruction type and ISA implementation, for which we must know the implementation details
\section*{Determinants of CPU Performance}

CPU time = Instruction_count x CPI x clock_cycle
\begin{tabular}{|l|c|c|c|}
\hline & \begin{tabular}{c} 
Instruction_ \\
count
\end{tabular} & CPI & clock_cycle \\
\hline Algorithm & x & x & \\
\hline \begin{tabular}{l} 
Programming \\
language
\end{tabular} & x & x & \\
\hline Compiler & x & x & \\
\hline ISA & x & x & x \\
\hline \begin{tabular}{l} 
Core \\
organization
\end{tabular} & & x & x \\
\hline Technology & & & x \\
\hline
\end{tabular}

\section*{A Simple Example}
\begin{tabular}{|l|r|r|r|r|r|r|}
\hline Op & Freq & CPI\(_{i}\) & Freq \(\times\) CPI\(_{i}\) & Load CPI = 2 & Branch CPI = 1 & Two ALU at once \\
\hline ALU & 50\% & 1 & 0.5 & 0.5 & 0.5 & 0.25 \\
\hline Load & 20\% & 5 & 1.0 & 0.4 & 1.0 & 1.0 \\
\hline Store & 10\% & 3 & 0.3 & 0.3 & 0.3 & 0.3 \\
\hline Branch & 20\% & 2 & 0.4 & 0.4 & 0.2 & 0.4 \\
\hline & & & \(\Sigma=2.2\) & 1.6 & 2.0 & 1.95 \\
\hline
\end{tabular}
\(\square\) How much faster would the machine be if a better data cache reduced the average load time to 2 cycles?

CPU time new \(=1.6 \times\) IC \(\times\) CC so \(2.2 / 1.6\) means \(37.5 \%\) faster

\(\square\) How does this compare with using branch prediction to shave a cycle off the branch time?

CPU time new \(=2.0 \times\) IC \(\times\) CC so \(2.2 / 2.0\) means \(10 \%\) faster

\(\square\) What if two ALU instructions could be executed at once?

CPU time new \(=1.95 \times\) IC \(\times\) CC so \(2.2 / 1.95\) means \(12.8 \%\) faster

\section*{Workloads and Benchmarks}
\(\square\) Benchmarks - a set of programs that form a "workload" specifically chosen to measure performance
- SPEC (System Performance Evaluation Cooperative) creates standard sets of benchmarks starting with SPEC89. The latest is SPEC CPU2006 which consists of 12 integer benchmarks (CINT2006) and 17 floating-point benchmarks (CFP2006).

> www.spec.org
- There are also benchmark collections for power workloads (SPECpower_ssj2008), for mail workloads (SPECmail2008), for multimedia workloads (mediabench), ...

\begin{tabular}{|l|c|c|c|c|c|}
\hline Name & IC \(\times 10^{9}\) & CPI & ExTime & RefTime & SPEC ratio \\
\hline perl & 2,118 & 0.75 & 637 & 9,770 & 15.3 \\
\hline bzip2 & 2,389 & 0.85 & 817 & 9,650 & 11.8 \\
\hline gcc & 1,050 & 1.72 & 724 & 8,050 & 11.1 \\
\hline mcf & 336 & 10.00 & 1,345 & 9,120 & 6.8 \\
\hline go & 1,658 & 1.09 & 721 & 10,490 & 14.6 \\
\hline hmmer & 2,783 & 0.80 & 890 & 9,330 & 10.5 \\
\hline sjeng & 2,176 & 0.96 & 837 & 12,100 & 14.5 \\
\hline libquantum & 1,623 & 1.61 & 1,047 & 20,720 & 19.8 \\
\hline h264avc & 3,102 & 0.80 & 993 & 22,130 & 22.3 \\
\hline omnetpp & 587 & 2.94 & 690 & 6,250 & 9.1 \\
\hline astar & 1,082 & 1.79 & 773 & 7,020 & 9.1 \\
\hline xalancbmk & 1,058 & 2.70 & 1,143 & 6,900 & 6.0 \\
\hline \multicolumn{1}{|c|}{ Geometric Mean } & & & & & 11.7 \\
\hline
\end{tabular}


\section*{Comparing and Summarizing Performance}
- How do we summarize the performance for a benchmark set with a single number?
- First the execution times are normalized giving the "SPEC ratio" (bigger is faster, since the SPEC ratio is the reference time divided by the measured execution time)
- The SPEC ratios are then "averaged" using the geometric mean (GM)
\[
\mathrm{GM}=\sqrt[n]{\prod_{i=1}^{n} \text{SPEC ratio}_{i}}
\]
\(\square\) Guiding principle in reporting performance measurements is reproducibility - list everything another experimenter would need to duplicate the experiment (version of the operating system, compiler settings, input set used, specific computer configuration (clock rate, cache sizes and speed, memory size and speed, etc.))
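
% Worked check: a sketch that applies the GM formula to the twelve values in
% the SPEC ratio column of the benchmark table (n = 12). The product is
% evaluated numerically and rounded to one decimal place.
\[
\mathrm{GM} = \sqrt[12]{15.3 \times 11.8 \times 11.1 \times 6.8 \times 14.6 \times 10.5 \times 14.5 \times 19.8 \times 22.3 \times 9.1 \times 9.1 \times 6.0} \approx 11.7
\]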


\section*{Other Performance Metrics}
- Power consumption - especially in the embedded market where battery life is important
- For power-limited applications, the most important metric is energy efficiency

embedded growth >> desktop growth

- Where else are embedded processors found?

\section*{Summary: Evaluating ISAs}
- Design-time metrics:
- Can it be implemented, in how long, at what cost?
- Can it be programmed? Ease of compilation?

- Static Metrics:
- How many bytes does the program occupy in memory?
- Dynamic Metrics:
- How many instructions are executed? How many bytes does the processor fetch to execute the program?
- How many clocks are required per instruction?
- How "lean" a clock is practical?

Best Metric: Time to execute the program!
It depends on the instruction set, the processor organization, and compilation techniques.
```
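
The effective-CPI arithmetic from the "A Simple Example" slide can be reproduced with a short script. This is a minimal Python sketch, not part of the original slides: the names `BASE_MIX`, `effective_cpi`, `with_cpi`, and `speedup` are hypothetical helpers, and executing two ALU instructions at once is modeled as halving the ALU CPI to 0.5, which matches the 0.25 entry in the table. Because instruction count and clock cycle time are unchanged in each scenario, the speedup reduces to the ratio of the CPIs.

```python
# Instruction mix from the "A Simple Example" slide: class -> (frequency, CPI).
BASE_MIX = {
    "ALU": (0.5, 1.0),
    "Load": (0.2, 5.0),
    "Store": (0.1, 3.0),
    "Branch": (0.2, 2.0),
}


def effective_cpi(mix):
    """Overall effective CPI: sum over classes of frequency_i * CPI_i."""
    return sum(freq * cpi for freq, cpi in mix.values())


def with_cpi(mix, op, new_cpi):
    """Return a copy of the mix with the CPI of one instruction class replaced."""
    changed = dict(mix)
    freq, _ = mix[op]
    changed[op] = (freq, new_cpi)
    return changed


def speedup(old_cpi, new_cpi):
    """Speedup when instruction count and clock cycle time stay the same."""
    return old_cpi / new_cpi


base = effective_cpi(BASE_MIX)
scenarios = {
    "better data cache (load CPI 2)": with_cpi(BASE_MIX, "Load", 2.0),
    "branch prediction (branch CPI 1)": with_cpi(BASE_MIX, "Branch", 1.0),
    "two ALU instructions at once (ALU CPI 0.5)": with_cpi(BASE_MIX, "ALU", 0.5),
}

print(f"base CPI = {base:.2f}")
for name, mix in scenarios.items():
    cpi = effective_cpi(mix)
    print(f"{name}: CPI = {cpi:.2f}, speedup = {speedup(base, cpi):.3f}")
```

Keeping the mix as data makes it easy to try other design options, such as a different load frequency, without touching the formula.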

