## 5DV008 Computer Architecture Umeå University Department of Computing Science Stephen J. Hegner

# Topic 5: The Memory Hierarchy Part A: Caches

These slides are mostly taken verbatim, or with minor changes, from those prepared by

#### Mary Jane Irwin (www.cse.psu.edu/~mji)

of The Pennsylvania State University [Adapted from *Computer Organization and Design, 4<sup>th</sup> Edition,* Patterson & Hennessy, © 2008, MK]

5DV008 20090212 t:5A sl:1

Hegner UU

## Key to the Slides

- The source of each slide is coded in the footer on the right side:
  - Irwin CSE331 = slide by Mary Jane Irwin from the course CSE331 (Computer Organization and Design) at Pennsylvania State University.
  - Irwin CSE431 = slide by Mary Jane Irwin from the course CSE431 (Computer Architecture) at Pennsylvania State University.
  - Hegner UU = slide by Stephen J. Hegner at Umeå University.


Hegner UU













# The Memory Hierarchy Goal

- Fact: Large memories are slow and fast memories are small
- How do we create a memory that gives the illusion of being large, cheap and fast (most of the time)?
  - With hierarchy
  - With parallelism


# A Typical Memory Hierarchy

Take advantage of the principle of locality to present the user with as much memory as is available in the cheapest technology at the speed offered by the fastest technology





### Memory Hierarchy Technologies

- Caches use SRAM for speed and technology compatibility
  - Fast (typical access times of 0.5 to 2.5 nsec)
  - Low density (6 transistor cells), higher power, expensive (\$2000 to \$5000 per GB in 2008)
  - Static: content will last "forever" (as long as power is left on)

- Main memory uses *DRAM* for size (density)
  - Slower (typical access times of 50 to 70 nsec)
  - High density (1 transistor cells), lower power, cheaper (\$20 to \$75 per GB in 2008)
  - Dynamic: needs to be "refreshed" regularly (~ every 8 ms)
    - Consumes 1% to 2% of the active cycles of the DRAM
  - Addresses divided into 2 halves (row and column)
    - RAS or Row Access Strobe triggering the row decoder
    - CAS or Column Access Strobe triggering the column selector


Irwin CSE431 PSU

## The Memory Hierarchy: Why Does it Work?

- Temporal Locality (locality in time)
  - If a memory location is referenced then it will tend to be referenced again soon
  - ⇒ Keep most recently accessed data items closer to the processor

- Spatial Locality (locality in space)
  - If a memory location is referenced, the locations with nearby addresses will tend to be referenced soon
  - ⇒ Move blocks consisting of contiguous words closer to the processor














 Have a tag associated with each cache block that contains the address information (the upper portion of the address) required to identify the block (to answer Q1)

Irwin CSE431 PSU


[Figure: a direct-mapped cache of four one-word blocks (fields Index, Valid, Tag, Data; indices 00, 01, 10, 11) beside a main memory of 16 words with addresses 0000xx to 1111xx]

- One word blocks: the two low order bits define the byte in the word (32b words)
- Q1: Is it there? Compare the cache tag to the high order 2 memory address bits to tell if the memory block is in the cache
- Q2: How do we find it? Use the next 2 low order memory address bits – the index – to determine which cache block (i.e., modulo the number of blocks in the cache)
























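## Taking Advantage of Spatial Locality

Let cache block hold more than one word

Reference string (word addresses): 0 1 2 3 4 3 4 15

Start with an empty cache - all blocks initially marked as not valid. Each row shows the tag and data held by the two cache blocks after the access.

| Reference | Result | Cache block 0    | Cache block 1      |
|-----------|--------|------------------|--------------------|
| 0         | miss   | 00 Mem(1) Mem(0) | not valid          |
| 1         | hit    | 00 Mem(1) Mem(0) | not valid          |
| 2         | miss   | 00 Mem(1) Mem(0) | 00 Mem(3) Mem(2)   |
| 3         | hit    | 00 Mem(1) Mem(0) | 00 Mem(3) Mem(2)   |
| 4         | miss   | 01 Mem(5) Mem(4) | 00 Mem(3) Mem(2)   |
| 3         | hit    | 01 Mem(5) Mem(4) | 00 Mem(3) Mem(2)   |
| 4         | hit    | 01 Mem(5) Mem(4) | 00 Mem(3) Mem(2)   |
| 15        | miss   | 01 Mem(5) Mem(4) | 11 Mem(15) Mem(14) |

- 8 requests, 4 misses

Irwin CSE431 PSU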
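The trace for the reference string 0 1 2 3 4 3 4 15 can be checked with a short simulation. This is only a sketch: the function name `simulate_direct_mapped` and its parameters are made up for illustration, and it assumes a direct-mapped cache with two blocks of two words each, as in the example.

```python
def simulate_direct_mapped(refs, num_blocks, words_per_block):
    """Return a list of (word address, "hit" or "miss") for a direct-mapped cache."""
    tags = [None] * num_blocks  # None marks a block that is not valid
    trace = []
    for addr in refs:
        block_addr = addr // words_per_block  # which memory block holds the word
        index = block_addr % num_blocks       # which cache block it maps to
        tag = block_addr // num_blocks        # remaining upper address bits
        if tags[index] == tag:
            trace.append((addr, "hit"))
        else:
            tags[index] = tag                 # fetch the block, replacing the old one
            trace.append((addr, "miss"))
    return trace


trace = simulate_direct_mapped([0, 1, 2, 3, 4, 3, 4, 15],
                               num_blocks=2, words_per_block=2)
misses = sum(1 for _, outcome in trace if outcome == "miss")
print(misses, "misses out of", len(trace), "requests")
```

Rerunning the same string with `words_per_block=1` and `num_blocks=4` (the same total capacity) is a quick way to see how much the two-word blocks help here.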

## Miss Rate vs Block Size vs Cache Size



Miss rate goes up if the block size becomes a significant fraction of the cache size because the number of blocks that can be held in the same size cache is smaller (increasing capacity misses)


#### **Cache Field Sizes**

- The number of bits in a cache includes both the storage for data and for the tags
  - 32-bit byte address
  - For a direct mapped cache with 2<sup>n</sup> blocks, *n* bits are used for the index
  - For a block size of 2<sup>m</sup> words (2<sup>m+2</sup> bytes), *m* bits are used to address the word within the block and 2 bits are used to address the byte within the word
- What is the size of the tag field?
- The total number of bits in a direct-mapped cache is then 2<sup>n</sup> × (block size + tag field size + valid field size)
- How many total bits are required for a direct mapped cache with 16KB of data and 4-word blocks assuming a 32-bit address?
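The two questions above can be worked through numerically. The sketch below assumes 32-bit words and one valid bit per block, and takes the tag to be whatever is left of the address after the index, block offset and byte offset are removed; the function name `direct_mapped_cache_bits` is hypothetical.

```python
def direct_mapped_cache_bits(data_bytes, words_per_block, address_bits=32):
    """Return (tag bits, total bits) for a direct-mapped cache with 32-bit words."""
    bytes_per_word = 4
    num_blocks = data_bytes // (words_per_block * bytes_per_word)
    n = num_blocks.bit_length() - 1        # index bits, since num_blocks = 2**n
    m = words_per_block.bit_length() - 1   # block offset bits, words_per_block = 2**m
    tag_bits = address_bits - (n + m + 2)  # 2 bits select the byte within the word
    block_bits = words_per_block * 32      # data bits per block
    total_bits = num_blocks * (block_bits + tag_bits + 1)  # +1 for the valid bit
    return tag_bits, total_bits


tag_bits, total_bits = direct_mapped_cache_bits(16 * 1024, 4)
print(tag_bits, "tag bits,", total_bits, "total bits")
```

For the 16KB cache with 4-word blocks this gives 18 tag bits and 2<sup>10</sup> × (128 + 18 + 1) = 150528 bits, so the tags and valid bits add roughly 15% to the data storage.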


Irwin CSE431 PSU


# Handling Cache Hits

- Read hits (I\$ and D\$)
  - this is what we want!

- Write hits (D\$ only)
  - require the cache and memory to be consistent
    - always write the data into both the cache block and the next level in the memory hierarchy (write-through)
    - writes run at the speed of the next level in the memory hierarchy, so slow! – or can use a write buffer and stall only if the write buffer is full
  - allow cache and memory to be inconsistent
    - write the data only into the cache block (write-back the cache block to the next level in the memory hierarchy when that cache block is "evicted")
    - need a dirty bit for each data cache block to tell if it needs to be written back to memory when it is evicted – can use a write buffer to help "buffer" write-backs of dirty blocks
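The difference in traffic to the next level can be illustrated by counting writes under each policy. This is a rough sketch, not a model of any real controller: `memory_writes` is a made-up helper that assumes a direct-mapped cache with one-word blocks, allocation on every write miss, and a final flush of dirty blocks so the two policies are compared on equal terms.

```python
def memory_writes(addresses, num_blocks, policy):
    """Count writes sent to the next level for a stream of word writes.

    policy is "write-through" or "write-back".
    """
    tags = [None] * num_blocks    # None marks a block that is not valid
    dirty = [False] * num_blocks  # dirty bits, used by write-back only
    writes = 0
    for addr in addresses:
        index, tag = addr % num_blocks, addr // num_blocks
        if tags[index] != tag:
            # Miss: the old block is evicted; write it back first if it is dirty.
            if policy == "write-back" and dirty[index]:
                writes += 1
            tags[index] = tag
            dirty[index] = False
        if policy == "write-through":
            writes += 1           # every write also goes to the next level
        else:
            dirty[index] = True   # defer the write until the block is evicted
    writes += sum(dirty)          # assumed final flush of remaining dirty blocks
    return writes


stream = [0, 0, 0, 4, 0]  # words 0 and 4 map to the same block of a 4-block cache
print(memory_writes(stream, 4, "write-through"))
print(memory_writes(stream, 4, "write-back"))
```

Repeated writes to the same block cost one next-level write each under write-through, but only one write per eviction under write-back.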


# Sources of Cache Misses

- Compulsory (cold start or process migration, first reference):
  - First access to a block, a "cold" fact of life; there is not a whole lot you can do about it. If you are going to run "millions" of instructions, compulsory misses are insignificant
  - Solution: increase block size (increases miss penalty; very large blocks could increase miss rate)
- Capacity:
  - Cache cannot contain all blocks accessed by the program
  - Solution: increase cache size (may increase access time)
- Conflict (collision):
  - Multiple memory locations mapped to the same cache location
  - Solution 1: increase cache size
  - Solution 2: increase associativity (stay tuned) (may increase access time)


Irwin CSE431 PSU

### Handling Cache Misses (Single Word Blocks)

- Read misses (I\$ and D\$)
  - stall the pipeline, fetch the block from the next level in the memory hierarchy, install it in the cache and send the requested word to the processor, then let the pipeline resume
- Write misses (D\$ only)
  - Write allocate for multiple-word blocks: stall the pipeline, fetch the block from next level in the memory hierarchy, install it in the cache (which may involve having to evict a dirty block if using a write-back cache), write the word from the processor to the cache, then let the pipeline resume, or
  - Faster write allocate for single-word blocks: just write the word into the cache updating both the tag and data, no need to check for cache hit, no need to stall, or
  - No-write allocate: skip the cache write (but must invalidate that cache block since it will now hold stale data) and just write the word to the write buffer (and eventually to the next memory level), no need to stall if the write buffer isn't full


Irwin CSE431 PSU

## Multiword Block Considerations

- Read misses (I\$ and D\$)
  - Processed the same as for single word blocks: a miss returns the entire block from memory
  - Miss penalty grows as block size grows
    - Early restart: processor resumes execution as soon as the requested word of the block is returned
    - Requested word first: requested word is transferred from the memory to the cache (and processor) first
  - Nonblocking cache: allows the processor to continue to access the cache while the cache is handling an earlier miss
- Write misses (D\$)
  - If using write allocate, must *first* fetch the block from memory and then write the word to the block (or could end up with a "garbled" block in the cache (e.g., for 4 word blocks, a new tag, one word of data from the new block, and three words of data from the old block))


## Memory Systems that Support Caches





### **DRAM Size Increase**

Add a table like figure 5.12 to show DRAM growth since 1980


























#### Impacts of Cache Performance

Relative cache penalty increases as processor performance improves (faster clock rate and/or lower CPI)

- The memory speed is unlikely to improve as fast as processor cycle time. When calculating  $\mathrm{CPI}_{\mathrm{stall}}$  , the cache miss penalty is measured in processor clock cycles needed to handle a miss
- The lower the CPI<sub>ideal</sub>, the more pronounced the impact of stalls
- A processor with a CPI<sub>ideal</sub> of 2, a 100 cycle miss penalty, 36% load/store instr's, and 2% I\$ and 4% D\$ miss rates
  - Memory-stall cycles = 2% × 100 + 36% × 4% × 100 = 3.44
  - So CPI<sub>stalls</sub> = 2 + 3.44 = **5.44**, more than twice the CPI<sub>ideal</sub>!
- What if the CPI<sub>ideal</sub> is reduced to 1? 0.5? 0.25?
- What if the D\$ miss rate went up 1%? 2%?
- What if the processor clock rate is doubled (doubling the miss penalty)?
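The "what if" questions can be explored by putting the stall calculation into a small function. This is a sketch using the slide's own numbers; the function name `cpi_with_stalls` and its parameter names are illustrative, not taken from the course material.

```python
def cpi_with_stalls(cpi_ideal, miss_penalty, ldst_frac, i_miss, d_miss):
    """CPI including memory stalls: every instruction is fetched through the I$,
    and only the load/store fraction of instructions accesses the D$."""
    stall_cycles = i_miss * miss_penalty + ldst_frac * d_miss * miss_penalty
    return cpi_ideal + stall_cycles


base = cpi_with_stalls(2, 100, 0.36, 0.02, 0.04)
print(f"baseline: {base:.2f}")

for cpi_ideal in (1, 0.5, 0.25):
    print(f"CPI_ideal {cpi_ideal}: {cpi_with_stalls(cpi_ideal, 100, 0.36, 0.02, 0.04):.2f}")

for d_miss in (0.05, 0.06):
    print(f"D$ miss rate {d_miss:.0%}: {cpi_with_stalls(2, 100, 0.36, 0.02, d_miss):.2f}")

print(f"miss penalty 200: {cpi_with_stalls(2, 200, 0.36, 0.02, 0.04):.2f}")
```

The stall term stays at 3.44 cycles however small CPI<sub>ideal</sub> becomes, so with a CPI<sub>ideal</sub> of 0.25 stalls account for more than 90% of the total 3.69.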


Irwin CSE431 PSU

# Average Memory Access Time (AMAT)

- A larger cache will have a longer access time. An increase in hit time will likely add another stage to the pipeline. At some point the increase in hit time for a larger cache will overcome the improvement in hit rate leading to a decrease in performance.
- Average Memory Access Time (AMAT) is the average time to access memory considering both hits and misses

AMAT = Time for a hit + Miss rate × Miss penalty

What is the AMAT for a processor with a 20 psec clock, a miss penalty of 50 clock cycles, a miss rate of 0.02 misses per instruction and a cache access time of 1 clock cycle?
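The question above can be answered by applying the AMAT formula directly. The sketch below treats the stated 0.02 misses per instruction as the miss rate in the formula, which is an assumption; `amat_cycles` is a hypothetical helper name.

```python
def amat_cycles(hit_time, miss_rate, miss_penalty):
    """Average memory access time in clock cycles."""
    return hit_time + miss_rate * miss_penalty


clock_psec = 20
cycles = amat_cycles(hit_time=1, miss_rate=0.02, miss_penalty=50)
amat_psec = cycles * clock_psec
print(f"AMAT = {cycles:.1f} cycles = {amat_psec:.0f} psec")
```

With these numbers AMAT = 1 + 0.02 × 50 = 2 clock cycles, or 40 psec at a 20 psec clock.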




(block address) modulo (# sets in the cache)

Irwin CSE431 PSU


# Another Reference String Mapping

Consider the main memory word reference string























## Range of Set Associative Caches

For a fixed size cache, each increase by a factor of two in associativity doubles the number of blocks per set (i.e., the number of ways) and halves the number of sets – decreases the size of the index by 1 bit and increases the size of the tag by 1 bit

| Tag | Index | Block offset | Byte offset |
|-----|-------|--------------|-------------|
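The trade between index and tag bits can be tabulated for a small cache. This sketch assumes a 32-bit byte address, 32-bit words, and an arbitrary illustrative cache of 8 one-word blocks; the function name `field_widths` is made up.

```python
def field_widths(num_blocks, ways, words_per_block, address_bits=32):
    """Return (number of sets, index bits, tag bits) for a set associative cache."""
    num_sets = num_blocks // ways
    index_bits = num_sets.bit_length() - 1               # num_sets = 2**index_bits
    block_offset_bits = words_per_block.bit_length() - 1
    byte_offset_bits = 2                                 # 4 bytes per word
    tag_bits = address_bits - index_bits - block_offset_bits - byte_offset_bits
    return num_sets, index_bits, tag_bits


for ways in (1, 2, 4, 8):
    sets, index_bits, tag_bits = field_widths(num_blocks=8, ways=ways, words_per_block=1)
    print(f"{ways}-way: {sets} sets, {index_bits} index bits, {tag_bits} tag bits")
```

The 1-way case is the direct-mapped cache (largest index, smallest tag) and the 8-way case is fully associative for this size (no index, and the tag is everything except the offsets).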


Irwin CSE431 PSU



- Data available after set selection (and Hit/Miss decision). In a direct mapped cache, the cache block is available before the Hit/Miss decision
  - So it's not possible to just assume a hit and continue and recover later if it was a miss


Irwin CSE431 PSU





### Reducing Cache Miss Rates #2

- 1. Use multiple levels of caches
- With advancing technology, there is more than enough room on the die for bigger L1 caches or for a second level of caches – normally a unified L2 cache (i.e., it holds both instructions and data) and in some cases even a unified L3 cache
- For our example, CPI<sub>ideal</sub> of 2, 100 cycle miss penalty (to main memory) and a 25 cycle miss penalty (to UL2\$), 36% load/stores, a 2% (4%) L1 I\$ (D\$) miss rate, add a 0.5% UL2\$ miss rate

CPI<sub>stalls</sub> = 2 + .02×25 + .36×.04×25 + .005×100 + .36×.005×100 = 3.54 (as compared to 5.44 with no L2\$)
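The two-level calculation can be reproduced term by term. This is a sketch following the formula on the slide, where L1 misses pay the 25 cycle penalty to reach the UL2\$ and the 0.5% of accesses that also miss in the UL2\$ pay a further 100 cycles; `cpi_with_l2` and its parameter names are illustrative.

```python
def cpi_with_l2(cpi_ideal, l2_penalty, mem_penalty, ldst_frac,
                l1_i_miss, l1_d_miss, l2_miss):
    """CPI with stalls for split L1 caches backed by a unified L2 cache."""
    # L1 misses served by the L2 cache (instruction fetches plus loads/stores)
    l1_stalls = l1_i_miss * l2_penalty + ldst_frac * l1_d_miss * l2_penalty
    # Accesses that also miss in L2 and go to main memory
    l2_stalls = l2_miss * mem_penalty + ldst_frac * l2_miss * mem_penalty
    return cpi_ideal + l1_stalls + l2_stalls


cpi = cpi_with_l2(cpi_ideal=2, l2_penalty=25, mem_penalty=100, ldst_frac=0.36,
                  l1_i_miss=0.02, l1_d_miss=0.04, l2_miss=0.005)
print(f"CPI with L2: {cpi:.2f}")
```

The individual terms are 0.5 + 0.36 for L1 misses and 0.5 + 0.18 for L2 misses, which together with the CPI<sub>ideal</sub> of 2 give 3.54.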


## Multilevel Cache Design Considerations

- Design considerations for L1 and L2 caches are very different
  - Primary cache should focus on minimizing hit time in support of a shorter clock cycle
    - Smaller with smaller block sizes
  - Secondary cache(s) should focus on reducing miss rate to reduce the penalty of long main memory access times
    - Larger with larger block sizes
    - Higher levels of associativity
- The miss penalty of the L1 cache is significantly reduced by the presence of an L2 cache – so it can be smaller (i.e., faster) but have a higher miss rate
- For the L2 cache, hit time is less important than miss rate
  - The L2\$ hit time determines L1\$'s miss penalty
  - L2\$ local miss rate >> the global miss rate


Irwin CSE431 PSU

# Using the Memory Hierarchy Well

□ Include plots from Figure 5.18

|                                 | Intel Nehalem                                            | AMD Barcelona                                            |
|---------------------------------|----------------------------------------------------------|----------------------------------------------------------|
| L1 cache<br>organization & size | Split I\$ and D\$; 32KB for<br>each per core; 64B blocks | Split I\$ and D\$; 64KB for each<br>per core; 64B blocks |
| L1 associativity                | 4-way (I), 8-way (D) set<br>assoc.; ~LRU replacement     | 2-way set assoc.; LRU<br>replacement                     |
| L1 write policy                 | write-back, write-allocate                               | write-back, write-allocate                               |
| L2 cache<br>organization & size | Unified; 256KB (0.25MB) per core; 64B blocks             | Unified; 512KB (0.5MB) per<br>core; 64B blocks           |
| L2 associativity                | 8-way set assoc.; ~LRU                                   | 16-way set assoc.; ~LRU                                  |
| L2 write policy                 | write-back, write-allocate                               | write-back, write-allocate                               |
| L3 cache<br>organization & size | Unified; 8192KB (8MB)<br>shared by cores; 64B blocks     | Unified; 2048KB (2MB)<br>shared by cores; 64B blocks     |
| L3 associativity                | 16-way set assoc.                                        | 32-way set assoc.; evict block shared by fewest cores    |
| L3 write policy                 | write-back, write-allocate                               | write-back; write-allocate                               |



# Two Older Machines' Cache Parameters

|                  | Intel P4                                 | AMD Opteron                  |
|------------------|------------------------------------------|------------------------------|
| L1 organization  | Split I\$ and D\$                        | Split I\$ and D\$            |
| L1 cache size    | 8KB for D\$, 96KB for trace cache (~I\$) | 64KB for each of I\$ and D\$ |
| L1 block size    | 64 bytes                                 | 64 bytes                     |
| L1 associativity | 4-way set assoc.                         | 2-way set assoc.             |
| L1 replacement   | ~ LRU                                    | LRU                          |
| L1 write policy  | write-through                            | write-back                   |
| L2 organization  | Unified                                  | Unified                      |
| L2 cache size    | 512KB                                    | 1024KB (1MB)                 |
| L2 block size    | 128 bytes                                | 64 bytes                     |
| L2 associativity | 8-way set assoc.                         | 16-way set assoc.            |
| L2 replacement   | ~LRU                                     | ~LRU                         |
| L2 write policy  | write-back                               | write-back                   |


Irwin CSE431 PSU

# FSM Cache Controller

Key characteristics for a simple L1 cache

- Direct mapped
- Write-back using write-allocate
- Block size of 4 32-bit words (so 16B); Cache size of 16KB (so 1024 blocks)
- 18-bit tags, 10-bit index, 2-bit block offset, 2-bit byte offset, dirty bit, valid bit, LRU bits (if set associative)
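The field widths listed above determine how a controller would slice a 32-bit byte address. The following is a minimal sketch of that slicing; the function name `split_address` and the sample address are arbitrary choices for illustration.

```python
def split_address(addr):
    """Split a 32-bit byte address into (tag, index, block offset, byte offset)
    for 18-bit tags, a 10-bit index, and two 2-bit offsets."""
    byte_offset = addr & 0x3           # bits 1..0: byte within the word
    block_offset = (addr >> 2) & 0x3   # bits 3..2: word within the 4-word block
    index = (addr >> 4) & 0x3FF        # bits 13..4: one of 1024 cache blocks
    tag = addr >> 14                   # bits 31..14: the remaining 18 bits
    return tag, index, block_offset, byte_offset


tag, index, block_offset, byte_offset = split_address(0x12345678)
print(hex(tag), hex(index), block_offset, byte_offset)
```

On a lookup, the index selects the cache block, and the access is a hit when that block is valid and its stored tag equals the tag field of the address.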







## Summary: Improving Cache Performance

#### 0. Reduce the time to hit in the cache

- smaller cache
- direct mapped cache
- smaller blocks
- for writes
  - no write allocate: no "hit" on cache, just write to write buffer
  - write allocate: to avoid two cycles (first check for hit, then write), pipeline writes via a delayed write buffer to cache

#### 1. Reduce the miss rate

- bigger cache
- more flexible placement (increase associativity)
- larger blocks (16 to 64 bytes typical)
- victim cache: small buffer holding most recently discarded blocks


### Summary: Improving Cache Performance

#### 2. Reduce the miss penalty

- smaller blocks
- use a write buffer to hold dirty blocks being replaced, so the read does not have to wait for the write to complete
- check write buffer (and/or victim cache) on read miss: may get lucky
- for large blocks, fetch critical word first
- use multiple cache levels: L2 cache not tied to CPU clock rate
- faster backing store/improved memory bandwidth
  - wider buses
  - memory interleaving, DDR SDRAMs


Irwin CSE431 PSU

