



#### Master Program (Laurea Magistrale) in Computer Science and Networking

#### High Performance Computing Systems and Enabling Platforms

Marco Vanneschi

### **1. Prerequisites Revisited**

**1.4. Memory Hierarchies and Caching** 

8 March Women's Day

with best wishes !

## Memory hierarchies





### Memory hierarchies and performance issues





MCSN - M. Vanneschi: High Performance Computing Systems and Enabling Platforms

### Memory hierarchies and address translation





Translation sequence must have very low latency  $(1 - 3 \operatorname{clock} \operatorname{cycles})$  in case of success (no fault):

- MMU (with associative memory)
- Cache Unit

Fault handling (visible or invisible to processor): allocation of the needed information.

### Memory hierarchies with Paging





Address translation applied to page identifiers (+ offset concatenation)

# Fault probability



- For a given sub-hierarchy M<sub>i</sub> M<sub>i-1</sub> (e.g. MV-MM, MM-C):
  - Fault (miss) probability, *h*
  - Page (block, line, ...) size,  $\sigma$
  - Capacity of lower memory level,  $\gamma$
  - $h = h (\sigma, \gamma, replacement strategy)$

page replacement: e.g. LRU algorithm



Experimental values for large benchmarking program sets in given application areas

#### Each program has its own fault probability. Importance of optimization techniques to reduce fault probability.



Properties of address sequences in sequential programs:

- LOCALITY (spatial locality): references to program information belong to groups of addresses which are close together.
- **REUSE** (temporal locality): some information are referred several times during the program execution.

#### **WORKING SET** of a sub-hierarchy:

Pages that, if and when allocated in the lower memory level of the sub-hierarchy, minimize the fault probability (how many and which pages)

In some cases, the compiler is able to recognize the working set and can cause some run-time actions to mantain the working set in the lower memory level of the sub-hierarchy during the program execution.



Completion time <u>of a given program</u>:

 $\mathbf{T_{c}} = \mathbf{T_{c-ideal}} + \mathbf{N_{fault}} * \mathbf{T_{block}}$ 

T<sub>c-ideal</sub> = Completion time with no faults; memory access time = cache access time

**N**<sub>fault</sub> = Average number of faults for that program

```
T<sub>block</sub> = Block transfer time
```

Relative **efficiency**:

# Examples (e.g. caching)



int A[N], B[N];
for (i = 0; i < N; i++)
A[i] = A[i] + B[i];</pre>

LOOP: LOAD RA, Ri, Ra LOAD RB, Ri, Rb ADD Ra, Rb, Rc STORE RA, Ri, Rc INCR Ri IF < Ri, RN, LOOP END

#### Reuse of instructions (loop)

 In general, the cache unit provides reuse (and prefetching) of instructions automatically

#### Data: locality only

#### Working set =

- Instructions (e.g., one block)
- One block of A
- One block of B
- + other information of the *process run-time* support (Process Control Block, Translation Table, run-time code, etc)

# Examples (e.g. caching)



| <b>int</b> A[N], B[N];                                                                | Instruct                                                                                                                      | tions: reuse (loop)                                                                                           | Reuse optimizations:                                                                                                                 |
|---------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------|
| for $(i = 0; i < N; i++)$<br>A[i] = F(A[i], B);<br>LOOP: LOAD RA, Ri, Ra<br>          | Data:<br>•<br>•<br>•<br>•<br>•                                                                                                | A: locality only<br><b>B:</b> potential <b>reuse</b>                                                          | when decidable by a<br>static analysis (in this<br>case: yes);<br>typical non-decidable<br>cases: when reuse<br>opportunities depend |
| <br>LOAD RB, Rj, Rb, r<br><br>STORE RA, Ri, Rc<br>INCR Ri<br>IF < Ri, RN, LOOP<br>END | Compiler<br>annotation:<br>if provided by the<br>assembler machine<br>if supported by the<br>firmware machine<br>(Cache Unit) | <i>be entirely</i><br><i>cache;</i><br>otherwise, if B is too la<br>cannot be exploited, o<br>only partially: | uch that it can<br>allocated in<br>arge, reuse<br>r it can be exploited<br>as of B are allocated                                     |



- Write Back
  - A block is re-written in the upper memory level when it is de-allocated

### Write Through

- Every write operations is carried out into the cache *and* into the upper memory level *in parallel*
- Effective if the memory bandwidth is greater/equal to the inverse of the average time interval between two consecutive STORE instructions

# On-demand vs prefetching strategies



- On-demand: page allocation only when a fault occurs
- Prefetching: try to anticipate the page allocation (to be ahead of fault occurrence)
  - Applied to the next page, or to a page to be determined
  - *If applied by default to data*, prefetching could lead to serious inefficiencies (examples of some Intel processors)
- Compiler annotation, e.g.

LOAD RC, Rj, Rc, prefetching

## Other caching annotations



- Useful for multiprocessors:
  - explicit de-allocation of blocks
  - explicit re-writing of blocks
- Cache coherence in multiprocessors
  - See Part 2 of the course.

# Cache fault handling



- Entirely invisible to the Processor: no exception is generated
  - On the contrary: an exception is generated in the MV-MM sub-hierarchy, and handled at the process level.
  - The Cache Unit is responsible of implementing the replacement strategy and the block transfer at the firmware level.



MCSN - M. Vanneschi: High Performance Computing Systems and Enabling Platforms

# More levels of caching

- In a program characterized by *locality only* (or few reuse opportunities), the block transfer from Main Memory to Cache could be inefficient: too large latency
  - Block transfer latency:

$$T_{block} = \sigma T_{MMaccess}$$

- The completion time is about the same of the architecture without cache !
- Even in this architecture, *reuse* can increase efficiency significantly (importance of reuse, again)

#### Interleaved memory:

- **m** independent MM modules:  $M_0, ..., M_k, ..., M_{m-1}$
- Address **j** refers module **k**:  $\mathbf{k} = \mathbf{j} \mod \mathbf{m}$
- Parallel access to m consecutive MM locations (MM bandwidth is increased by a factor m)
- Block transfer latency:

$$T_{block} = 2 T_{tr} + \frac{\sigma}{m} \tau_M + m \tau$$

Interleaving could be not sufficient to solve the problem, if MM clock cycle and link latency are too large.





### **SECONDARY CACHE (C2):**

- Capacity and block size: one order of magnitude larger than Primary Cache (C1)
- Information of more than one process in C2
- Block transfer latency:
  - If C2 is **out of** CPU chip: **m-interleaved** static memory

same formula for  $T_{block}$ , where now  $\tau_{M}$  of C2 << MM clock cycle.

- If C2 is **on-chip**: just one memory module is sufficient to achieve a good efficiency

$$T_{block} \sim \sigma \, \tau$$

applying overlapping of C2 read operations and C1 write operations.

# Cache cost model: an example



```
Completion time:
int A[N], B[N];
                                                                           T_c = T_{c-ideal} + N_{fault} * T_{block}
     for (i = 0; i < N; i++)
           A[i] = F(A[i], B);
                                               T<sub>c-ideal</sub> = Completion time with no faults
                                               N<sub>fault</sub> = Average number of faults for the program
                                               T<sub>block</sub> = Block transfer time
LOOP:
            - LOAD RA, Ri, Ra
                                                                           T_{r} = k \tau
                                               Let:
                                                                           T_{c-ideal} \sim N T_F = k N \tau
                                               Then:
              LOAD RB, Rj, Rb, reuse
                                               If B reuse can be applied:
                                                                 N_{fault} = N_{fault-instructions} + N_{fault-A} + N_{fault-B}
                                                                 \sim N_{fault-A} + N_{fault-B} = N/\sigma + N/\sigma = 2 N/\sigma
              STORE RA, Ri, Rc
                                               With <u>C2 on-chip</u>: T_{block} \sim \sigma \tau
              INCR Ri
                                                            T_c \sim k N \tau + 2 N \tau \sim T_{c-ideal} for k >> 2
              IF < Ri, RN, LOOP
                                               Then:
              END
                                               Efficiency:
                                                                          \epsilon_{cache} = T_{c-ideal} / T_c \sim 1
                                               Effect of faults is not significant, with the applied assumptions and
                                               optimizations.
```



Mapping function of MM block identifier (MB) into C block identifier (CB):

CB = mapping\_function (MB)

1) DIRECT CACHE:

CB = MB *mod* NC

where NC = number of blocks in C

+ logic for verifying the possible fault

2) FULLY ASSOCIATIVE CACHE: CB = Block\_Table (MB)

Random Association - Table Lookup implementation

Associative Memory

Direct Cache: rigid association, leading to increased fault probability in some programs; cheap realization; lowest latency in case of success

Fully Associative: most general and most flexible, LRU applied for block replacement, at least one more clock cycle of latency



Hardware implementation of a Content-Addressable Table





#### 3) Trade-off: SET ASSOCIATIVE CACHE

Cache blocks are partitioned into SETS, e.g. 4 blocks per set.

Let **NS** = number of cache sets.

Identifier of Cache Set where the Memory Block may reside:

#### SET = MB *mod* NS

Inside this Set, the Memory Block can be allocated anywhere: Associative Technique is applied to the blocks in each Set.

- Simple implementation (no Associative Memory).
- Performances (fault probability) similar to Fully Associative Cache, LRU applied to Set blocks, same latency in case of success.

#### **TYPICAL ARCHITECTURE:**

- Instruction Cache: Direct
- Data Cache: Set (Fully) Associative



to be submitted and discussed at Question Time

### **PRIMARY and SECONDARY CACHE**

- Assume that Secondary Cache resides on CPU chip
- Why distinguishing between Primary Cache (L1) and Secondary Cache (L2) ?
- *i.e.*, why not just one level of cache only, with larger capacity (sum of capacity of L1 and of L2) ?

As usually: give a convincing answer from a methodological viewpoint (cost model, opportunities, utilization policies, and so on), ... not because existing processors have both L1 and L2!

### Test 2:



### to be submitted and discussed at Question Time

#### **CACHING and I/O**

- Memory Mapped I/O: can CPU-caching be applied efficiently to information allocated in the I/O Memories?
- Evaluate the efficiency (or other parameters) with an I/O Bus structure
  - *hint:* for the sake of the cost model, assume that the I/O Bus adopts a RDY-ACK communication protocol with level-transition interfaces as for dedicated links (See: Firmware Prerequisites) although this assumption is not formally correct;
  - does the evaluation depend on the *reuse* opportunities of I/O-mapped information ? (give the answer wrt this specific question, not in general).
- Find possible I/O architectures (suitable interconnection structure and memory organization) able to exploit CPU-caching at best when Memory Mapped I/O is adopted.