

#### Recall: X86\_64: Four-level page table!



### From x86\_64 architecture specification



- All current x86 processor support a 64 bit operation
- 64-bit words (so ints are 8 bytes) but 48-bit addresses

Lec 16.3

3/12/24

Kubiatowicz CS162 © UCB Spring 2024

### Larger page sizes supported as well



#### • Larger page sizes (2MB, 1GB) make sense since memory is now cheap

- Great for kernel, large libraries, etc
- Use limited primarily by internal fragmentation...

| 3/12/24 | Kubiatowicz CS162 © UCB Spring 2024 | Lec 16.5 | 3/12/24 |
|---------|-------------------------------------|----------|---------|
|         |                                     |          |         |

## Address Translation Comparison

|                       | Advantages                                                   | Disadvantages                                                    |  |
|-----------------------|--------------------------------------------------------------|------------------------------------------------------------------|--|
| Simple Segmentation   | Fast context switching<br>(segment map maintained by<br>CPU) | External fragmentation                                           |  |
| Paging (Single-Level) | No external fragmentation<br>Fast and easy allocation        | Large table size (~ virtual<br>memory)<br>Internal fragmentation |  |
| Paged Segmentation    | Table size ~ # of pages in                                   | Multiple memory references per page access                       |  |
| Multi-Level Paging    | virtual memory<br>Fast and easy allocation                   |                                                                  |  |
| Inverted Page Table   | Table size ~ # of pages in physical memory                   | Hash function more complex<br>No cache locality of page<br>table |  |

### Recall: Alternative: Inverted Page Table

- With all previous examples ("Forward Page Tables")
  - Size of page table is at least as large as amount of virtual memory allocated to processes
  - Physical memory may be much less
    - » Much of process space may be out on disk or not in use



#### How is the Translation Accomplished?



- The MMU must attempt to translate virtual address to physical address on:
  - Every instruction fetch, Every load, Every store
  - Generate a "Page Fault" (Trap) if it encounters invalid PTE
    » Fault handler will decide what to do (more on this next lecture)
- · What does the MMU need to do to translate an address?
  - 1-level Page Table
    - » Read PTE from memory, check valid, merge address
    - » Set "accessed" bit in PTE, Set "dirty bit" on write
  - 2-level Page Table
  - » Read and check first level
  - » Read, check, and update PTE at second level
  - N-level Page Table …
- MMU does *page table Tree Traversal* to translate each address
  - Turns a potentially fast memory access into a slow multi-access table traversal...
- 3/12/24 Need CACHING!
- Kubiatowicz CS162 © UCB Spring 2024

#### Where and What is the MMU?



- The processor requests READ Virtual-Address to memory system - Through the MMU to the cache (to the memory)
- Some time later, the memory system responds with the data stored at the physical address (resulting from virtual  $\rightarrow$  physical) translation
- Fast on a cache hit, slow on a miss
- So what is the MMU doing?
- On every reference (I-fetch, Load, Store) read (multiple levels of) page table entries to get physical frame or FAULT
  - Through the caches to the memory
  - Then read/write the physical location

#### Recall: CS61c Caching Concept



- Cache: a repository for copies that can be accessed more guickly than the original
  - Make frequent case fast and infrequent case less dominant
- · Caching underlies many techniques used today to make computers fast
  - Can cache: memory locations, address translations, pages, file blocks, file names, network routes, etc...
- Only good if:
  - Frequent case frequent enough and
  - Infrequent case not too expensive
- Many important OS concepts boil down to caching! We cache:
  - Pages, Files, Virtual Memory Translations, IP Addresses...



## Recall: In Machine Structures (eq. 61C) ...



Another Major Reason to Deal with Caching



- Solution? Cache translations!
  - Translation Cache: TLB ("Translation Lookaside Buffer") Kubiatowicz CS162 © UCB Spring 2024







Read miss may require writeback of dirty data

Lec 16.23

#### How do we make Address Translation Fast?

- Cache results of recent translations !
  - Different from a traditional cache
  - Cache Page Table Entries using Virtual Page # as the key



#### Translation Look-Aside Buffer

- Record recent Virtual Page # to Physical Frame # translation
- If present, have the physical address without reading any page tables !!!
  - Even if the translation involved multiple levels
  - Caches the end-to-end result
- Was invented by Sir Maurice Wilkes prior to caches
  - When you come up with a new concept, you get to name it!
  - People realized "if it's good for page tables, why not the rest of the data in memory?"
- On a *TLB miss*, the page tables may be cached, so only go to memory when both miss

| Kubiatowicz CS162 © UCB Spring 2024 | Lec 16.25 | 3/12/24 | Kubiatowicz CS162 © UCB Spring 2024 | Lec 16.26 |
|-------------------------------------|-----------|---------|-------------------------------------|-----------|
|                                     |           |         |                                     |           |
|                                     |           |         |                                     |           |

### Caching Applied to Address Translation



- · Question is one of page locality: does it exist?
  - Instruction accesses spend a lot of time on same page (accesses are sequential)
  - Stack accesses have definite locality of reference
  - Data accesses have less page locality, but still some...
- Can we have a TLB hierarchy?
  - Sure: multiple levels at different sizes/speeds

```
Lec 16.27
```

## Physically-Indexed vs Virtually-Indexed Caches





### Making physically-indexed caches fast: Fit into Pipeline!

Example: MIPS R3000 Pipeline Dcd/ Reg Inst Fetch ALU / E.A Memory Write Rea I-Cache RF WB TLB Operation E.A. TLB D-Cache

TLB 64 entry, on-chip, fully associative, software TLB fault handler





Lec 16.31

### Further reducing translation time for physically-indexed caches

–10 –→ · As described, TLB lookup is in serial with V page no. offset cache lookup - Consequently, speed of TLB can impact TLB Lookup speed of access to cache V Access Rights PA Machines with TLBs go one step further: overlap TLB lookup with cache access - Works because offset available early - Offset in virtual address exactly covers the "cache index" and "byte select" - Thus can select the cached byte(s) in parallel to perform address translation virtual address: Virtual Page # Offset physical address: tag / page # index bvte 3/12/24





#### Recent Intel x86 (Skylake, Cascade Lake) **Precise Exceptions** Precise $\Rightarrow$ state of the machine is preserved as if program executed up to the offending instruction - All previous instructions completed - Offending instruction and all following instructions act as if they have not even started - Same system code will work on different implementations - Difficult in the presence of pipelining, out-of-order execution, ... - x86 takes this position Imprecise ⇒ system software has to figure out what is where and put it all back together Performance goals often lead designers to forsake precise interrupts - system software developers, user, markets etc. usually wish they had not done this MB 12 Modern techniques for out-of-order execution and branch prediction help 54B/cycle To L3 implement precise interrupts 3/12/24 Kubiatowicz CS162 © UCB Spring 2024 Lec 16.37 3/12/24 Lec 16.38 Kubiat

## Recent Example: Memory Hierarchy

- Caches (all 64 B line size)
  - L1 I-Cache: 32 <u>KiB</u>/core, 8-way set assoc.
  - L1 D Cache: 32 KiB/core, 8-way set assoc., 4-5 cycles load-to-use, Write-back policy
  - L2 Cache: 1 MiB/core, 16-way set assoc., Inclusive, Write-back policy, 14 cycles latency
  - L3 Cache: 1.375 MiB/core, 11-way set assoc., shared across cores, Non-inclusive victim cache, Write-back policy, 50-70 cycles latency
- TLB

#### - L1 ITLB, 128 entries; 8-way set assoc. for 4 KB pages

- $\,$  \* 8 entries per thread; fully associative, for 2 MiB / 4 MiB page
- L1 DTLB 64 entries; 4-way set associative for 4 KB pages
  - $\,$  » 32 entries; 4-way set associative, 2 MiB / 4 MiB page translations:
  - » 4 entries; 4-way associative, 1G page translations:
- L2 STLB: 1536 entries; 12-way set assoc. 4 KiB + 2 MiB pages
  - » 16 entries; 4-way set associative, 1 GiB page translations:

## What happens on a Context Switch?

- Need to do something, since TLBs map virtual addresses to physical addresses
  - Address Space just changed, so TLB entries no longer valid!
- Options?
  - Invalidate ("Flush") TLB: simple but might be expensive
    - » What if switching frequently between processes?
  - Include ProcessID in TLB
    - » This is an architectural solution: needs hardware
- What if translation tables change?
  - For example, to move page from memory to disk or vice versa...
  - Must invalidate TLB entry!
    - » Otherwise, might think that page is still in memory!
  - Called "TLB Consistency"
- Aside: with Virtually-Indexed, Virtually-Tagged cache, need to flush cache!
  - Everyone has their own version of the address "0" and can't distinguish them
  - This is one advantage of Virtually-Indexed, Physically-Tagged caches..



## Putting Everything Together: Cache



## Page Fault Handling

- The Virtual-to-Physical Translation fails
  - PTE marked invalid, Privilege Level Violation, Access violation, or does not exist
  - Causes an Fault / Trap
    - » Not an interrupt because synchronous to instruction execution
  - May occur on instruction fetch or data access
  - Protection violations typically terminate the process
- Other Page Faults engage operating system to fix the situation and retry the instruction
  - Allocate an additional stack page, or
  - Make the page accessible (Copy on Write),
  - Bring page in from secondary storage to memory demand paging
- Fundamental inversion of the hardware / software boundary
  - Need to execute software to allow hardware to proceed!

## **Demand Paging**

- Modern programs require a lot of physical memory – Memory per system growing faster than 25%-30%/year
- But they don't use all their memory all of the time
  90-10 rule: programs spend 90% of their time in 10% of their code
  Wasteful to require all of user's code to be in memory
  - wasterul to require all of user's code to be in memory
- Solution: use main memory as "cache" for disk





Lec 16.45

3/12/24

### Page Fault ⇒ Demand Paging



### Summary (1/2)

- The Principle of Locality:
  - Program likely to access a relatively small portion of the address space at any instant of time.
    - » Temporal Locality: Locality in Time
    - » Spatial Locality: Locality in Space
- Three (+1) Major Categories of Cache Misses:
  - Compulsory Misses: sad facts of life. Example: cold start misses.
  - Conflict Misses: increase cache size and/or associativity
  - Capacity Misses: increase cache size
  - Coherence Misses: Caused by external processors or I/O devices
- · Cache Organizations:
  - Direct Mapped: single block per set
  - Set associative: more than one block per set
  - Fully associative: all entries equivalent

# Summary (2/2)

- "Translation Lookaside Buffer" (TLB)
  - Small number of PTEs and optional process IDs (< 512)
  - Often Fully Associative (Since conflict misses expensive)
  - On TLB miss, page table must be traversed and if located PTE is invalid, cause Page Fault
  - On change in page table, TLB entries must be invalidated
- · Demand Paging: Treating the DRAM as a cache on disk
  - Page table tracks which pages are in memory
  - Any attempt to access a page that is not in memory generates a page fault, which causes OS to bring missing page into memory

Lec 16.47