Bruce Jacob

University of Maryland

SLIDE I



# Faster and Accurater: The Future of Memory-System Modeling and Simulation

**Bruce Jacob** (with Ph.D. results of Shang Li)

Keystone Professor, University of Maryland






# We can **get up here** (e.g., via prediction)

[Figure: simulation accuracy across simulator classes, including cycle-accurate and HDL models.]



# Background

Bank Precharge: tRP = 15ns

Not only Faster but Accurater, too




**Row Activate (15ns)** and Data Restore (another 22ns): tRCD = 15ns, tRAS = 37.5ns



Cost of access is high; requires significant effort to amortize this over the (increasingly short) payoff.
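The cost mentioned above can be made concrete by adding up the quoted timing parameters. The following is a minimal sketch that uses only the values on the slides (tRP, tRCD, tRAS) and the standard relation tRC = tRAS + tRP; the variable names are illustrative and not taken from the talk.

```python
# Timing parameters quoted on the slides, in nanoseconds.
T_RP = 15.0    # bank precharge
T_RCD = 15.0   # row activate (row-to-column delay)
T_RAS = 37.5   # row active time: activate plus data restore

# Data restore is the part of tRAS that remains after the activate.
data_restore = T_RAS - T_RCD

# Row cycle time: minimum spacing between activates to the same bank.
t_rc = T_RAS + T_RP

print(f"data restore = {data_restore} ns")
print(f"tRC = {t_rc} ns")
```

With these values the data restore comes to 22.5 ns, which the slide rounds to 22 ns, and a full activate-restore-precharge cycle on one bank takes 52.5 ns.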



[Figure: the CPU/cache (CPU/\$) sends outgoing bus requests to the memory controller (MC), whose queue holds Read A, Read B, Write X (data), Read Z, Write Q (data), Write A (data), Read W, Read Z, and Read Y.]


# Faster?

Simulation Speed

- Simulation speed: 100X faster
- Error: < 20%
- 10s of cores simulated on 10s of cores



[Figure: simulation-speed comparison including Marss-O3 and Gem5-O3.]





### Faster?

### Easily Predictable Result:

Memory-system simulation is now the limiting factor.


# **Even Faster via Prediction**

# Statistical DRAM Model: Proposed Approach

### Turning DRAM timing simulation into a classification problem

| Clock | Address    | OP    |
|-------|------------|-------|
| 0     | 0x01230000 | READ  |
| 12    | 0x01230020 | READ  |
| 40    | 0x0123003C | READ  |
| 65    | 0x06340000 | WRITE |

Classification assigns each request a class, and latency recovery maps that class to a latency:

| Class    | Latency |
|----------|---------|
| Idle     | 36      |
| Row-Hit  | 22      |
| Row-Hit  | 22      |
| Row-Miss | 56      |
| ...      | ...     |
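The two-step flow in the tables (request to class, class to latency) can be sketched as follows. This is only an illustration: a hand-written rule stands in for the trained model, the address mapping (a single bank, with the row taken from the bits above bit 16) is an assumption chosen so that the example trace reproduces the table, and only the latency values come from the slide.

```python
# Latency per class, taken from the example table on the slide.
LATENCY = {"Idle": 36, "Row-Hit": 22, "Row-Miss": 56}

ROW_SHIFT = 16  # ASSUMPTION: the row is the address above the low 16 bits


def classify(trace):
    """Label each (clock, address, op) request with a class name.

    A hand-written rule stands in for the trained model here; it only
    tracks the currently open row of a single hypothetical bank.
    """
    open_row = None
    classes = []
    for _clock, address, _op in trace:
        row = address >> ROW_SHIFT
        if open_row is None:
            classes.append("Idle")
        elif row == open_row:
            classes.append("Row-Hit")
        else:
            classes.append("Row-Miss")
        open_row = row
    return classes


trace = [
    (0, 0x01230000, "READ"),
    (12, 0x01230020, "READ"),
    (40, 0x0123003C, "READ"),
    (65, 0x06340000, "WRITE"),
]
classes = classify(trace)
latencies = [LATENCY[c] for c in classes]  # latency recovery step
for request, cls, lat in zip(trace, classes, latencies):
    print(request, cls, lat)
```

The point of the split is that the model only has to pick one of a few classes; turning a class into a cycle count is then a table lookup.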








**Training Process** 

### Training: Supervised Learning





Training set: ~7000 requests, with various access patterns and inter-arrival timings to cover all kinds of workloads.
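One way to build such a training set is to generate short synthetic traces that mix access patterns and arrival rates. The sketch below is an assumption about how this could look, not the procedure used in the talk: the pattern names, the 64-byte line size, the 30% write ratio, and the gap values are all hypothetical. In a supervised setup the labels would then come from running these traces through a cycle-accurate simulator, which is not shown here.

```python
import random


def make_trace(n_requests, pattern, mean_gap, seed=0):
    """Generate (clock, address, op) tuples for one synthetic workload.

    pattern: "stream" (sequential 64-byte lines) or "random".
    mean_gap: roughly the average number of cycles between requests.
    All names and parameter choices here are hypothetical.
    """
    rng = random.Random(seed)
    clock, address, trace = 0, 0, []
    for _ in range(n_requests):
        if pattern == "stream":
            address += 64
        else:
            address = rng.randrange(0, 1 << 30, 64)
        op = "WRITE" if rng.random() < 0.3 else "READ"
        trace.append((clock, address, op))
        clock += rng.randint(1, 2 * mean_gap)  # vary inter-arrival time
    return trace


# Mix patterns and arrival rates; 7 x 1000 matches the ~7000 on the slide.
configs = [("stream", 4), ("stream", 40), ("stream", 400),
           ("random", 4), ("random", 40), ("random", 400), ("random", 4000)]
training_traces = [make_trace(1000, pattern, gap, seed=i)
                   for i, (pattern, gap) in enumerate(configs)]

print(sum(len(t) for t in training_traces))
```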




# Models (performed the same)

### Models: Decision Tree & Random Forest
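For readers unfamiliar with decision trees, here is a minimal pure-Python sketch of one: it greedily splits on the feature and threshold with the lowest Gini impurity. It is a generic textbook construction, not the model from the talk, and the toy features (including the hypothetical `bank_was_idle`) are chosen only to make the example small. A random forest would train several such trees on resampled data and take a majority vote.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def build_tree(rows, labels, depth=0, max_depth=4):
    """Recursively split on the (feature, threshold) with lowest impurity."""
    majority = Counter(labels).most_common(1)[0][0]
    if depth == max_depth or len(set(labels)) == 1:
        return majority  # leaf: predict the most common class
    best = None
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            left = [l for r, l in zip(rows, labels) if r[f] <= t]
            right = [l for r, l in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left)
                     + len(right) * gini(right)) / len(rows)
            if best is None or score < best[0]:
                best = (score, f, t)
    if best is None:
        return majority  # no split separates the rows
    _, f, t = best
    lo = [(r, l) for r, l in zip(rows, labels) if r[f] <= t]
    hi = [(r, l) for r, l in zip(rows, labels) if r[f] > t]
    return (f, t,
            build_tree([r for r, _ in lo], [l for _, l in lo],
                       depth + 1, max_depth),
            build_tree([r for r, _ in hi], [l for _, l in hi],
                       depth + 1, max_depth))


def predict(tree, row):
    """Walk from the root to a leaf and return its class label."""
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if row[f] <= t else right
    return tree


# Toy data. Features: (bank_was_idle, same_row_last); labels are classes.
X = [(1, 0), (0, 1), (0, 1), (0, 0), (1, 0), (0, 0)]
y = ["Idle", "Row-Hit", "Row-Hit", "Row-Miss", "Idle", "Row-Miss"]
tree = build_tree(X, y)
print([predict(tree, row) for row in X])
```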












# Results: <u>Way</u> Faster






Classification Accuracy

Average Latency Accuracy





but Accurater, too
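The two metrics named on this slide could be computed along the following lines. The slides do not define them, so both definitions here are assumptions: classification accuracy as the fraction of requests assigned the correct class, and average latency accuracy as one minus the relative error of the mean latency. The sample data is made up.

```python
def classification_accuracy(predicted, actual):
    """Fraction of requests whose predicted class matches the reference."""
    matches = sum(p == a for p, a in zip(predicted, actual))
    return matches / len(actual)


def average_latency_accuracy(predicted_lat, actual_lat):
    """1 minus the relative error of the mean latency (assumed definition)."""
    mean_pred = sum(predicted_lat) / len(predicted_lat)
    mean_act = sum(actual_lat) / len(actual_lat)
    return 1.0 - abs(mean_pred - mean_act) / mean_act


# Made-up example: one Row-Hit is misclassified as a Row-Miss.
pred_cls = ["Idle", "Row-Hit", "Row-Miss", "Row-Miss"]
true_cls = ["Idle", "Row-Hit", "Row-Hit", "Row-Miss"]
pred_lat = [36, 22, 56, 56]
true_lat = [36, 22, 22, 56]

print(classification_accuracy(pred_cls, true_cls))
print(average_latency_accuracy(pred_lat, true_lat))
```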






[Figure: timeline of the ZSim 2-phase memory model compared with real hardware or a cycle-accurate model.]

Three back-to-back memory requests (0, 1, 2) are issued to the memory model.

The first phase of the memory access aggressively schedules requests for performance; the second phase fails to take dependence information into account.



# But Wait — <u>and</u> Accurat<u>er</u>?

## What Programmers WANT: (and if you can do it $\rightarrow$ accurate, parallel sims)

![](_page_15_Picture_8.jpeg)

    if (INSTR.isMemOp) {
        if (L1_cache_miss(INSTR.dAddr)) {
            if (L2_cache_miss(INSTR.dAddr)) {
                INSTR.valid = now + DRAM_request(INSTR.dAddr);
            }
        }
    }


# Prediction gives it to them

![](_page_16_Picture_9.jpeg)

# **Bottom Line:** The Future

Large parallel simulations enabled, wherein each CPU model can have its own memory-system predictor to provide estimates of main memory-system latency.

None of the memory models needs to interact with the others to provide its predictions.

Moreover, the CPU models can be written in a FAR simpler way than they are now, making them faster and less likely to contain "gotcha" assumptions.
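The structure described above, with one private predictor per CPU model and no shared state, can be sketched as follows. This is an illustration under stated assumptions: the `LatencyPredictor` class and `run_core` function are hypothetical names, a fixed rule stands in for the trained model, every access is assumed to miss both caches, and the core is assumed to block on each request. The latency values are the ones from the example table.

```python
class LatencyPredictor:
    """Stand-in predictor: a fixed rule, not the trained model."""

    def __init__(self):
        self.open_row = None  # private state; nothing shared across cores

    def dram_request(self, address):
        row = address >> 16  # ASSUMPTION: hypothetical row mapping
        if self.open_row is None:
            latency = 36       # idle bank (value from the example table)
        elif row == self.open_row:
            latency = 22       # row hit
        else:
            latency = 56       # row miss
        self.open_row = row
        return latency


def run_core(addresses):
    """Simple CPU model: each memory op becomes valid at now + latency."""
    predictor = LatencyPredictor()  # one predictor per core
    now, valid_times = 0, []
    for address in addresses:
        valid = now + predictor.dram_request(address)
        valid_times.append(valid)
        now = valid  # simplification: the core blocks on every request
    return valid_times


# Two cores with independent address streams and independent predictors.
core_streams = [
    [0x01230000, 0x01230020, 0x06340000],
    [0x0A000000, 0x0A000040],
]
results = [run_core(stream) for stream in core_streams]
print(results)
```

Because `run_core` touches no shared state, the per-core calls could be dispatched to separate worker processes (for example with `concurrent.futures.ProcessPoolExecutor`) without changing the results.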


# Bottom Line (scalability)

![](_page_17_Figure_9.jpeg)

![](_page_17_Picture_10.jpeg)

# **Shameless** Plug

![](_page_18_Picture_1.jpeg)

# Washington DC Sep 30 – Oct 3, 2019


# **MEMSYS 2018**

The International Symposium on Memory Systems \* October 1–4, Washington DC

| Keynote Addresses |                             |
|-------------------|-----------------------------|
| Hardware Keynote: | Steve Wallach, Micron       |
| Software Keynote: | Brian Barrett, Amazon       |
| Postamble:        | J. Thomas Pawlowski, Micron |

**Panelists** 

# WWW.memsys.io

- Zeshan Chishti, Intel
- Zhaoxia (Summer) Deng, Facebook
- Chen Ding, U. Rochester
- David Donofrio, Berkeley Lab
- Dietmar Fey, FAU Erlangen-Nürnberg
- Maya Gokhale, LLNL
- Xiaochen Guo, Lehigh U.
- Manish Gupta, NVIDIA
- Fazal Hameed, TU Dresden
- Matthias Jung, Fraunhofer IESE
- Kurt Keville, MIT
- Hyesoon Kim, Georgia Tech
- Scott Lloyd, LLNL
- Sally A. McKee, Clemson
- Moinuddin Qureshi, Georgia Tech
- Petar Radojkovic, BSC
- Arun Rodrigues, Sandia National Labs
- Robert Voigt, Northrop Grumman
- Gwendolyn Voskuilen, Sandia
- David T. Wang, Samsung
- Vincent Weaver, U. Maine
- Norbert Wehn, U. Kaiserslautern
- Yuan Xie, UC Santa Barbara
- Ke Zhang, Chinese Acad. of Sciences
- Xiaodong Zhang, Ohio State
- Jishen Zhao, UC San Diego


Memory-device manufacturing, memory-architecture design, and the use of memory technologies by application software all profoundly impact today's and tomorrow's computing systems, in terms of their performance, function, reliability, predictability, power dissipation, and cost. Existing memory technologies are seen as limiting in terms of power, capacity, and bandwidth. Emerging memory technologies offer the potential to overcome both technology- and design-related limitations to answer the requirements of many different applications. Our goal is to bring together researchers, practitioners, and others interested in this exciting and rapidly evolving field, to update each other on the latest state of the art, to exchange ideas, and to discuss future challenges.

### **Conference Schedule and Venue**

The conference will be held at the Gaylord National Resort & Convention Center at National Harbor, Maryland. An opening reception will be held on Monday evening, followed by 2 1/2 days of technical presentations (full days on Tuesday and Wednesday, and a half-length technical day on Thursday), with the Awards Luncheon on Tuesday afternoon and the Conference Dinner on Wednesday evening. A discounted room block is still available on the registration site, with only a few rooms left.

### Tracks and Topics

The following topics will be presented over the 3-day conference:

- Memory-system design from both hardware and software perspectives
- Memory failure modes and mitigation strategies
- Memory-system resilience, especially at large scale
- Memory and system security issues
- Operating system design for hybrid/nonvolatile memories
- Technologies like flash, DRAM, STT-MRAM, 3DXP, memristors, etc.
- Memory-centric programming models, languages, optimization
- Compute-in-memory and compute-near-memory technologies
- Large-scale data movement: networks, hardware, software, mitigation
- Virtual memory redesign for unifying storage/memory/accelerators
- Algorithmic & software memory-management techniques
- Emerging memory technologies, both hardware and software, including memory-related blockchain applications
- Interference at the memory level across datacenter applications
- Issues in the design and operation of large-memory machines
- In-memory databases and NoSQL stores
- Post-CMOS scaling efforts and memory technologies to support them, including cryogenic, neural, quantum, and heterogeneous memories

The conference focuses on these and other related topics.

### **Publications** & **Presentations**

All accepted papers will be published in the ACM & IEEE Digital Libraries. Our primary goal is to showcase interesting ideas that will spark conversation between disparate groups, to get applications people, operating systems people, system architecture people, interconnect people, and circuits people to talk to each other. Thus, we try to showcase interesting ideas in a format that will facilitate this. The talks are short, to encourage participation and discussion. Every evening we host a panel discussion of invited speakers, with beer, wine, and hot hors d'oeuvres.


![](_page_18_Picture_40.jpeg)

![](_page_18_Picture_41.jpeg)


![](_page_18_Figure_42.jpeg)

![](_page_18_Picture_44.jpeg)


![](_page_19_Picture_4.jpeg)

# **Bruce Jacob** blj@umd.edu www.ece.umd.edu/~blj

![](_page_19_Picture_7.jpeg)

![](_page_19_Picture_8.jpeg)


![](_page_20_Picture_5.jpeg)

**Backup Slides** 

![](_page_20_Picture_7.jpeg)

### Nomenclature


![](_page_21_Figure_5.jpeg)

![](_page_22_Figure_0.jpeg)


# Features Extracted

| Feature         | Values | Description                                                              | Intuition                                                          |
|-----------------|--------|--------------------------------------------------------------------------|--------------------------------------------------------------------|
| same-row-last   | 0/1    | whether the last request to the same bank has the same row as this one   | key factor for the most recent bank state                          |
| is-last-recent  | 0/1    | whether the last request to the same bank was added recently (tRC)       | relevancy of the last request to the same bank                     |
| is-last-far     | 0/1    | whether the last request to the same bank was added long ago (tRFC)      | relevancy of the last request to the same bank                     |
| op              | 0/1    | operation (read/write)                                                   | for potential R/W scheduling                                       |
| last-op         | 0/1    | operation of the last request to the same bank                           | for potential R/W scheduling                                       |
| ref-after-last  | 0/1    | whether there has been a refresh since the last request to the same bank | a refresh resets the bank to idle                                  |
| near-ref        | 0/1    | whether this cycle is near a refresh cycle                               | latency can be very high near a refresh                            |
| same-row-prev   | int    | number of previous requests with the same row to the same bank           | if there is a same-row request, OOO (out-of-order) may be possible |
| num-recent-bank | int    | number of requests added recently to the same bank                       | contention/queuing in the bank                                     |
| num-recent-rank | int    | number of requests added recently to the same rank                       | contention                                                         |
| num-recent-all  | int    | number of requests added recently to all ranks                           | contention                                                         |
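A few of the features in the table can be extracted from a request trace as sketched below. This covers only a subset (same-row-last, is-last-recent, op, last-op, num-recent-bank) and rests on assumptions that are not in the slides: the address mapping in `decode`, the 50-cycle `RECENT_WINDOW` standing in for tRC, and the encoding of op as 0 for read and 1 for write.

```python
from collections import defaultdict, deque

RECENT_WINDOW = 50  # ASSUMPTION: stand-in for tRC, in cycles


def decode(address):
    """Hypothetical address mapping: 8 banks, row above bit 16."""
    return (address >> 13) & 0x7, address >> 16


def extract_features(trace):
    """Return one feature dict per (clock, address, op) request."""
    last = {}                    # bank -> (clock, row, op) of last request
    recent = defaultdict(deque)  # bank -> clocks of recent requests
    features = []
    for clock, address, op in trace:
        bank, row = decode(address)
        window = recent[bank]
        # Drop requests that have fallen out of the "recent" window.
        while window and clock - window[0] > RECENT_WINDOW:
            window.popleft()
        prev = last.get(bank)
        features.append({
            "same-row-last": int(prev is not None and prev[1] == row),
            "is-last-recent": int(prev is not None
                                  and clock - prev[0] <= RECENT_WINDOW),
            "op": int(op == "WRITE"),  # ASSUMPTION: 0 = read, 1 = write
            "last-op": int(prev is not None and prev[2] == "WRITE"),
            "num-recent-bank": len(window),
        })
        last[bank] = (clock, row, op)
        window.append(clock)
    return features


trace = [
    (0, 0x01230000, "READ"),
    (12, 0x01230020, "READ"),
    (40, 0x0123003C, "READ"),
    (65, 0x06340000, "WRITE"),
]
feats = extract_features(trace)
for f in feats:
    print(f)
```

Each feature depends only on the history of requests seen so far, so the extraction can run in step with the CPU model as requests arrive.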

![](_page_23_Picture_6.jpeg)