#### **Prof Simon McIntosh-Smith**

University of Bristol

@simonmcs



# **Enabling Processor Design Space Exploration with SimEng**





# Some history

- Started my career at Inmos in Bristol in 1994
  - Transputers, Occam, ...





- Worked as an architect on "Chameleon" designing a SIMD instruction set for a dual-core, 64-bit, dual-issue, out-of-order CPU
- Very advanced workflow for the time
  - A single 'master' instruction set database drove everything
    - Documentation
    - Simulator
    - Compiler / assembler
    - Test / verification / ...





# Early design space exploration

- The electronic spec-led workflow enabled rapid CPU design space exploration
- We could change most parameters about the architecture and microarchitecture, and regenerate everything quickly to try rigorous experiments
  - From the ISA to the number and spec of execution units etc.
  - Size and structure of reservation stations, memory hierarchy, ...
- I rejoined academia in 2009 and wanted to try these kinds of experiments, but this wasn't as straightforward as I expected...





# Motivation – designing gas turbines 'in silico'

#### ASiMoV 5-year project with Rolls-Royce

Aiming to design new gas turbines completely in simulation

Many different kinds of physics need to be modelled simultaneously



1 trillion degrees of freedom

A commercial exascale problem



Electromagnetic

Thermo-mechanical



**Contact and Friction** 



**Computational Fluid Dynamics** 

Combustion





#### So what do we want to be able to do for ASiMoV?

Explore the design of an "optimal" processor for 5–10 years' time?

- Core level
  - OoO parameters, number and width of vector units, prefetch capability...
- Co-processor level
  - Accelerators for vector-matrix math, FFTs, ...
- Memory hierarchy level
- Network level
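
As a rough illustration of what a semi-automated sweep over the core-level parameters above might look like, here is a minimal Python sketch. The parameter names and values (`rob_entries`, `vector_units`, `vector_width_bits`, `prefetcher`) are hypothetical placeholders chosen for this example, not SimEng configuration options.

```python
from itertools import product

# Hypothetical parameter grid: names and values are illustrative only,
# not actual SimEng configuration options.
GRID = {
    "rob_entries": [64, 128, 256],
    "vector_units": [1, 2, 4],
    "vector_width_bits": [128, 256, 512],
    "prefetcher": [False, True],
}


def enumerate_configs(grid):
    """Yield one dict per point in the Cartesian product of the grid."""
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))


configs = list(enumerate_configs(GRID))
print(len(configs))  # 3 * 3 * 3 * 2 = 54 candidate core designs
```

Each resulting dictionary would be handed to one simulator run. Even this small grid produces 54 runs, which is why simulation speed matters so much for this kind of exploration.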





# To address these questions...

... we need a fast, easy to modify, accurate-enough simulator to support semi-automated design space exploration.

In theory, we could do this with gem5 or a number of other simulators.

But we found they didn't offer the specific combination of speed and accuracy we needed.

The "Simulation Engine" was born to investigate these issues...





# SimEng design goals

#### **Primary goals:**

- Fast: millions of OoO instructions per second on a single core
- Accurate: within 10–20% of hardware
- Easy to modify: days for a radically different processor model
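
The accuracy goal can be read as a bound on the relative cycle-count error against real hardware. A minimal sketch of that check, assuming made-up cycle counts (the numbers below are placeholders, not measurements, and the function names are invented for this example):

```python
def relative_error(simulated_cycles, hardware_cycles):
    """Absolute cycle-count error as a fraction of the hardware count."""
    return abs(simulated_cycles - hardware_cycles) / hardware_cycles


def meets_accuracy_goal(simulated_cycles, hardware_cycles, tolerance=0.20):
    """True if the simulated count is within `tolerance` of hardware."""
    return relative_error(simulated_cycles, hardware_cycles) <= tolerance


# Placeholder cycle counts, for illustration only.
hardware = 1_000_000
simulated = 1_150_000

error = relative_error(simulated, hardware)
print(f"{error:.1%}")
print(meets_accuracy_goal(simulated, hardware, tolerance=0.20))
print(meets_accuracy_goal(simulated, hardware, tolerance=0.10))
```

With these placeholder counts the error is 15%, which passes the looser 20% bound but fails the tighter 10% one.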

#### **Secondary goals:**

- Use existing frameworks where possible
  - CAPSTONE for instruction decode, SST for memory hierarchy / multicore
  - gem5-compatible tracing, checkpointing, ...











The ThunderX2 simulation was within 5–10% of the real hardware in **Isambard**























# **Current status (10 months in)**



- Targeting Armv8.1 initially, using CAPSTONE, which also supports x86, RISC-V, POWER, ...
  - Currently supports 230+ instructions, ~10% of the ISA
- Basic syscall emulation
  - Enough to handle libc startup routines in real binaries (compiled from C)
  - Basic printf support
  - malloc and file I/O in progress
- Current limitations:
  - Requires static binaries
  - Models up to the load/store units; we plan to plug in existing models for the memory hierarchy (SimEng includes its own infinite L1 cache model)
  - Single-core only





# **Early experiments**

- Running McCalpin's STREAM benchmark
  - Run a problem small enough to fit in L1D cache
  - Using an out-of-order/superscalar core model, parameterized for ThunderX2
  - The STREAM run takes ~10ms on a real ThunderX2 core
- SimEng running on an AMD Ryzen 7 2700 @ 4.0 GHz
  - OoO takes $\sim$26 seconds $\rightarrow$ 738 kHz / 1.84 MIPS
  - Atomic mode runs at around 6.4 MIPS
  - Cycle count error is 3.7% versus real ThunderX2 hardware
- gem5.fast (built from Arm's sve/beta1 branch, same AMD host CPU)
  - OoO takes  $\sim$ 105 seconds  $\rightarrow$  171 kHz / 0.45 MIPS (SimEng 4.3X / 4.1X)
  - Atomic mode runs at around 2.4 MIPS (SimEng ~2.7X)
  - Cycle count error is 9.1% versus real ThunderX2 hardware
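
The speedup factors quoted above follow directly from the reported simulation rates. A short Python check, using only the figures from this slide (the dictionary keys are labels invented for this example):

```python
# Simulation rates as reported on the slide.
simeng = {"ooo_khz": 738, "ooo_mips": 1.84, "atomic_mips": 6.4}
gem5 = {"ooo_khz": 171, "ooo_mips": 0.45, "atomic_mips": 2.4}

# Speedup of SimEng over gem5 for each metric.
speedups = {name: simeng[name] / gem5[name] for name in simeng}

for name, ratio in speedups.items():
    print(f"{name}: {ratio:.1f}X")
```

This prints 4.3X for simulated clock rate, 4.1X for OoO MIPS and 2.7X for atomic-mode MIPS, matching the ratios on the slide.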







# Stats about the project

- ~10,000 lines of simple, modern C++
  - ~3,000 lines are specific to Armv8 support
  - Another ~5,000 lines of test code across nearly 200 tests
- Includes a full Continuous Integration (CI) workflow
  - CircleCI, GoogleTest
- Supported host platforms include: Ubuntu, CentOS and macOS
- Will be released under a permissive LLVM-style license





# **Near-term plans**

- Continue building up instruction support
  - Will start trying different compilers and Fortran codes to help with this
- Tune model accuracy for a wider range of kernels
- Add SVE support (Arm's new length-agnostic vector instruction set)
- A64FX model (needs some additional work in pipeline)
- Plugin interface to enable extensibility
  - Prototype tracing functionality already implemented
- Aiming to share with select partners in coming weeks
  - Currently asking for some simple kernels so that we can add instruction support and check correctness
- Aiming for initial open-source release in 3–6 months
- Considering integration with SST to enable multi-core simulation









# **Example Processor Trace**



#### stanford.opt.trc

```
DEFINE TRACETYPE prouter
(REQ Pkt Request Send) circle yellow
(REQ Pkt Header Send) rightarrow
(REQ Pkt EOP Send) leftarrow
(RSP Pkt Request Send) circle yellow
(RSP Pkt Header Send) rightarrow blue
(RSP Pkt BOP Send) leftarrow blue
DEFINE TRACETYPE MIG
(Fetch first half) rightarrow red
(Fetch last half) leftarrow red
(Decode) circle blue
(Dispatch start) rightarrow white
(Dispatch taken) leftarrow white
(Execute start) rightarrow
(Execute end) leftarrow red
(Complete invalid)
(Complete valid) square green
(Retire) circle yellow
(Store commit) circle white
(Flush pipeline) diamond purple
TRACE prouter "packets on prouter"
TRACETERY tHE0
TRACE MIG pht1 "Mid core implementation of cpub"
TRACETERT
PROBE peb
```

Trace simulator from 1996. Written in Tcl/Tk






# **Acknowledgments**

- Key development team in Bristol:
  - Hal Jones, James Price, Andrei Poenaru, Jack Jones

- Funders:
  - EPSRC ASiMoV project (Advanced Simulation and Modelling of Virtual systems) - EP/S005072/1
  - Arm via a Centre of Excellence in HPC at University of Bristol





#### **Conclusions**

- Using SimEng to explore how fast we can make a microarchitecture-level simulator
  - Hope to provide useful input for the RE-gem5 project
- Also exploring how easy we can make major changes to a microarchitecture, to enable rapid design space exploration
- Early experiments suggest >4X speedup over gem5 is possible for a single-core OoO model of ThunderX2
- We now have a fast, fairly accurate, stand-alone, single-core model in O(10,000) lines of code – what else is this useful for?



