Simulation of Architectures for a Sustainable Computer Ecosystem

14<sup>th</sup> Aug 2024

Prof. Simon McIntosh-Smith

Director, Bristol Centre for Supercomputing (BriCS) and University of Bristol High Performance Computing Group

# Why do we care about sustainable computer architectures? – Our environment

The sectoral growth in US power demand

[Figure: bar chart of growth in US power demand by sector (Residential, Commercial, Industrial, Transportation, Data centers, Other, Total). Source: Goldman Sachs Research, EIA]

The demand for electricity is forecast to rise at a 2.4% CAGR between 2022 and 2030
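As a rough worked example of what a 2.4% CAGR implies, the sketch below compounds the rate over the eight years from 2022 to 2030. The helper name `compound_growth` is illustrative only, and the calculation assumes the rate applies uniformly every year.

```python
def compound_growth(cagr: float, years: int) -> float:
    """Return the total growth factor after compounding `cagr` for `years` years."""
    return (1.0 + cagr) ** years


# 2.4% CAGR from the slide, compounded over 2022 -> 2030 (8 years).
factor = compound_growth(0.024, 2030 - 2022)
print(f"Demand in 2030 relative to 2022: {factor:.3f}x")
```

Under these assumptions, demand in 2030 comes out at roughly 1.21 times the 2022 level.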

Source: https://www.goldmansachs.com/insights/articles/AI-poised-to-drive-160-increase-in-power-demand#

# Why do we care about sustainable computer architectures? – Our environment




#### NASA Data Shows July 22 Was Earth's Hottest Day on Record

Source: https://www.nasa.gov/earth/nasa-data-shows-july-22-was-earths-hottest-day-on-record/



# Why do we care about sustainable computer architectures? – Our environment



Our pledge to become a net zero carbon campus means making significant and rapid changes to reduce carbon emissions to the lowest amount – and offsetting as a last resort. We've set a target to reach net zero scope 1 and 2 carbon emissions from our buildings by 2030, and are committed to getting scope 3 emissions to net zero as soon as possible.

# Why do we care about sustainable computer architectures? – Performance



BLOG | JULY 12, 2021

#### Performance Per Watt is the New Moore's Law

The need to decarbonize compute for the sake of our planet means the technology roadmap can no longer prioritize processing power, says Rob Aitken

By Rob Aitken, Fellow & Director of Technology, Arm


Source: https://newsroom.arm.com/blog/performance-per-watt

# Why do we care about sustainable computer architectures? – Cost

#### Household electricity prices worldwide in December 2023 (in U.S. dollars per kilowatt-hour)

Source: https://www.statista.com/statistics/263492/electricity-prices-in-selected-countries/








## Isambard-AI



- >£300M investment by UK Government in AI capability
- Funding **~5,500 NVIDIA Grace-Hopper GPUs** in a new, <u>5MW</u> HPE modular data centre (MDC) facility in Bristol, UK
  - ~21 ExaFLOP/s of 8-bit performance for AI, ~250 PFLOP/s of 64-bit
  - In Top 10 in the world
- Extremely rapid deployment a key requirement:
  - First conversation with UK Government on Aug 18<sup>th</sup> 2023
  - >£200M procurement written in 1 week, run in just 2 weeks
  - Contract signed and ground broken in November 2023
  - Site chosen for power availability and rapid planning permission procedures (8 weeks)







Source: top500.org, June 2024 TOP500 list

**R**<sub>max</sub> and **R**<sub>peak</sub> values are in PFlop/s. For more details about other fields, check the TOP500 description.

**R**<sub>peak</sub> values are calculated using the advertised clock rate of the CPU. For the efficiency of the systems you should take into account the Turbo CPU clock rate where it applies.

#### Green500 Data

| Rank | TOP500 Rank | System | Cores | Rmax (PFlop/s) | Power (kW) | Energy Efficiency (GFlops/W) |
|------|-------------|--------|-------|----------------|------------|------------------------------|
| 1 | 189 | JEDI - BullSequana XH3000, Grace Hopper Superchip 72C 3GHz, NVIDIA GH200 Superchip, Quad-Rail NVIDIA InfiniBand NDR200, ParTec/EVIDEN. EuroHPC/FZJ, Germany | 19,584 | 4.50 | 67 | 72.733 |
| 2 | 128 | **Isambard-AI phase 1** - HPE Cray EX254n, NVIDIA Grace 72C 3.1GHz, NVIDIA GH200 Superchip, Slingshot-11, HPE. University of Bristol, United Kingdom | 34,272 | 7.42 | 117 | 68.835 |
| 3 | 55 | Helios GPU - HPE Cray EX254n, NVIDIA Grace 72C 3.1GHz, NVIDIA GH200 Superchip, Slingshot-11, HPE. Cyfronet, Poland | 89,760 | 19.14 | 317 | 66.948 |



#### Isambard-AI phase 2 system has now completed testing in HPE's factory in the Czech Republic



Factory visit on June 11<sup>th</sup> 2024. All 5,280 GPUs now built, in the Top 10 in the world

#### 5th Dec 2023 to 13th Aug 2024


#### Site Wednesday Aug 14<sup>th</sup> 2024


## Simulating energy efficient architectures

- Wanted to enable research into processor microarchitecture
- Needed a fast, accurate, easy to use simulator framework
- Have developed the "Simulation Engine", or SimEng
- First commit Dec 2018 (prototype work for 2 years before this)
- Initially focusing on the Arm ISA, now also RISC-V
- Being used in projects with partners including Arm, RIKEN (on FugakuNEXT) and SiPearl (on European processors)

# SimEng enables **exploration of the design space** for future processors:

- Core
  - OoO and processor width, configuration of vector and matrix units, cache prefetch capability, branch predictors...
- Co-processor
  - Accelerators for vector-matrix maths, FFTs, ...
- Memory hierarchy
  - Smart prefetchers, sparse access support, ...
- Network
  - Network-on-Chip, inter-chip etc.

## SimEng Design Goals



**<u>Fast</u>** – millions of Out-of-Order (OoO) instructions per second on a single core

**<u>Accurate</u>** – cycle counts typically within ~5-10% of real hardware

**<u>Easy to modify</u>** – just a few days to produce a radically different processor model

**<u>Use existing frameworks where possible</u>** – gem5-compatible tracing, checkpointing, ...
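One way to read the accuracy goal above is as a relative error between simulated and measured cycle counts. A minimal sketch follows; the helper name `cycle_error_pct` and the cycle counts are hypothetical and are not part of SimEng.

```python
def cycle_error_pct(simulated_cycles: int, hardware_cycles: int) -> float:
    """Absolute relative error of a simulated cycle count, as a percentage."""
    if hardware_cycles <= 0:
        raise ValueError("hardware_cycles must be positive")
    return abs(simulated_cycles - hardware_cycles) / hardware_cycles * 100.0


# Hypothetical benchmark: the simulator reports 1.07M cycles, hardware measures 1.00M.
error = cycle_error_pct(1_070_000, 1_000_000)
print(f"Cycle count error: {error:.1f}%")
```

A result in the 5-10% range would sit within the accuracy band quoted on the slide.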

## Currently Supported Features

- **AArch64** Armv9.2-a with SVE, SVE2 and SME extension support
  - Other extensions are supportable, but not yet targeted
  - ~1000 instruction variants supported with tests (~16% coverage)
  - SME2 support is imminent, extending the ISA from accelerating just GEMM to also GEMV, CONV, etc.
- **RISC-V** rv64imafdc (base, mul/div, atomic, SP-FP, DP-FP, compressed)
  - 98%+ coverage
- Executes statically linked Linux ELF binaries
  - Support for the most common system calls
- Single-thread OpenMP support
  - Multi-thread support running in prototype form




### SimEng-SST Integration



- Base SimEng supports an infinite L1D cache; integration with the Structural Simulation Toolkit (SST) allows us to increase simulation accuracy and scope
- SST developed by Sandia National Laboratories
- Allows for much more complex memory hierarchy simulations
  - e.g. L1->L2->L3->DRAM (or HBM)
  - Configurable bandwidths + link delays
- SST drives SimEng simulation + handles cache coherency
- Multi-core model support in prototype form (ran 128 cores in parallel)
- Investigating current prefetcher models to improve accuracy in SST (Next-n-block, stride, stream)
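To make two of the prefetcher families named above concrete, here is a toy sketch of a next-n-block and a stride prefetcher. It is illustrative only and is not the SST or SimEng implementation; the 64-byte block size, the names `next_n_block` and `StridePrefetcher`, and the "same stride seen twice" trigger are all assumptions.

```python
from typing import List, Optional

BLOCK_SIZE = 64  # bytes; assumed cache-line size


def next_n_block(addr: int, n: int, block_size: int = BLOCK_SIZE) -> List[int]:
    """Return the base addresses of the n blocks following the block containing addr."""
    base = (addr // block_size) * block_size
    return [base + i * block_size for i in range(1, n + 1)]


class StridePrefetcher:
    """Predict the next address once the same non-zero stride is seen twice in a row."""

    def __init__(self) -> None:
        self.last_addr: Optional[int] = None
        self.last_stride: Optional[int] = None

    def access(self, addr: int) -> Optional[int]:
        """Record an access and return an address to prefetch, or None."""
        prediction = None
        if self.last_addr is not None:
            stride = addr - self.last_addr
            if stride != 0 and stride == self.last_stride:
                prediction = addr + stride
            self.last_stride = stride
        self.last_addr = addr
        return prediction


# A regular 256-byte stride: predictions start on the third access.
prefetcher = StridePrefetcher()
predictions = [prefetcher.access(a) for a in (0, 256, 512, 768)]
print(predictions)
print([hex(a) for a in next_n_block(0x1008, 2)])
```

A stream prefetcher would extend the same idea by tracking several independent access streams at once; that is left out here to keep the sketch short.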

## Core models already implemented in SimEng

- Marvell ThunderX2, Armv8.1
- Fujitsu A64FX, Armv8.3
- Apple M1 Firestorm, Armv8.4
- Arm Neoverse V2 (NVIDIA Grace, Graviton 4), Armv9.0
- RISC-V (generic OoO 4-way superscalar), rv64imafdc




Source: https://chipsandcheese.com/2023/09/11/hot-chips-2023-arms-neoverse-v2/

### Some of the experiments enabled by SimEng

- Micro-architecture design studies
  - Execute pipeline length / width
  - Instruction splitting / merging
  - Reservation station configuration
  - OoO resource allocation (ROB, Phys. Registers, LSQ)
  - Number of vector units
  - Matrix engine composition
- Different branch prediction algorithms
- ISA comparisons across otherwise identical cores
- Cache prefetcher algorithms\*
- TLB walks\*
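As an illustration of the kind of branch prediction study listed above, the sketch below implements a classic table of 2-bit saturating counters and measures its misprediction rate on a synthetic trace. It is a textbook scheme chosen for brevity, not a description of SimEng's predictors; the class name, table size and trace are assumptions.

```python
from typing import Iterable, List, Tuple


class TwoBitPredictor:
    """Table of 2-bit saturating counters indexed by low bits of the branch PC."""

    def __init__(self, table_bits: int = 10) -> None:
        self.mask = (1 << table_bits) - 1
        # Counter values 0-1 predict not-taken, 2-3 predict taken; start weakly not-taken.
        self.counters: List[int] = [1] * (1 << table_bits)

    def _index(self, pc: int) -> int:
        # Drop the two low bits, assuming 4-byte aligned instructions.
        return (pc >> 2) & self.mask

    def predict(self, pc: int) -> bool:
        return self.counters[self._index(pc)] >= 2

    def update(self, pc: int, taken: bool) -> None:
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


def misprediction_rate(predictor: TwoBitPredictor,
                       trace: Iterable[Tuple[int, bool]]) -> float:
    """Replay (pc, taken) pairs, predicting before each update, and return the miss rate."""
    misses = 0
    total = 0
    for pc, taken in trace:
        if predictor.predict(pc) != taken:
            misses += 1
        predictor.update(pc, taken)
        total += 1
    return misses / total if total else 0.0


# Synthetic loop branch: taken 9 times then not taken once, repeated 10 times.
loop_trace = [(0x400100, i % 10 != 9) for i in range(100)]
rate = misprediction_rate(TwoBitPredictor(), loop_trace)
print(f"Misprediction rate: {rate:.2%}")
```

Swapping in a different predictor class with the same `predict`/`update` interface is enough to compare algorithms on the same trace.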







- Prof. Simon McIntosh-Smith, PI
- Mr Dan Weaver
- Mr Jack Jones, Lead Developer
- Mr Alex Cockrean
- Mr Finn Wilkinson
- Mr Joseph Moore

### Useful Links + Contacts



- Simon McIntosh-Smith : <u>s.mcintosh-smith@bristol.ac.uk</u>
- Jack Jones : <u>jj16791@bristol.ac.uk</u>
- SimEng Repository : <u>https://github.com/UoB-HPC/SimEng</u>
- SimEng Documentation : <u>https://uob-hpc.github.io/SimEng/</u>
- PMBS-22 SME Evaluation Paper : <u>https://ieeexplore.ieee.org/document/10024029</u>
- ModSim-23 SME Evaluation Poster : <u>https://uob-hpc.github.io/SimEng/\_downloads/modsim23\_poster.pdf</u>
- Second International Workshop on RISC-V for HPC, RISC-V vs AArch64 Comparison : <u>https://dl.acm.org/doi/abs/10.1145/3624062.3624233</u>