# Challenges in AI Infrastructure for Enterprise Foundation Models

Jeffrey L. Burns, Ph.D., Director, AI Compute and IBM Research AI Hardware Center, IBM Research

August 9, 2023



# Foundation Models: An inflection point in generalizable and adaptable representations



### **Expert Systems**

Hand-crafted symbolic representations

### **Machine Learning**

Task-specific hand-crafted feature representations

### **Deep Learning**

Task-specific learnt feature representations

### **Foundation Models**

Generalizable & adaptable learnt representations

# Incredible opportunities around enterprise applications





In each of these domains, enterprises have ample unlabeled data that can be used to train custom foundation models. This potentially opens the door to solving business problems that were previously considered intractable.

# **Geospatial Foundation Models**

IBM and NASA have teamed up to apply **foundation model AI technology** to leverage earth science data for **geospatial intelligence**.



This work with NASA is part of an effort across IBM Research to pioneer applications of foundation models beyond language.

https://www.earthdata.nasa.gov/news/impact-ibm-hls-foundation-model

IBM Research AI Hardware Center / © 2023 IBM Corporation

**Pre-trained** on sufficient datasets in partnership with content-rich institutions (e.g. NASA)

Leverage **self-supervised learning** (e.g., masking imagery or time series)

Able to effectively complete **multiple downstream tasks** while meeting accuracy baselines (e.g., flood mapping, land cover classification, outage prediction)

Note: while the transformer architecture is the most prevalent in foundation models, the definition is not restricted to any particular model architecture.
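To illustrate the self-supervised masking idea above, the sketch below hides a random subset of patches in an image tile. During pre-training, a model would be asked to reconstruct the hidden patches from the visible ones. The helper name `random_patch_mask`, the 16-pixel patch size, and the 75% mask ratio are illustrative assumptions in the style of masked autoencoders. They are not details of the IBM/NASA model.

```python
import numpy as np

def random_patch_mask(image, patch=16, mask_ratio=0.75, seed=0):
    """Hypothetical helper: hide a random fraction of non-overlapping
    patch x patch tiles of a 2-D image (masked-autoencoder-style masking)."""
    h, w = image.shape
    if h % patch or w % patch:
        raise ValueError("image height and width must be multiples of patch")
    gh, gw = h // patch, w // patch               # patch grid dimensions
    n_patches = gh * gw
    n_masked = int(round(mask_ratio * n_patches))

    # Choose which patches to hide, without replacement
    rng = np.random.default_rng(seed)
    hidden = rng.choice(n_patches, size=n_masked, replace=False)

    # Patch-level mask: True = hidden from the encoder (reconstruction target)
    grid = np.zeros(n_patches, dtype=bool)
    grid[hidden] = True
    grid = grid.reshape(gh, gw)

    # Expand to pixel resolution and zero out the hidden pixels
    pixel_mask = np.repeat(np.repeat(grid, patch, axis=0), patch, axis=1)
    masked_image = np.where(pixel_mask, 0.0, image)
    return masked_image, grid, pixel_mask

# Stand-in for one 224 x 224 band of a satellite image tile
tile = np.random.default_rng(1).random((224, 224))
masked, grid, pixel_mask = random_patch_mask(tile)
print(grid.shape, int(grid.sum()))  # (14, 14) 147
```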

# The flip side

"So, we think it's fair to say that, right now, access to compute resources — at the lowest total cost has become a determining factor for the success of AI companies."



Source: Guido Appenzeller, Matt Bornstein, and Martin Casado, "Navigating the High Cost of AI Compute," a16z analysis, April 2023

### Optimizing the infrastructure for Foundation Models across the whole AI workflow



# Building the FM technology stack

Middleware that simplifies the end-to-end AI workflow and optimizes use of the underlying infrastructure

Platform that delivers portability and abstracts away infrastructure complexity

World-class infrastructure for training, tuning and serving foundation models (on-prem and in the cloud)




## AI-optimized infrastructure

### Training: Vela

Cloud-native design for large-scale distributed model training



### Inference: IBM AIU

Designed for energy-efficient AI compute at reduced precision

| Precision / sparsity | BERT-base F1 (%) ↑ | Wav2vec2.0 WER (%) ↓ | ViT accuracy (%) ↑ |
|----------------------|--------------------|----------------------|--------------------|
| FP32                 | 88.69              | 4.20                 | 84.12              |
| INT8                 | 88.35 (-0.34)      | 3.85 (+0.35)         | 82.47 (+0.35)      |
| INT8 + 50% sparsity  | 87.70 (-0.99)      | 4.21 (-0.01)         | 84.03 (-0.09)      |
| INT4                 | 87.86 (-0.83)      | 4.53 (-0.33)         | 83.49 (-0.63)      |
| INT4 + 50% sparsity  | 87.07 (-1.62)      | 4.65 (-0.45)         | 83.60 (-0.52)      |

N. Wang et al., NeurIPS 2022
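The parenthesized values in the accuracy table are differences from the FP32 baseline. They are signed so that a positive number means the reduced-precision model is better. For F1 and accuracy this is (value - FP32). For WER, where lower is better, it is (FP32 - value). The small sketch below applies that convention and checks it against the BERT-base and Wav2vec2.0 columns. The helper name `delta_vs_fp32` is our own.

```python
# FP32 baselines from the table (BERT-base F1 %, Wav2vec2.0 WER %)
FP32 = {"BERT-base F1": 88.69, "Wav2vec2.0 WER": 4.20}

# +1 where higher is better (F1), -1 where lower is better (WER)
DIRECTION = {"BERT-base F1": +1, "Wav2vec2.0 WER": -1}

def delta_vs_fp32(metric, value):
    """Signed change vs. FP32, rounded to 2 decimals; positive = better."""
    return round(DIRECTION[metric] * (value - FP32[metric]), 2)

rows = {
    "INT8": {"BERT-base F1": 88.35, "Wav2vec2.0 WER": 3.85},
    "INT4": {"BERT-base F1": 87.86, "Wav2vec2.0 WER": 4.53},
}
for name, metrics in rows.items():
    print(name, {m: delta_vs_fp32(m, v) for m, v in metrics.items()})
```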





https://research.ibm.com/blog/ibm-artificial-intelligence-unit-aiu


### IBM Research AIU background



### zAIU overview:

- One Gen-3 AI core, integrated in the z16 processor chip
- Off-loads AI tasks from the 8 CPU cores
- Optimized for in-transaction AI inferencing
- Seamless integration into the IBM Z software stack

### AIU overview:

- Complete AI accelerator, plugs into a standard PCIe slot
- 32 Gen-3 AI cores
- Optimized for AI inferencing; also supports all operations needed for fine-tuning and training
- Designed to ease cloud integration, enabled in Red Hat stack
- Support for all common neural network types

# IBM Artificial Intelligence Unit (AIU)

The SoC implements IBM's leadership innovations in low-precision AI arithmetic and algorithms

- Chip architecture optimized for enterprise AI workloads, including foundation models
- Enabled in the Red Hat and Foundation Models software stacks
- Supports multi-precision inference (and some training): FP16, FP8, INT8, INT4, INT2
- Implemented in leading-edge 5 nm technology
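To make "low-precision AI arithmetic" concrete, the sketch below applies symmetric, per-tensor uniform quantization to a random weight matrix at the integer widths listed above and reports the reconstruction error. This is a generic textbook scheme chosen for illustration. It does not describe the AIU's actual number formats, scaling, or calibration, and the function name `fake_quantize` is hypothetical.

```python
import numpy as np

def fake_quantize(x, bits):
    """Hypothetical helper: symmetric per-tensor quantization to signed
    `bits`-bit integers, then dequantization back to float."""
    qmax = 2 ** (bits - 1) - 1              # 127 for INT8, 7 for INT4, 1 for INT2
    scale = np.max(np.abs(x)) / qmax        # map the largest magnitude to qmax
    q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int8)
    return q, q.astype(np.float32) * scale  # integer codes, dequantized values

rng = np.random.default_rng(0)
w = rng.standard_normal((256, 256)).astype(np.float32)  # stand-in weight matrix

for bits in (8, 4, 2):
    q, w_hat = fake_quantize(w, bits)
    rel_err = np.linalg.norm(w - w_hat) / np.linalg.norm(w)
    print(f"INT{bits}: {len(np.unique(q))} levels used, relative error {rel_err:.3f}")
```

Fewer bits mean cheaper storage and arithmetic but larger rounding error. In practice, finer-grained (e.g., per-channel) scales and quantization-aware training are common ways to reduce that error.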



# IBM AIU inference stack integrated with watsonx



# IBM AIU emulation overview

- Emulation systems have been essential for:
  - Hardware verification: Uncover functional/performance bugs
  - Software development: Provide platform for chip internal/external software development



### Synopsys ZeBu

- 96 Xilinx VU440 FPGAs
- Hardware verification
- Compiler / hardware co-development

### Synopsys HAPS

- 4-8 Xilinx VU440 FPGAs
- Device driver development

# Full AIU computational emulation

- Objective: high-fidelity model of all computational elements – cores and interconnect – of the SoC
- Model build:
  - ZeBu system from Synopsys
  - 96 Xilinx VU440 FPGAs
  - Very high fill rate, ~90% LUT utilization
  - 24h model build time (RTL to bitfiles)
  - 1 to 1.5 MHz operating frequency, limited by the memory interface
- Impact highlights:
  - Found several high-impact hardware bugs
    - Rare, hard-to-hit scenarios, practically impossible to find in simulation
  - Vital for compiler development
  - Complete cycle-accurate processing of 1 image: 1 min on ZeBu vs. 9 hours in simulation
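For scale, the single-image comparison above corresponds to a speedup of roughly 540× for emulation over cycle-accurate simulation:

$$
\frac{9\ \text{h}}{1\ \text{min}} = \frac{540\ \text{min}}{1\ \text{min}} = 540\times
$$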



### Example

| Number of different NNs exercised      | 14          |
|----------------------------------------|-------------|
| Tests run (32 images/features per run) | 100,000     |
| Image/feature inferences completed     | 3.2 million |
| Total emulation run time               | 7000 hours  |
| Equivalent SoC run time                | 7 hours     |
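The example figures can be cross-checked with a few lines of arithmetic. 100,000 runs of 32 inputs each give the 3.2 million inferences listed. 7,000 emulation hours against 7 equivalent SoC hours is a 1,000× slowdown, or just under 8 seconds of emulation time per inference on average across the 14 networks. The variable names below are ours, and the inputs are the table values.

```python
# Values from the example table
tests_run = 100_000
inputs_per_test = 32            # images or features per run
emulation_hours = 7_000
soc_equivalent_hours = 7

inferences = tests_run * inputs_per_test                      # total inputs processed
slowdown = emulation_hours / soc_equivalent_hours             # emulator time / silicon time
emu_sec_per_inference = emulation_hours * 3600 / inferences   # averaged over all networks

print(f"{inferences:,} inferences, {slowdown:,.0f}x slowdown, "
      f"{emu_sec_per_inference:.2f} s per inference on the emulator")
```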

# AIU nest emulation

### Why a second emulation platform?

- **Develop the device driver stack for AIU**: requires SoC-like hardware fidelity (e.g., the host-PCIe interface)

### Platform and model details:

- HAPS system from Synopsys
- 4-8 Xilinx VU440 FPGAs emulate a mini SoC
  - SoC-faithful nest + 1 AI core (vs. 32 AI cores in the full SoC)
  - Running at MHz speed
- Includes PCIe Gen5 PHY daughter card from Synopsys
- Includes DDR4 DIMMs
- Uniquely suited for AIU driver development
  - Faithfully realizes the host-PCIe interface of the SoC



| Network              | HAPS runtime (s/image or s/feature) | ZeBu runtime (s/image or s/feature) |
|----------------------|-------------------------------------|-------------------------------------|
| ResNet50             | 1.46                                | 10.02                               |
| MobileNetV1          | 0.59                                | 3.37                                |
| InceptionV4          | 4.35                                | 43.76                               |
| BERT-large (seq=384) | 67                                  | 292                                 |
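Dividing each ZeBu runtime by the corresponding HAPS runtime shows that the single-core HAPS model processes inputs roughly 4× to 10× faster than the full-chip ZeBu model on these networks. The sketch below performs that calculation on the table values. The dictionary layout is our own.

```python
# Runtimes from the table: (HAPS, ZeBu) in seconds per image (per feature for BERT)
runtimes = {
    "ResNet50": (1.46, 10.02),
    "MobileNetV1": (0.59, 3.37),
    "InceptionV4": (4.35, 43.76),
    "BERT-large (seq=384)": (67.0, 292.0),
}

# Speed advantage of HAPS over ZeBu for each network
ratios = {net: zebu / haps for net, (haps, zebu) in runtimes.items()}
for net, r in ratios.items():
    print(f"{net:22s} HAPS is {r:4.1f}x faster than ZeBu")
```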

# Modeling and emulation impact

- Multiple software and FPGA-based methods have been essential to IBM's full-stack AIU and AI system development
- Our SoC design process leverages multiple levels of simulation for architecture development, logic and chip design, and design verification
- Our software stack development, accelerator software integration development, and compiler / hardware co-optimization leveraged FPGA-based emulation systems
  - Full-chip emulation via ZeBu for full-chip performance & accuracy analyses of AI models on multicore models, compiler optimizations, architectural modifications and power estimation
  - Detailed SoC nest emulation via HAPS for device driver development, low-level software stack development, and evaluation of multi-chip configurations
- These methods enabled us to develop a full system, end-to-end hardware and software stack for Foundation Model inference in parallel to SoC and PCIe card development

# Foundation Models are an inflection point for enterprise AI

- FMs enable a proliferation of task-specific models, but with large and escalating compute demands
  - Inference, fine-tuning, and distributed training systems differ in their requirements
  - Full-system innovation is required



- Our approach emphasizes:
  - Cloud-native architectures
  - Ease-of-use for developers and clients
  - Hybrid cloud consumption
  - AI accelerator design and technology innovations
