### **Sasank Chilamkurthy**

# Open Source GPU Stacks in the Era of Proprietary Dominance





## Outline

- Proprietary dominance
- Computers
- Break the wall

# **Proprietary dominance**



## 465.07 USD +402.04 (637.85%) **↑** past 5 years

Closed: Aug 1, 7:59 PM EDT. After hours: 461.95, -3.12 (0.67%)



| Metric    | Value  | Metric     | Value  |
|-----------|--------|------------|--------|
| Mkt cap   | 1.15T  | CDP score  | B      |
| P/E ratio | 241.70 | 52-wk high | 480.88 |
| Div yield | 0.034% | 52-wk low  | 108.13 |

## History of AI





[Figure: history of AI. Legible labels: open source, ImageNet, GPUs]

## It all started with AlexNet



Neural Information Processing Systems https://proceedings.neurips.cc > paper > 4824-i...

#### ImageNet Classification with Deep Convolutional Neural ...

by A Krizhevsky · Cited by 119294 — We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the... 9 pages

### Year 2012 **100k citations!**



https://papers.nips.cc/paper\_files/paper/2012/hash/c399862d3b9d6b76c8436e924a68c45b-Abstract.html



#### Abstract

We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0% which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called "dropout" that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry.



## AI runs on GPUs

- AI = matrix multiplications, which are massively parallelizable
- GPUs are great at parallel computation
- CPU: < 32 cores/threads; GPU: > 4000 cores/threads!
- A CPU is at least 10x slower
- It is impractical to train, or even run, any reasonable AI model without GPUs or ASICs
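To make the first bullet concrete: every element of a matrix product depends only on one row of the left matrix and one column of the right matrix, so all elements can be computed independently. The sketch below is a minimal illustration, not how a GPU library does it. The names `cell` and `matmul` are made up for this example, and a thread pool merely stands in for GPU threads to show independence, not speed.

```python
from concurrent.futures import ThreadPoolExecutor


def cell(A, B, i, j):
    """Compute one output element C[i][j] as a dot product."""
    return sum(A[i][k] * B[k][j] for k in range(len(B)))


def matmul(A, B, workers=4):
    """Multiply A (n x m) by B (m x p), one independent task per output cell."""
    n, p = len(A), len(B[0])
    coords = [(i, j) for i in range(n) for j in range(p)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # No task reads another task's output, so they can run in any order.
        values = list(pool.map(lambda ij: cell(A, B, *ij), coords))
    # Reshape the flat list of results back into n rows of p columns.
    return [values[i * p:(i + 1) * p] for i in range(n)]


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C = matmul(A, B)
print(C)
```

An n x n product has n² such independent cells, which is why thousands of GPU threads can all be kept busy at once.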

## **CUDA** is the de facto standard

- CUDA is a C-like language for programming GPUs
- Practically all AI programs are written in CUDA, Nvidia's GPGPU language
- CUDA works only on Nvidia GPUs
- Therefore AI software runs only on Nvidia GPUs
- AI hardware is a **monopoly** because of the lack of good compilers!
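For readers who have not seen CUDA, its core idea is that one function (a kernel) runs on many threads, and each thread picks its own data using its block and thread index, usually as `blockIdx.x * blockDim.x + threadIdx.x`. The sketch below emulates that indexing scheme in plain Python. It is only an illustration: `launch` and `vector_add_kernel` are hypothetical names, and the loop runs serially where a GPU would run the threads in parallel.

```python
def vector_add_kernel(block_idx, block_dim, thread_idx, a, b, out):
    """Body that every emulated GPU thread runs, mirroring a CUDA kernel."""
    i = block_idx * block_dim + thread_idx  # global thread index
    if i < len(out):  # guard: the last block may have spare threads
        out[i] = a[i] + b[i]


def launch(kernel, grid_dim, block_dim, *args):
    """Emulate a kernel launch by visiting every (block, thread) pair."""
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            kernel(block_idx, block_dim, thread_idx, *args)


n = 10
a = list(range(n))
b = [10 * x for x in a]
out = [0] * n
block_dim = 4
grid_dim = (n + block_dim - 1) // block_dim  # enough blocks to cover n elements
launch(vector_add_kernel, grid_dim, block_dim, a, b, out)
print(out)
```

The bounds check matters because 3 blocks of 4 threads give 12 threads for only 10 elements.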



## AlexNet was done in CUDA of course

contain enough labeled examples to train such models without severe overfitting.

The specific contributions of this paper are as follows: we trained one of the largest convolutional neural networks to date on the subsets of ImageNet used in the ILSVRC-2010 and ILSVRC-2012 competitions [2] and achieved by far the best results ever reported on these datasets. We wrote a highly-optimized GPU implementation of 2D convolution and all the other operations inherent in training convolutional neural networks, which we make available publicly<sup>1</sup>. Our network contains a number of new and unusual features which improve its performance and reduce its training time, which are detailed in Section 3. The size of our network made overfitting a significant problem, even with 1.2 million labeled training examples, so we used several effective techniques for preventing overfitting, which are described in Section 4. Our final network contains five convolutional and three fully-connected layers, and this depth seems to be important: we found that removing any convolutional layer (each of which contains no more than 1% of the model's parameters) resulted in inferior performance.

In the end, the network's size is limited mainly by the amount of memory available on current GPUs




High-performance C++/CUDA implementation of convolutional neural networks

Note July 18, 2014: I've released an update to cuda-convnet, called cuda-convnet2. The two main new features are faster training on Kepler-generation GPUs and support for multi-GPU training.

This is a fast C++/CUDA implementation of convolutional (or more generally, feed-forward) neural networks. It can model arbitrary layer connectivity and network depth. Any directed acyclic graph of layers will do. Training is done using the back-propagation algorithm.

Fermi-generation GPU (GTX 4xx, GTX 5xx, or Tesla equivalent) required.

#### Documentation

- Compiling -- how to check out and compile this code.
- Data -- what kind of data this net can train on.
- LayerParams -- how to specify an architecture for the net.
- NeuronTypes -- types of hidden unit nonlinearities.
- TrainingNet -- how to train the net.
- Options -- the command-line arguments that the net takes.
- ViewingNet -- how to look inside the checkpoints saved by the net.
- CheckingGradients -- how to numerically test the gradients for correctness.

## **PyTorch dominates AI frameworks**

Written in C++ & CUDA but with a Python API

The main structure of PyTorch in an architectural view is shown in the figure below.



Architecture. Inspired by <sup>2</sup>

https://se.ewi.tudelft.nl/desosa2019/chapters/pytorch/#fn:3

## **Nvidia H100 GPUs: Supply and Demand**

July 2023 · Updated: August 2023

## Is there really a **Bottleneck?**

#### How Many GPUs Are Needed?

- GPT-4 was likely trained on somewhere between 10,000 to 25,000 A100s.<sup>20</sup>
- Meta has about 21,000 A100s, Tesla has about 7,000 A100s, and Stability AI has about 5,000 A100s.<sup>21</sup>
- Falcon-40B was trained on 384 A100s.<sup>22</sup>
- Inflection used 3,500 H100s for their GPT-3.5 equivalent model.<sup>23</sup>

| Specifications                |                                                                                                                 |
|-------------------------------|-----------------------------------------------------------------------------------------------------------------|
| GPU                           | 8x NVIDIA H100 Tensor Core GPUs                                                                                 |
| GPU memory                    | 640GB total                                                                                                     |
| Performance                   | 32 petaFLOPS FP8                                                                                                |
| NVIDIA® NVSwitch<sup>™</sup>  | 4x                                                                                                              |
| System power usage            | 10.2kW max                                                                                                      |
| CPU                           | Dual Intel® Xeon® Platinum 8480C Processors<br>112 Cores total, 2.00 GHz (Base),<br>3.80 GHz (Max Boost)        |
| System memory                 | 2TB                                                                                                             |
| Networking                    | 4x OSFP ports serving 8x single-port NVIDIA ConnectX-7 VPI<br>> Up to 400Gb/s InfiniBand/Ethernet<br>2x dual-port QSFP112 NVIDIA ConnectX-7 VPI<br>> Up to 400Gb/s InfiniBand/Ethernet |
| Management network            | 10Gb/s onboard NIC with RJ45<br>100Gb/s Ethernet NIC<br>Host baseboard management controller<br>(BMC) with RJ45 |
| Storage                       | OS: 2x 1.92TB NVMe M.2<br>Internal storage: 8x 3.84TB NVMe U.2                                                  |

### **500k USD / DGX H100, 30k USD / card**

Almost a million H100s ordered for the next year
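A quick back-of-the-envelope check of what these prices imply. The prices and the 8-GPU count come from the slides; rounding "almost a million" up to exactly one million cards is an assumption made here for simplicity.

```python
CARD_PRICE_USD = 30_000     # per H100 card, from the slide
DGX_PRICE_USD = 500_000     # per DGX H100 system, from the slide
CARDS_PER_DGX = 8           # from the specification table
H100_ORDERED = 1_000_000    # assumption: "almost a million" rounded up

# Value of the GPUs alone inside one DGX H100.
cards_in_dgx_usd = CARDS_PER_DGX * CARD_PRICE_USD
# What remains for CPUs, memory, networking, storage, chassis, and margin.
rest_of_system_usd = DGX_PRICE_USD - cards_in_dgx_usd
# Value of the ordered cards at the per-card price.
order_value_usd = H100_ORDERED * CARD_PRICE_USD

print(f"GPUs in one DGX H100: {cards_in_dgx_usd:,} USD")
print(f"Rest of the system:   {rest_of_system_usd:,} USD")
print(f"Ordered H100s:        {order_value_usd / 1e9:.0f} billion USD")
```

Under these assumptions the cards account for a bit under half of a DGX system's price, and the order book is on the scale of tens of billions of dollars.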

# **Computer Architecture**







## von Neumann architecture



https://chsasank.com/llm-system-design.html



Interprets a portion of memory as a program and executes it on data
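The stored-program idea can be sketched as a toy machine whose instructions and data share one memory. This is only an illustration under made-up assumptions: the four opcodes and the two-cell instruction format are invented for this example and do not correspond to any real instruction set.

```python
def run(memory):
    """Run a toy stored-program machine until it reaches HALT."""
    pc, acc = 0, 0  # program counter and accumulator
    while True:
        # Fetch: an instruction is two consecutive memory cells.
        op, arg = memory[pc], memory[pc + 1]
        pc += 2
        # Decode and execute.
        if op == "LOAD":
            acc = memory[arg]
        elif op == "ADD":
            acc += memory[arg]
        elif op == "STORE":
            memory[arg] = acc
        elif op == "HALT":
            return memory
        else:
            raise ValueError(f"unknown opcode {op!r}")


memory = [
    "LOAD", 8,    # addresses 0-1: acc = memory[8]
    "ADD", 9,     # addresses 2-3: acc += memory[9]
    "STORE", 10,  # addresses 4-5: memory[10] = acc
    "HALT", 0,    # addresses 6-7: stop
    2, 3, 0,      # addresses 8-10: data
]
run(memory)
print(memory[10])
```

Program and data live in the same list, so every instruction fetch and every operand access goes through the same memory.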



## **Basis of all Computers**





-> Slot for CPU

(Traced the wires between)

-> Slot for memory

## **Basis of all GPUs**



https://simple.wikipedia.org/wiki/Video\_card

## **GPU Architecture**

GPU = multi-core processor with hardware support for multithreading



### Note Memory Hierarchy

https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0061892







# **Optimal hardware design**



Optimal FLOPs/BW = Batch Size
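One way to arrive at this rule, sketched under stated assumptions: during inference every parameter is read from memory once per batch and used for about 2 FLOPs (one multiply, one add) per sample, and parameters are stored in 16 bits (2 bytes). Compute time and memory time then balance when FLOPs/BW equals the batch size. The hardware figures below are hypothetical round numbers, not any real chip's specification.

```python
def balanced_batch_size(flops_per_s, bytes_per_s, bytes_per_param=2):
    """Batch size at which compute time equals weight-streaming time.

    Memory time  = params * bytes_per_param / bytes_per_s
    Compute time = 2 * params * batch / flops_per_s
    Setting them equal and solving for batch gives the formula below.
    """
    return flops_per_s * bytes_per_param / (2 * bytes_per_s)


# Hypothetical accelerator: 1000 TFLOP/s of compute, 2 TB/s of memory bandwidth.
flops = 1000e12
bandwidth = 2e12
batch = balanced_batch_size(flops, bandwidth)
print(batch, flops / bandwidth)
```

With 2-byte parameters the two printed numbers coincide. Below that batch size the chip waits on memory; above it the chip is compute bound.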



## **FLOPs/BW is getting worse**

We need newer designs for inference



## AI Chips Cambrian Explosion



Edge vs Cloud. Open vs Closed.

Nathan Odle 🤣 @mov\_axbx · Apr 10

**Caption Contest** 






#### Cloud TPU





# Break the wall





# How to support newer architectures?

### **Common intermediate representation for all programming languages and hardware**





https://aosabook.org/en/v1/llvm.html
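The value of a common intermediate representation is that N frontends and M backends need only N + M components instead of N x M translators. The toy sketch below shows the shape of that design with two frontends and two backends sharing one tiny stack-based IR. Everything here is invented for illustration (the IR, the function names, the input formats); it is not LLVM's IR or API.

```python
OPS = {"+": "add", "*": "mul"}


def frontend_postfix(src):
    """Frontend 1: postfix text such as '2 3 + 4 *' to IR."""
    return [(OPS[t],) if t in OPS else ("push", int(t)) for t in src.split()]


def frontend_nested(expr):
    """Frontend 2: nested tuples such as ('*', ('+', 2, 3), 4) to IR."""
    if isinstance(expr, int):
        return [("push", expr)]
    op, left, right = expr
    return frontend_nested(left) + frontend_nested(right) + [(OPS[op],)]


def backend_eval(ir):
    """Backend 1: execute the IR on a stack machine."""
    stack = []
    for instr in ir:
        if instr[0] == "push":
            stack.append(instr[1])
        else:
            b, a = stack.pop(), stack.pop()
            stack.append(a + b if instr[0] == "add" else a * b)
    return stack.pop()


def backend_text(ir):
    """Backend 2: emit an infix expression string from the same IR."""
    stack = []
    for instr in ir:
        if instr[0] == "push":
            stack.append(str(instr[1]))
        else:
            b, a = stack.pop(), stack.pop()
            symbol = "+" if instr[0] == "add" else "*"
            stack.append(f"({a} {symbol} {b})")
    return stack.pop()


ir_a = frontend_postfix("2 3 + 4 *")
ir_b = frontend_nested(("*", ("+", 2, 3), 4))
print(ir_a == ir_b, backend_eval(ir_a), backend_text(ir_b))
```

Either frontend can be paired with either backend, and adding a third backend would not require touching the frontends.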



## Kernels are hard

Another point of lock-in



## SYCL: A Portable Alternative to CUDA

Standard, not implementation



## SYCL Works Great

| GPU                 | Matrix<br>Size | PortBlas<br>GFLOP/s | Vendor Libraries<br>GFLOP/s | PortBlas/<br>Vendor |
|---------------------|----------------|---------------------|----------------------------|---------------------|
| Nvidia GTX<br>1650M | 1024           | 1284                | 1483                       | 87%                 |
|                     | 2048           | 2299                | 2700                       | 85%                 |
|                     | 4096           | 2475                | 1889                       | 131%                |
| AMD Van<br>Gogh     | 1024           | 451                 | 889                        | 51%                 |
|                     | 2048           | 911                 | 689                        | 132%                |
|                     | 4096           | 989                 | 1199                       | 82%                 |
| Intel Arc 770       | 1024           | 7210                | 5271                       | 137%                |
|                     | 2048           | 8473                | 1511                       | 561%                |
|                     | 4096           | 8408                | 16425                      | 51%                 |

## LLVM IR doesn't work **for GPUs**







In search for portable CUDA alternative, I found that LLVM doesn't really cut it as intermediate representation. Read why in my latest post: chsasank.com/intermediate-r...

### Intermediate Representations for GPUs: LLVM Does Not Cut it

Sasank Chilamkurthy | 05 April 2024 | 11 minutes to read.

Compilers are like dragons, and wrapping my head around their complexity has been challenging. Adding to the challenge, I've chosen a particularly tough topic within this complexity: AI compilers. What sets AI apart are GPUs and matrix multiplication kernels. In this post, I will talk about compilers for GPUs and will leave matrix multiplication kernels to another post. We will examine the LLVM compiler framework for CPUs and contrast it with its use for GPUs. We'll show that LLVM is not a reasonable IR for GPUs.

#### How LLVM works

A good review of architecture of LLVM can be found in the book The Architecture of Open Source Applications. I reproduce a key diagram from the LLVM chapter below for reference:




3:31 PM · Apr 5, 2024 · 29.3K Views






### Free as in Freedom



https://twitter.com/elonmusk/status/1765387202953937224



Compiler Discussion

- **Manasij Mukherjee PhD:** I guess more than gains it enables you to keep software portable and programming to familiar paradigms instead of having to learn new vect...
- 12:25, in reply: Portability was one of the main reasons why programming languages were originally
- 12:27: I think we're in Fortran days again. We don't have software portability for AI chips
- 12:28, **Vedant Paranjape:** MLIR is trying to solve that problem.
- 12:28, **Manasij Mukherjee PhD:** It feels like all the interesting users of MLIR are closed source.
- 12:29, in reply: This is why GPL is important I guess



# "People who are really serious about software should make their own hardware."





## I build hardware





# **VON NEUMANN** AI