May 30, 2025 · 5 min read
Most articles are about finetuning models (or using them); few are about optimizing the code that runs them. This series is a semi-organized dump of my resources and notes for learning GPU programming, intended as a mini-resource for others who wish to break through this abstraction layer.
While the idea of shifting to slightly lower-level programming might seem daunting at first, knowing a few key concepts and resources goes a long way toward understanding this seemingly esoteric topic.
you-can-just-do-things.jpg
Most introductions to GPUs start with this Mythbusters video...
Mythbusters: GPU vs CPU; a visualization of parallel processing
It's a visualization of what parallel compute looks like, but we can do a lot better. Knowing a little more about the architecture and hardware make-up of a GPU will help a lot with understanding why we make certain design choices and optimizations later.
GPUs consist of many streaming multiprocessors (SMs, which are sometimes also known as compute units). These each contain "cores", schedulers, and on-chip memory (more on these later).
For now, let's start with a bottom-up view of how things work on a GPU: when a program (also known as a kernel) is launched on the GPU, thousands of threads execute it in parallel. These threads are organized hierarchically (threads into blocks, blocks into a grid) to match the hardware structure.
GPU Kernel Execution Hierarchy
This hierarchy is part abstraction and part a reflection of the GPU hardware's structure, and it is essential for understanding how programs are written. As a simplification: we write code that is executed by a single thread, and we use blocks and grids to run many copies of that code in parallel.
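To make that concrete, here is a minimal sketch (mine, not from the original post) of a CUDA vector-add kernel. Every thread runs the same code, and the built-in `blockIdx`, `blockDim`, and `threadIdx` variables tell each thread which element of the problem it owns.

```cuda
// Minimal sketch: one thread per output element of c = a + b.
__global__ void add_kernel(const float* a, const float* b, float* c, int n) {
    // Global index of this thread within the whole grid.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {              // guard: the last block may be only partially full
        c[i] = a[i] + b[i];
    }
}
```

Launching this with `add_kernel<<<numBlocks, threadsPerBlock>>>(...)` creates a grid of blocks of threads: the same few lines of code are executed by thousands of threads at once, each on a different index `i`.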
If you've ever bought a GPU or watched an NVIDIA announcement, you've probably seen something like this:
NVIDIA A100 Hardware Specifications
As mentioned earlier, each SM executes multiple warps simultaneously, schedules their instructions, and provides access to fast on-chip memory (shared memory and registers).
In terms of hardware, an SM includes compute cores (the units that actually execute instructions), warp schedulers, a register file, and shared memory.
Multiple SMs operate independently within a GPU, together orchestrating thousands of concurrent threads.
A typical GPU computation involves copying input data from host (CPU) memory into GPU global memory, launching a kernel that reads those inputs and computes on them, and copying the results back to the host.
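A rough sketch of that host-side flow, assuming the `add_kernel` from the earlier snippet (the function name `run_add` is just illustrative):

```cuda
#include <cuda_runtime.h>

// Hypothetical host-side driver for the add_kernel sketched earlier.
void run_add(const float* h_a, const float* h_b, float* h_c, int n) {
    size_t bytes = n * sizeof(float);
    float *d_a, *d_b, *d_c;

    // 1. Allocate GPU global memory and copy the inputs host -> device.
    cudaMalloc((void**)&d_a, bytes);
    cudaMalloc((void**)&d_b, bytes);
    cudaMalloc((void**)&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    // 2. Launch the kernel: enough 256-thread blocks to cover all n elements.
    add_kernel<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);

    // 3. Copy the result device -> host and free GPU memory.
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
}
```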
It's good to note here that kernels do not return anything. Kernels operate on pointers, which reference positions in GPU global memory where the data lives. The result of the kernel's computation is typically written to designated locations in global memory.
While the idea of working with pointers might be daunting to some without a CS background, the use of pointers here is typically straightforward. It also makes it easy to write multiple outputs (including intermediate outputs) with a single kernel operation.
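For instance, here's a hypothetical kernel (illustrative only, not from the post) that writes two output arrays in a single launch, one holding an intermediate value and one the final result, simply by taking two output pointers:

```cuda
// Hypothetical kernel writing both an intermediate output (x squared) and a
// final output (the scaled square) through separate global-memory pointers.
// Nothing is returned; the host reads back whichever arrays it needs.
__global__ void square_then_scale(const float* x, float* x_sq, float* x_scaled,
                                  float scale, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float sq = x[i] * x[i];
        x_sq[i]     = sq;          // intermediate output
        x_scaled[i] = sq * scale;  // final output
    }
}
```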
Triton allows you to reason at a higher level of abstraction: vectorized blocks, rather than individual threads. The trade-off is typically worth it, hence the hype around Triton: you get roughly 80% of the optimization performance for 20% of the work.