Accelerated Computing Refresher
GPUs, CUDA, and NVIDIA's stack — what accelerated computing is, when it helps, and how to start using it. Written for engineers who've never touched a GPU.
1. Why Accelerated Computing?
For decades, your code got faster for free. Every 18 months, the new CPU was 2x as fast. You could ship slower code and rely on hardware to bail you out. That era ended around 2005.
CPU clock speeds hit a wall at roughly 4 GHz. Two fundamental limits collide: power density (faster clocks burn more power, which melts chips) and Dennard scaling (the property that let transistors run cooler as they shrank stopped working at small node sizes). Intel's "4 GHz barrier" from 2004 is still, in 2026, a barrier.
The answer the industry landed on: do many things at once instead of one thing faster. CPUs added more cores (2, 4, 8, 64). But GPUs took this to an extreme: thousands of simpler cores working in parallel.
The core insight: latency vs. throughput
CPUs are optimized for latency — getting one task done as fast as possible. They have large caches, branch predictors, out-of-order execution units, and speculative execution — all machinery that reduces how long a single thread waits.
GPUs are optimized for throughput — doing as many tasks simultaneously as possible. They sacrifice single-thread speed for the ability to run thousands of threads at once, all operating on different data.
Think of it this way: a CPU is a few elite sprinters. A GPU is a stadium of ordinary runners — slower individually, but together they cover vastly more ground.
CPU vs GPU at a glance
| Property | Modern CPU (e.g., AMD EPYC) | Modern GPU (e.g., H100 SXM5) |
|---|---|---|
| Core count | 8–128 cores | 2,000–20,000 CUDA cores |
| Clock speed | 3–5 GHz | 1–2 GHz |
| Memory bandwidth | 50–300 GB/s (DDR5) | 500–3,350 GB/s (HBM3e) |
| Memory capacity | Up to 6 TB (server) | 24–192 GB (VRAM) |
| Peak FLOPS | ~2 TFLOPS (FP32) | ~67 TFLOPS (FP32 CUDA cores); ~494 TFLOPS dense TF32 on Tensor Cores (989 with sparsity) — H100 SXM5 |
| Best for | Sequential, branchy, latency-sensitive | Parallel, data-parallel, throughput |
| Programmed with | Any language | CUDA/HIP/SYCL or high-level frameworks |
Jensen Huang, NVIDIA's CEO, declared at GTC 2026: "All computing will be accelerated computing." Whether or not you believe the absolutism, the direction is clear. ML training, data pipelines, LLM inference, vector search — workloads that used to be CPU-bound are routinely 10–100x faster on GPU today.
2. GPU vs CPU Architecture
You know CPUs. Let's build GPU understanding from that foundation.
What makes a CPU fast for your typical code
When you call a function that does an if/else based on a database result, then calls another function based on that result — this is a control-flow-heavy, data-dependent workload. CPUs are engineered for exactly this pattern:
- Branch predictor: guesses which branch you'll take, starts executing it before the condition is known. 95%+ accurate on typical code.
- Out-of-order execution: reorders instructions to keep execution units busy while waiting on memory.
- Large L1/L2/L3 caches: a single core typically has 32–64 KB of L1 and ~1 MB of L2, backed by tens of MB of shared L3. Getting data fast for the thread that needs it now.
- Speculative execution: runs code ahead of where you logically are, rolls back if wrong.
All of this silicon is in service of making one thread run fast.
What a GPU is actually doing
A GPU is built around a different assumption: you have a regular, data-parallel problem. The same operation applied to millions of independent data points. "Add these two arrays." "Multiply these matrices." "Apply this activation function to these 65,536 values."
GPUs strip out most of the complexity that makes CPUs fast at sequential code, and put that transistor budget into more compute units instead.
Key GPU architectural concepts you need to know:
Streaming Multiprocessors (SMs)
An H100 has 132 SMs. Each SM is something like a mini-CPU with its own set of CUDA cores, registers, shared memory, and schedulers. Think of them as departments in a very large company — each runs somewhat independently.
SIMT: Single Instruction, Multiple Threads
Within an SM, threads are grouped into warps of 32. All 32 threads in a warp execute the same instruction at the same time, but each on different data. This is SIMT: Single Instruction, Multiple Threads.
If thread 0 is adding element 0 of array A to element 0 of array B, thread 1 is simultaneously adding element 1 to element 1, ..., thread 31 is adding element 31. One instruction, 32 results produced in parallel.
Warp divergence: why branching hurts on GPUs
This is the single most important thing to understand about GPU performance.
Since 32 threads share an instruction, what happens when some threads need to take an if branch and others need to take the else? Both paths execute. The threads that aren't supposed to be on a given path are masked off (their results are discarded), but the time is spent. You get worst-case performance: sequential execution of both paths, zero parallelism.
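The usual mitigation is to make the branch data-parallel: compute a select instead of jumping, so every lane executes the same instruction stream. A minimal NumPy sketch of the principle (CPU code illustrating the shape GPU-friendly code takes, not actual GPU code):

```python
import numpy as np

# Branchy version: per-element if/else. On a GPU, a warp whose 32 threads
# disagree on the condition would execute BOTH paths serially.
def relu_branchy(x):
    out = np.empty_like(x)
    for i in range(len(x)):
        if x[i] > 0:
            out[i] = x[i]
        else:
            out[i] = 0.0
    return out

# Branchless version: the "branch" becomes a data-parallel select —
# every lane runs the same instructions, results are masked, not skipped.
def relu_branchless(x):
    return np.where(x > 0, x, 0.0)

x = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])
assert np.array_equal(relu_branchy(x), relu_branchless(x))
```

This is why ML kernels are written as masked arithmetic rather than per-element control flow.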
Memory hierarchy: bandwidth over latency
GPUs have a very different memory structure than CPUs:
The key takeaway: GPU global memory (HBM) has enormous bandwidth — 3.35 TB/s on an H100. Your CPU can barely manage 200 GB/s on a high-end server. This is why matrix multiplications are so fast on GPU: they need to read a lot of data quickly, and HBM delivers it.
But the CPU-to-GPU transfer over PCIe is slow (~32 GB/s). Every time you move data from RAM to VRAM or back, you pay this cost. It's the primary reason small datasets don't benefit from GPU acceleration: the transfer overhead exceeds the compute savings.
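The transfer cost is simple division — a back-of-envelope sketch assuming ~32 GB/s effective one-way bandwidth (real transfers add latency and setup cost on top):

```python
# Back-of-envelope: time to move data over PCIe at ~32 GB/s, one direction.
PCIE_GBPS = 32  # roughly PCIe 4.0 x16, one direction

def transfer_ms(size_mb):
    """Milliseconds to move size_mb megabytes at PCIE_GBPS GB/s."""
    return size_mb / 1024 / PCIE_GBPS * 1000

for size in (10, 1024, 10 * 1024):  # 10 MB, 1 GB, 10 GB
    print(f"{size:>6} MB -> {transfer_ms(size):.2f} ms each way")
```

A 10 GB DataFrame costs ~300 ms each way — fine if the GPU then saves you minutes of compute, ruinous if it saves you milliseconds.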
| Property | CPU Core (Zen 4) | GPU SM (H100) |
|---|---|---|
| Per-core clock | ~5 GHz | ~1.8 GHz |
| L1 cache | 32–64 KB | 256 KB (shared/L1) |
| Branch prediction | Yes, very good | None (warps) |
| Out-of-order exec | Yes | No |
| Threads per unit | 1 (+ hyperthreading) | 2,048 concurrent threads per SM |
| Designed for | Low-latency, single-thread speed | High-throughput, data-parallel work |
3. The CUDA Programming Model
You almost certainly will never write raw CUDA code in your career. But understanding the mental model is what lets you reason about performance, debug memory errors, and pick the right tool for your problem. Think of this as reading a map before you drive.
The execution hierarchy: Grid → Block → Thread → Warp
When you launch a GPU computation, you're launching a kernel: a function that executes on the GPU, run by thousands of threads simultaneously.
Those threads are organized in a hierarchy:
- Thread: the basic unit of work. Each thread runs your kernel function once, with a unique ID.
- Block: a group of up to 1,024 threads. Threads in the same block can share memory and synchronize with each other.
- Grid: all the blocks launched by a single kernel call. Can be millions of blocks.
- Warp: 32 threads that physically execute together (hardware concept, not something you configure).
A concrete example: vector addition
The "hello world" of GPU programming. Add two arrays, element by element. On a CPU, you'd write a loop. On a GPU, each thread handles one addition:
// CUDA C++ kernel — each thread adds one element
__global__ void vector_add(float *a, float *b, float *result, int n) {
// Figure out which element this thread handles
int idx = blockIdx.x * blockDim.x + threadIdx.x;
// Guard: some blocks may be launched with threads beyond array end
if (idx < n) {
result[idx] = a[idx] + b[idx];
}
}
// Host code: launch the kernel
int main() {
int n = 1000000; // 1 million elements
float *d_a, *d_b, *d_result;
// ... allocate and populate h_a, h_b, h_result on host (CPU) side ...
// Allocate GPU memory
cudaMalloc(&d_a, n * sizeof(float));
cudaMalloc(&d_b, n * sizeof(float));
cudaMalloc(&d_result, n * sizeof(float));
// Copy data from CPU to GPU
cudaMemcpy(d_a, h_a, n * sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(d_b, h_b, n * sizeof(float), cudaMemcpyHostToDevice);
// Launch: 256 threads per block, enough blocks to cover all n elements
int threads = 256;
int blocks = (n + threads - 1) / threads; // = 3907 blocks
vector_add<<<blocks, threads>>>(d_a, d_b, d_result, n);
// Copy results back to CPU
cudaMemcpy(h_result, d_result, n * sizeof(float), cudaMemcpyDeviceToHost);
// 3,907 blocks × 256 threads ≈ 1M threads launched; the GPU schedules them across SMs
cudaFree(d_a); cudaFree(d_b); cudaFree(d_result);
return 0;
}
Three phases: load (copy data to GPU), launch (run kernel), read back (copy results to CPU). This pattern is universal. The transfer cost is why small datasets often don't benefit.
Thread indexing: how threads know what to work on
Every thread can read three built-in variables to figure out which piece of data it should process:
- threadIdx.x: thread's position within its block (0 to blockDim.x-1)
- blockIdx.x: which block this thread is in
- blockDim.x: how many threads are in each block
The global element index: int idx = blockIdx.x * blockDim.x + threadIdx.x. Block 0 threads handle elements 0–255, block 1 threads handle 256–511, etc.
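The formula can be sanity-checked with plain Python arithmetic — a toy model of the index computation, nothing CUDA-specific:

```python
# Pure-Python model of CUDA's global index: shows how (blockIdx, threadIdx)
# pairs tile the whole array with no gaps and no overlaps.
def global_index(block_idx, block_dim, thread_idx):
    return block_idx * block_dim + thread_idx

block_dim = 256
# Block 0 covers elements 0..255, block 1 covers 256..511, and so on.
assert global_index(0, block_dim, 0) == 0
assert global_index(0, block_dim, 255) == 255
assert global_index(1, block_dim, 0) == 256
assert global_index(15, block_dim, 3) == 15 * 256 + 3
```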
In practice you'll almost never write __global__ functions yourself in production ML or data work — the frameworks in the following sections have already written them for you.
4. "My Pandas Code Takes 20 Minutes" — GPU for Data
This is the most accessible GPU use case for most software engineers. No ML knowledge required. Your existing pandas, Spark, or Polars code can often run dramatically faster on a GPU with minimal changes.
cudf.pandas: zero code changes, 10–150x faster
NVIDIA's cuDF library implements the pandas API on the GPU. The cudf.pandas accelerator is a drop-in: add one line at the top of your script, and pandas operations run on GPU transparently.
# Before: vanilla pandas, 20+ minutes on 10GB CSV
import pandas as pd
df = pd.read_csv("transactions_10gb.csv")
result = (
df.groupby(["customer_id", "product_category"])
.agg({"amount": ["sum", "mean", "count"]})
.reset_index()
.merge(df[["customer_id", "region"]].drop_duplicates(), on="customer_id")
)
result.to_csv("output.csv", index=False)
# After: cudf.pandas accelerator — identical code, GPU execution
import cudf.pandas # ← the only change
cudf.pandas.install()
import pandas as pd # pandas is now GPU-accelerated
df = pd.read_csv("transactions_10gb.csv")
result = (
df.groupby(["customer_id", "product_category"])
.agg({"amount": ["sum", "mean", "count"]})
.reset_index()
.merge(df[["customer_id", "region"]].drop_duplicates(), on="customer_id")
)
result.to_csv("output.csv", index=False)
# Runtime: ~25 seconds instead of 23 minutes (55x faster on T4 GPU)
# Install
pip install cudf-cu12 # for CUDA 12.x
# Or on Colab: !pip install cudf-cu12 --extra-index-url=https://pypi.nvidia.com
When cudf.pandas falls back to CPU
Not every pandas operation has a GPU implementation. cudf.pandas automatically falls back to CPU for unsupported operations, so your code never breaks — it just won't always be GPU-accelerated. Check with:
# From the command line, run your script under the profiler to see
# which operations executed on GPU vs fell back to CPU:
python -m cudf.pandas --profile my_script.py
# In Jupyter, prefix a cell with the magic: %%cudf.pandas.profile
Spark RAPIDS: 5x faster Spark, no code changes
If you're using Apache Spark for large-scale data processing, RAPIDS Accelerator for Spark is a plugin that runs Spark SQL operations on GPUs. It intercepts physical plan nodes and replaces CPU executors with GPU equivalents — zero code changes required, just a config flag.
# Add to spark-submit
--conf spark.plugins=com.nvidia.spark.SQLPlugin \
--conf spark.rapids.sql.enabled=true \
--conf spark.executor.resource.gpu.amount=1
# Or in PySpark:
spark = SparkSession.builder \
.config("spark.plugins", "com.nvidia.spark.SQLPlugin") \
.config("spark.rapids.sql.enabled", "true") \
.getOrCreate()
Benchmark: a typical ETL pipeline on 1TB of Parquet — Spark on CPU: 3.5 hours. Spark RAPIDS on 8x A100: 42 minutes. At cloud spot pricing, GPU actually costs less due to shorter runtime.
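The cost claim is just runtime × price. A sketch with hypothetical placeholder prices (not real cloud quotes — substitute your own rates):

```python
# Illustrative cost comparison for the 1TB ETL benchmark above.
# These $/hr figures are HYPOTHETICAL placeholders, not vendor pricing.
cpu_cluster_per_hr = 8.00    # assumed CPU cluster rate
gpu_cluster_per_hr = 14.00   # assumed 8x A100 spot rate

cpu_hours = 3.5              # CPU run: 3.5 hours
gpu_hours = 42 / 60          # GPU run: 42 minutes

cpu_cost = cpu_cluster_per_hr * cpu_hours
gpu_cost = gpu_cluster_per_hr * gpu_hours
print(f"CPU: ${cpu_cost:.2f}  GPU: ${gpu_cost:.2f}")
```

Even at a higher hourly rate, the shorter runtime can make the GPU cluster cheaper per job.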
Polars GPU backend
# Polars 1.x with GPU backend
# Install (run in terminal or Jupyter cell with ! prefix):
# pip install polars[gpu] # installs cuDF dependency
import polars as pl
# Enable GPU engine
df = pl.scan_parquet("data/*.parquet")
result = (
df.filter(pl.col("amount") > 100)
.group_by("category")
.agg(pl.col("amount").sum())
.collect(engine="gpu") # ← run on GPU
)
Benchmarks: when GPU data processing pays off
| Operation | Data Size | CPU Time | GPU Time (T4) | Speedup |
|---|---|---|---|---|
| groupby + agg | 1 GB | 45s | 2s | 22x |
| join two tables | 5 GB + 500 MB | 3.5 min | 8s | 26x |
| sort + dedup | 2 GB | 90s | 4s | 22x |
| string ops (ILIKE) | 500 MB | 25s | 12s | 2x (strings are hard) |
| custom Python lambda | Any | baseline | baseline (fallback) | 1x (falls back to CPU) |
When GPU data processing doesn't pay off:
- Data under ~100 MB: CPU+RAM is fast enough; transfer overhead dominates
- I/O-bound pipelines: if you're waiting on network/disk, the GPU sits idle
- Custom Python lambdas: df.apply(my_python_func) can't run on GPU — it needs the Python interpreter per row
- Streaming row-by-row: GPU excels at batch operations, not one-record-at-a-time processing
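The usual fix for the lambda case is rewriting the per-row function as column expressions, which cuDF can dispatch as whole-column kernels. A small sketch (plain pandas shown; the same code runs under cudf.pandas):

```python
import pandas as pd

df = pd.DataFrame({"amount": [50.0, 250.0, 1200.0]})

# Falls back to CPU under cudf.pandas: an arbitrary Python function
# must run in the interpreter, one row at a time.
slow = df["amount"].apply(lambda x: x * 1.08 if x > 100 else x)

# GPU-friendly rewrite: a vectorized select over the whole column.
fast = df["amount"].where(df["amount"] <= 100, df["amount"] * 1.08)

assert slow.equals(fast)
```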
For more on cuDF and data transform tools, see the Data Transformation Refresher.
5. "Training Takes 3 Days" — GPU for ML
This is the bread-and-butter GPU use case. Training a neural network is fundamentally matrix multiplication repeated millions of times — exactly what GPUs are built for.
The two-line GPU upgrade in PyTorch
import torch
import torch.nn as nn
# Check if GPU is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using: {device}") # "cuda" on Colab with GPU runtime
# Define a model
model = nn.Sequential(
nn.Linear(784, 512),
nn.ReLU(),
nn.Linear(512, 256),
nn.ReLU(),
nn.Linear(256, 10)
)
# Move model to GPU — this copies all parameters to VRAM
model = model.to(device)
# Training loop
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()
for epoch in range(10):
for batch_x, batch_y in dataloader:
# Move data to same device as model
batch_x = batch_x.to(device) # ← CPU→GPU transfer happens here
batch_y = batch_y.to(device)
optimizer.zero_grad()
output = model(batch_x) # forward pass on GPU
loss = criterion(output, batch_y)
loss.backward() # backward pass on GPU
optimizer.step()
That's it for a single GPU. model.to("cuda") and data.to("cuda") — two patterns, one rule. All computation between those .to(device) calls runs on the GPU.
Mixed precision training: 2x faster, half the memory
By default, PyTorch uses FP32 (32-bit floats). Switching to FP16 or BF16 halves memory usage and roughly doubles throughput on modern GPUs (which have Tensor Cores designed for FP16/BF16 matmul). The torch.amp (automatic mixed precision) module handles this automatically:
from torch.amp import autocast, GradScaler
scaler = GradScaler("cuda") # handles FP16 gradient scaling (prevents underflow)
for batch_x, batch_y in dataloader:
batch_x, batch_y = batch_x.to(device), batch_y.to(device)
optimizer.zero_grad()
# Forward pass in FP16 where safe, FP32 where precision matters
with autocast("cuda"):
output = model(batch_x)
loss = criterion(output, batch_y)
# Backward pass with scaled gradients
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
On an A100, BF16 training is typically 2–3x faster than FP32 with zero accuracy loss. On consumer GPUs (RTX 4090), FP16 with AMP gives similar benefits.
Multi-GPU training: DistributedDataParallel
import os
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
# Launch with: torchrun --nproc_per_node=4 train.py
dist.init_process_group(backend="nccl") # NCCL is the GPU comms library
local_rank = int(os.environ["LOCAL_RANK"])
device = torch.device(f"cuda:{local_rank}")
model = MyModel().to(device)
model = DDP(model, device_ids=[local_rank]) # wraps model for multi-GPU
# Each GPU processes a different shard of data
# DDP automatically averages gradients across all GPUs after backward()
# With 4xA100: effectively 4x larger batch, ~3.5x faster training
Training time comparison: CPU vs GPU
| Workload | CPU (32 cores) | Single T4 | Single A100 | 8x A100 |
|---|---|---|---|---|
| ResNet-50, 1 epoch (ImageNet) | ~6 hours | ~22 min | ~8 min | ~1 min |
| BERT-base fine-tune (MNLI) | ~18 hours | ~45 min | ~15 min | ~2 min |
| GPT-2 (117M) from scratch | days | ~8 hours | ~2.5 hours | ~20 min |
For more on PyTorch training patterns, see the PyTorch Refresher. For foundational ML concepts (loss functions, overfitting, regularization), see the Machine Learning Refresher.
6. "Serve 1000 LLM Requests/Second" — GPU for Inference
The hottest GPU use case in 2026. Serving large language models at scale is almost impossible without GPUs. Here's why, and how to do it.
Why LLMs need GPUs
A 7B parameter model has 7 billion floating-point numbers. At FP16 (2 bytes each), that's 14 GB just to store the weights — before you handle any requests. Each inference request involves multiplying these weights by the input embeddings repeatedly across 32+ transformer layers. That's billions of multiply-add operations per request.
On a CPU, generating one token for one user takes ~500ms. On a single A10G GPU, the same computation takes ~5ms — and the GPU can batch 50 concurrent users with minimal throughput penalty, making the effective throughput 100x higher than CPU.
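The 14 GB figure is plain arithmetic — parameter count times bytes per parameter:

```python
# Weight memory for an LLM: parameters x bytes per parameter.
def model_vram_gb(n_params, bytes_per_param=2):  # 2 bytes = FP16/BF16
    return n_params * bytes_per_param / 1e9

assert model_vram_gb(7e9) == 14.0     # 7B @ FP16 -> 14 GB of weights alone
assert model_vram_gb(7e9, 4) == 28.0  # same model @ FP32 -> 28 GB
assert model_vram_gb(70e9) == 140.0   # 70B @ FP16 -> needs multiple GPUs
```

And this is only weights — KV cache for in-flight requests comes on top, which is exactly the memory vLLM's PagedAttention manages.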
vLLM: the standard for LLM serving
vLLM is the dominant open-source LLM inference server. Its key innovation is PagedAttention: KV-cache memory management inspired by OS virtual-memory paging, which lets it batch many requests without wasting VRAM on fixed-size allocations.
# Install and start serving Llama 3.1 8B
pip install vllm
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Meta-Llama-3.1-8B-Instruct \
--gpu-memory-utilization 0.9 \
--max-model-len 8192 \
--dtype bfloat16
# Server starts on http://localhost:8000
# Compatible with OpenAI API format
# Call the vLLM server (OpenAI-compatible)
import openai
client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="none")
response = client.chat.completions.create(
model="meta-llama/Meta-Llama-3.1-8B-Instruct",
messages=[{"role": "user", "content": "Explain CUDA in one paragraph."}],
max_tokens=200,
)
print(response.choices[0].message.content)
# vLLM Python API (in-process, for batch jobs)
from vllm import LLM, SamplingParams
llm = LLM(model="meta-llama/Meta-Llama-3.1-8B-Instruct", dtype="bfloat16")
params = SamplingParams(temperature=0.7, max_tokens=200)
# Send 1000 prompts at once — vLLM batches them automatically
prompts = [f"Summarize: {text}" for text in texts]
outputs = llm.generate(prompts, params)
for output in outputs:
print(output.outputs[0].text)
TensorRT-LLM: maximum throughput
NVIDIA TensorRT-LLM compiles your model into an optimized inference engine using kernel fusion, quantization, and custom CUDA kernels. It's more complex to set up than vLLM but delivers higher peak throughput — 2–5x over naive PyTorch inference.
# TensorRT-LLM via Docker (recommended setup)
docker pull nvcr.io/nvidia/tritonserver:24.12-trtllm-python-py3
# Convert Llama 3.1 8B to TRT engine
python examples/llama/convert_checkpoint.py \
--model_dir /models/llama-3.1-8b \
--output_dir /tmp/trt-llama-ckpt \
--dtype bfloat16
trtllm-build \
--checkpoint_dir /tmp/trt-llama-ckpt \
--output_dir /tmp/trt-llama-engine \
--gemm_plugin bfloat16 \
--max_batch_size 64 \
--max_input_len 2048 \
--max_output_len 512
Cost math: GPU vs CPU for LLM inference
| Setup | Throughput (tokens/sec) | Cost/hr | Cost per 1M tokens |
|---|---|---|---|
| CPU (64-core server, Llama 3.1 8B) | ~50 tok/s | $2.40 | $13.33 |
| T4 GPU (16 GB), vLLM | ~1,200 tok/s | $0.53 | $0.12 |
| A10G GPU (24 GB), vLLM | ~3,500 tok/s | $1.06 | $0.08 |
| A100 GPU (80 GB), TRT-LLM | ~8,000 tok/s | $3.20 | $0.11 |
CPU inference for LLMs is roughly 25–100x more expensive per token than GPU inference. For any production serving load above a few requests per day, GPU pays for itself quickly.
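The cost-per-million-tokens column follows directly from the first two columns:

```python
# Derive cost per 1M tokens from hourly price and sustained throughput.
def cost_per_million_tokens(cost_per_hr, tokens_per_sec):
    tokens_per_hr = tokens_per_sec * 3600
    return cost_per_hr / tokens_per_hr * 1_000_000

assert round(cost_per_million_tokens(2.40, 50), 2) == 13.33    # CPU server
assert round(cost_per_million_tokens(0.53, 1200), 2) == 0.12   # T4 + vLLM
assert round(cost_per_million_tokens(1.06, 3500), 2) == 0.08   # A10G + vLLM
assert round(cost_per_million_tokens(3.20, 8000), 2) == 0.11   # A100 + TRT-LLM
```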
NVIDIA NIM: pre-packaged inference containers
NVIDIA NIM (NVIDIA Inference Microservices) packages optimized models into ready-to-deploy containers. You pull a container, it downloads the model and starts serving. No manual TensorRT-LLM setup required.
# Pull and run a NIM container for Llama 3.1 8B
export NGC_API_KEY=your_key_here
docker run -it --rm --gpus all \
-v ~/.cache/nim:/opt/nim/.cache \
-p 8000:8000 \
-e NGC_API_KEY \
nvcr.io/nim/meta/llama-3.1-8b-instruct:1.3.3
# Server is now running, OpenAI-compatible at port 8000
curl http://localhost:8000/v1/models
For more on LLMs and transformers, see the LLMs Refresher.
7. "Search 1M Vectors in 50ms" — GPU for Vector Search
If you're building RAG systems, semantic search, or recommendation engines, you're doing approximate nearest-neighbor (ANN) search over large vector collections. GPUs accelerate both index building and search dramatically.
CPU Faiss vs GPU Faiss
Meta's Faiss library is the standard for vector search. It has a GPU backend that can be plugged in with minimal code changes:
import numpy as np
import faiss
# Generate 1M vectors, 768 dimensions (typical embedding size)
d = 768
n = 1_000_000
vectors = np.random.randn(n, d).astype("float32")
query_vectors = np.random.randn(1000, d).astype("float32")  # 1,000 queries
# --- CPU version ---
index_cpu = faiss.IndexFlatL2(d)
index_cpu.add(vectors) # ~45 seconds
distances, indices = index_cpu.search( # ~8 seconds for 1000 queries
query_vectors, k=10
)
# --- GPU version ---
res = faiss.StandardGpuResources()
# Copies the already-populated CPU index to GPU — vectors are already added
index_gpu = faiss.index_cpu_to_gpu(res, 0, index_cpu) # moves to GPU 0
distances, indices = index_gpu.search( # ~1.0 seconds (8x faster than CPU)
query_vectors, k=10
)
cuVS: NVIDIA's next-gen vector search
cuVS (CUDA Vector Search) is NVIDIA's dedicated library for GPU-accelerated approximate nearest-neighbor search, now the recommended backend for large-scale deployments:
from cuvs.neighbors import cagra
import cupy as cp
# Move vectors to GPU
gpu_vectors = cp.asarray(vectors)
query_gpu = cp.asarray(query_vectors)  # queries must live on GPU too
# Build CAGRA index (graph-based ANN, ~10x faster than CPU HNSW)
index_params = cagra.IndexParams(metric="sqeuclidean")
index = cagra.build(index_params, gpu_vectors)
# Search
search_params = cagra.SearchParams(itopk_size=64)
distances, neighbors = cagra.search(
search_params, index, query_gpu, k=10
)
# Results back as cupy arrays — stay on GPU for downstream processing
When GPU vector search makes sense
| Scenario | Recommendation |
|---|---|
| Under 100K vectors | CPU Faiss or pgvector is fast enough. Don't bother. |
| 100K–10M vectors, latency <100ms | GPU Faiss or cuVS pays off. Significant speedup. |
| Batch indexing (millions of new vectors/day) | GPU indexing is 5–10x faster. Reduces index rebuild time. |
| Real-time RAG (<50ms p99) | GPU search can achieve this; CPU struggles at scale. |
| Already have GPU for inference | Run vector search on same GPU — zero extra cost. |
A practical pattern: if you're already running vLLM on a GPU for LLM inference, run your vector search on the same GPU during idle inference time. GPU utilization becomes near-100% and you get both for the price of one.
8. "Real-Time Object Detection at 30 FPS" — GPU for Vision
Computer vision workloads — object detection, image classification, video analysis — were the original GPU killer app and remain a dominant use case.
YOLOv8 on GPU: 30 FPS vs 2 FPS
from ultralytics import YOLO
import cv2
# Load YOLOv8 nano model (fastest)
model = YOLO("yolov8n.pt")
# Run on GPU
cap = cv2.VideoCapture("video.mp4")
while True:
ret, frame = cap.read()
if not ret:
break
# model auto-detects GPU and uses it
results = model(frame, device="cuda:0", verbose=False)
# Draw bounding boxes
annotated = results[0].plot()
cv2.imshow("Detection", annotated)
# GPU: ~30 FPS at 1080p with yolov8n
# CPU: ~2 FPS at 1080p with yolov8n (15x slower)
if cv2.waitKey(1) == ord("q"):
break
cap.release()
# Install
pip install ultralytics # includes PyTorch + CUDA dependencies
# Benchmark
yolo benchmark model=yolov8n.pt imgsz=640 device=0 # GPU
yolo benchmark model=yolov8n.pt imgsz=640 device=cpu # CPU
Video pipeline: full GPU processing
For production video pipelines, you want everything on the GPU — decode, preprocess, inference, and encode — to avoid the PCIe bottleneck between steps:
import torch
import torchvision.transforms as T
from torchvision.io import read_video # CPU video decode, then transfer to GPU
# Decode on CPU, then move to GPU tensor
# Requires: pip install torchvision[video]
frames, _, _ = read_video("clip.mp4", output_format="TCHW")
frames = frames.to("cuda").float() / 255.0 # (T, C, H, W) on GPU
# Batch preprocess on GPU
transform = T.Compose([
T.Resize((640, 640)),
T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
# Process all frames as a batch — no loop needed
frames_preprocessed = transform(frames) # GPU, vectorized
# Run inference on all frames at once
with torch.no_grad():
with torch.amp.autocast("cuda"): # PyTorch 2.0+ API
predictions = model(frames_preprocessed)
GPU vision use cases by industry
| Industry | Workload | Why GPU |
|---|---|---|
| Autonomous vehicles | Multi-camera real-time detection | 30 FPS × 8 cameras × 10 models simultaneously |
| Medical imaging | 3D MRI/CT segmentation | Large 3D volumes, batch processing overnight |
| Retail analytics | Store camera foot-traffic | 50+ camera streams, on-prem edge GPU |
| Satellite imagery | Change detection over regions | Terabytes of images, batch classification |
| Manufacturing QA | Defect detection on assembly line | Sub-10ms latency requirement, GPU achieves 2ms |
9. "Monte Carlo on a Million Paths" — GPU for Scientific Computing
Scientific and quantitative workloads — simulations, numerical methods, financial modeling — often map cleanly to GPU architecture because they involve the same operation applied to many independent data points.
CuPy: NumPy on GPU, drop-in replacement
import numpy as np
import cupy as cp # pip install cupy-cuda12x
import time
# CPU version
x_cpu = np.random.randn(10_000_000).astype(np.float32)
start = time.time()
result_cpu = np.fft.fft(x_cpu)
print(f"CPU FFT: {time.time() - start:.3f}s") # ~0.8s
# GPU version — identical code, just cp instead of np
x_gpu = cp.asarray(x_cpu) # copy to GPU once
start = time.time()
result_gpu = cp.fft.fft(x_gpu) # GPU computation
cp.cuda.Stream.null.synchronize()
print(f"GPU FFT: {time.time() - start:.3f}s") # ~0.012s (66x faster)
# CuPy array behaves like numpy array
print(result_gpu.shape) # (10000000,)
print(type(result_gpu)) # cupy.ndarray
result_back = cp.asnumpy(result_gpu) # back to CPU if needed
Monte Carlo option pricing: 100x faster
Monte Carlo simulation is a textbook GPU workload: simulate millions of independent random paths, then aggregate. Perfect SIMT fit — no dependencies between paths.
import cupy as cp
def monte_carlo_option_gpu(S0, K, r, sigma, T, n_paths=1_000_000, n_steps=252):
"""
European call option pricing via Monte Carlo.
S0: initial stock price
K: strike price
r: risk-free rate
sigma: volatility
T: time to expiry (years)
n_paths: number of simulation paths
n_steps: daily steps
"""
dt = T / n_steps
# Generate all random numbers at once on GPU
# Shape: (n_paths, n_steps) — each row is one simulation path
Z = cp.random.standard_normal((n_paths, n_steps), dtype=cp.float32)
# Compute all paths simultaneously (vectorized, no Python loop)
log_returns = (r - 0.5 * sigma**2) * dt + sigma * cp.sqrt(dt) * Z
# Cumulative sum along time dimension = log price path
log_prices = cp.log(S0) + cp.cumsum(log_returns, axis=1)
final_prices = cp.exp(log_prices[:, -1])
# Payoff for each path
payoffs = cp.maximum(final_prices - K, 0.0)
# Discount and average
option_price = cp.exp(-r * T) * cp.mean(payoffs)
return float(option_price)
# Price an option: S=100, K=105, r=5%, sigma=20%, T=1yr
price = monte_carlo_option_gpu(100, 105, 0.05, 0.20, 1.0)
print(f"Option price: ${price:.4f}")
# Timing (A100):
# GPU (1M paths): ~0.03 seconds
# CPU (NumPy, 1M paths): ~3.1 seconds (~100x faster on GPU)
Other scientific domains benefiting from GPU
| Domain | Workload | GPU library | Typical speedup |
|---|---|---|---|
| Molecular dynamics | Protein folding simulations | GROMACS, OpenMM | 50–100x |
| Weather modeling | Atmospheric simulation | CUDA Fortran, cuSPARSE | 10–30x |
| Computational fluid dynamics | Flow simulation, FEM | CUDA, AmgX | 20–50x |
| Seismic processing | Subsurface imaging | cuFFT, custom CUDA | 30–100x |
| Financial risk | VaR, CVA, Monte Carlo | CuPy, custom CUDA | 50–200x |
| Genomics | Sequence alignment | NVIDIA Clara Parabricks | 50x (hours → minutes) |
10. NVIDIA's Full Stack
NVIDIA is unusual in that it sells not just hardware but a tightly integrated hardware-to-application stack. Understanding the layers helps you know where your code sits and which NVIDIA tools are relevant to you.
Hardware: the GPU lineup
| GPU | Tier | VRAM | FP16 TFLOPS | Memory BW | Best use case |
|---|---|---|---|---|---|
| H100 SXM5 (current flagship) | Data center | 80 GB HBM3 | 989 | 3.35 TB/s | LLM training/inference, large-scale ML |
| B200 (2025) | Data center | 192 GB HBM3e | 2,250 | 8 TB/s | Next-gen LLM, future workloads |
| A100 | Data center | 40/80 GB HBM2e | 312 * | 2 TB/s | Most production ML today |
| L4 | Data center (inference) | 24 GB GDDR6 | 242 | 300 GB/s | Efficient inference, video transcoding |
| RTX 4090 | Desktop/workstation | 24 GB GDDR6X | 330 | 1 TB/s | Research, local LLM dev, gaming |
| T4 | Data center (budget) | 16 GB GDDR6 | 130 | 320 GB/s | Free Colab tier, affordable inference |
| Jetson Orin | Edge | 64 GB unified | 275 (INT8) | 204 GB/s | Autonomous vehicles, robotics, IoT |
* A100 312 TFLOPS FP16 figure is with 2:4 structured sparsity enabled. Dense FP16 throughput is ~77 TFLOPS. Sparsity requires the model's weight matrices to be pruned to the 2:4 sparse pattern before benefiting.
NVLink and NVSwitch: multi-GPU fabric
PCIe (the slot that connects GPU to CPU) runs at ~64 GB/s bidirectional. That's too slow when 8 GPUs need to share gradients during training. NVLink is a direct GPU-to-GPU interconnect; NVLink 4.0 provides 900 GB/s total per GPU (18 links × 50 GB/s bidirectional each). NVSwitch is a chip that creates an all-to-all mesh — any GPU can talk to any GPU at full NVLink bandwidth. An HGX H100 system (8 GPUs + NVSwitches) has 7.2 TB/s of GPU-to-GPU bandwidth, making it effectively one logical compute unit.
CUDA: the foundation
CUDA (Compute Unified Device Architecture) is NVIDIA's parallel computing platform and API, first released in 2007. It's what makes GPUs programmable for general computation (not just graphics). Everything else in the stack runs on top of CUDA.
Key insight: you almost never write CUDA directly. PyTorch, cuDF, and TensorRT have already written thousands of highly optimized CUDA kernels. You use their Python/C++ APIs and get GPU acceleration automatically.
Key libraries you'll encounter
| Library | What it does | Used by |
|---|---|---|
| cuDNN | Optimized primitives for deep learning (conv, pool, norm) | PyTorch, TensorFlow |
| cuBLAS | GPU-accelerated BLAS (matrix multiply at its core) | Everything that does linear algebra |
| cuFFT | Fast Fourier Transform on GPU | Signal processing, audio ML |
| NCCL | GPU collective comms (all-reduce, broadcast) | Multi-GPU/multi-node training |
| cuSPARSE | Sparse matrix operations on GPU | GNNs, sparse models, scientific sim |
| Thrust | GPU parallel algorithms (sort, reduce, scan) | cuDF under the hood |
| Triton (OpenAI) | Python-based custom GPU kernel language | Flash Attention, custom ops |
11. When NOT to Use a GPU
GPU cargo-culting is real. I've seen engineers reach for GPU acceleration for web scraping, small CSV processing, and microservices with 10 req/s — and make their system slower, more expensive, and harder to operate. Here's when to stay on CPU.
1. I/O-bound workloads
If your bottleneck is network latency, database queries, or disk reads, the GPU sits idle 95% of the time. You're paying for a Ferrari to wait at red lights.
Example: A web scraper that fetches 100 URLs, extracts text, calls an external API, and saves results. Bottleneck: network I/O (100–500ms per URL). GPU utilization: ~0%.
Test: Run nvidia-smi while your job is running. If GPU utilization is below 50% most of the time, you're I/O-bound.
2. Small datasets
The CPU-to-GPU PCIe transfer costs real time. For a 10 MB DataFrame, the transfer itself takes ~0.3ms, and the GPU computation might take 1ms. Loading the same 10 MB from L3 cache into CPU cores takes ~0.01ms, and the computation might take 5ms. GPU total: 1.3ms. CPU total: 5ms. On paper the GPU still wins, but the margin is thin and assumes the data is already warm; add the transfer back, kernel launch overhead, and allocator churn, and the advantage at this scale often disappears.
As a rough rule: if your dataset fits comfortably in CPU L3 cache (tens of MB), or if transfer time is more than 30% of your total compute time, GPU probably doesn't help.
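That 30% rule can be written down directly. A hypothetical helper (`worth_gpu` is a name invented here; the thresholds and timings are this section's rough figures, not measurements):

```python
def worth_gpu(transfer_ms: float, gpu_compute_ms: float,
              cpu_compute_ms: float, max_transfer_frac: float = 0.30) -> bool:
    """Rough heuristic: GPU is worth it only if it beats the CPU end-to-end
    AND PCIe transfer doesn't dominate the GPU-side time."""
    gpu_total = transfer_ms + gpu_compute_ms
    transfer_frac = transfer_ms / gpu_total
    return gpu_total < cpu_compute_ms and transfer_frac <= max_transfer_frac

# The 10 MB DataFrame example from above: marginal, but passes.
print(worth_gpu(transfer_ms=0.3, gpu_compute_ms=1.0, cpu_compute_ms=5.0))  # True

# Same transfer, trivial compute: transfer dominates, stay on CPU.
print(worth_gpu(transfer_ms=0.3, gpu_compute_ms=0.1, cpu_compute_ms=5.0))  # False
```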
3. Branch-heavy, control-flow-heavy code
Code where different data items take different paths through many if/else branches is a poor fit. Warp divergence means a warp that splits across a branch executes each taken path serially, with the threads not on that path masked off.
Example: Decision tree inference (many branches based on feature values). GPU implementations exist but are complex; CPU often wins for single-tree inference.
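A toy cost model makes the divergence penalty concrete. Treat each branch path as costing some number of cycles; under SIMT, a warp pays for every distinct path that at least one of its 32 threads takes (a simplification: real SMs interleave the masked execution, but the serialization is real):

```python
def warp_cycles(paths_taken: list[str], path_cost: dict[str, int]) -> int:
    """Cycles for one warp under SIMT: the costs of every distinct path
    taken by any thread in the warp add up (divergent paths run serially)."""
    return sum(path_cost[p] for p in set(paths_taken))

path_cost = {"then": 100, "else": 100}

uniform = ["then"] * 32                     # all 32 threads agree
divergent = ["then"] * 16 + ["else"] * 16   # warp splits 50/50

print(warp_cycles(uniform, path_cost))    # 100
print(warp_cycles(divergent, path_cost))  # 200: both paths, serially
```

With a deep decision tree, each level can split the warp again, which is why tree inference maps so poorly.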
4. Sequential algorithms
Algorithms where step N depends on the output of step N-1 resist parallelization. A linked list traversal, a recursive depth-first search, a sequential state machine: these run on a single thread and gain nothing from GPU parallelism. (A few recurrences, such as prefix sums, have parallel reformulations, but they are the exception.)
Rule of thumb: if your algorithm can't be expressed as "apply this function to all elements independently," it probably doesn't GPU-accelerate well.
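The rule of thumb is really a dependence test: does element N need element N-1's output? A dependency-free sketch contrasting the two shapes (helper names are invented for illustration):

```python
# Parallelizable: each output depends only on its own input (the "map" shape).
def embarrassingly_parallel(xs):
    return [x * x for x in xs]  # every element could go to a different GPU thread

# Sequential: each output depends on the previous output (a recurrence).
def sequential_recurrence(xs):
    state = 0
    out = []
    for x in xs:
        state = state * 2 + x  # step N needs step N-1's result
        out.append(state)
    return out

print(embarrassingly_parallel([1, 2, 3]))  # [1, 4, 9]
print(sequential_recurrence([1, 2, 3]))    # [1, 4, 11]
```

If you can rewrite your inner loop as the first shape, the GPU can help; if it's irreducibly the second shape, it can't.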
5. Microsecond-latency requirements
CUDA kernel launch has overhead: ~5–10 microseconds just to start a kernel, even a trivial one. For high-frequency trading, real-time control systems, or anything where sub-100μs response is required, this overhead is unacceptable.
HFT, for example: stays almost entirely on CPU + FPGA. The determinism and sub-microsecond latency of CPU cache-resident code is irreplaceable.
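The launch-overhead arithmetic from above: with a fixed ~5 μs cost per kernel, the fraction of time wasted depends only on how much work each kernel does.

```python
def launch_overhead_fraction(work_us: float, launch_us: float = 5.0) -> float:
    """Fraction of wall time spent launching the kernel rather than computing."""
    return launch_us / (launch_us + work_us)

for work in (5, 50, 5000):  # microseconds of GPU work per kernel
    frac = launch_overhead_fraction(work)
    print(f"{work:>5} us of work -> {frac:.0%} launch overhead")
# Trivial kernels are ~50% launch cost; big kernels amortize it to nothing.
```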
6. Sporadic, low-frequency jobs
If your batch job runs once per day and takes 15 minutes, renting a GPU for 15 minutes ($0.05) vs CPU for 2 hours ($0.20) saves $0.15/day — $55/year. Not worth the operational complexity of managing GPU instances unless you have many such jobs.
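The arithmetic behind that $55, wrapped in a reusable sketch (the rates are the illustrative ones from this paragraph, not quotes):

```python
def yearly_savings(cpu_hours: float, cpu_rate: float,
                   gpu_hours: float, gpu_rate: float,
                   runs_per_year: int = 365) -> float:
    """Dollars saved per year by moving a recurring job from CPU to GPU."""
    per_run = cpu_hours * cpu_rate - gpu_hours * gpu_rate
    return per_run * runs_per_year

# 15-minute GPU run at $0.20/hr vs 2-hour CPU run at $0.10/hr, daily:
print(f"${yearly_savings(2.0, 0.10, 0.25, 0.20):.2f}/year")  # $54.75/year
```

Weigh that number against the engineer hours spent on drivers, CUDA versions, and OOM debugging.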
Decision tree: should you use a GPU?
12. Your First GPU Program
The fastest path from zero to running GPU code, depending on your background.
Easiest: Google Colab (free, no setup)
# In Google Colab:
# Runtime → Change runtime type → T4 GPU → Save
# Verify you have a GPU
import torch
print(torch.cuda.is_available()) # True
print(torch.cuda.get_device_name(0)) # NVIDIA T4
import subprocess
subprocess.run(["nvidia-smi"]) # Shows GPU memory and utilization
Data engineer: cudf.pandas in 5 minutes
# On Colab with GPU runtime:
!pip install cudf-cu12 --extra-index-url=https://pypi.nvidia.com -q
import cudf.pandas
cudf.pandas.install()
import pandas as pd
import numpy as np
# Generate 10M row dataset
n = 10_000_000
df = pd.DataFrame({
"user_id": np.random.randint(0, 100000, n),
"amount": np.random.randn(n) * 100,
"category": np.random.choice(["A", "B", "C", "D"], n),
})
# This now runs on GPU — same pandas API
result = df.groupby(["user_id", "category"]).agg(
total=("amount", "sum"),
avg=("amount", "mean"),
count=("amount", "count"),
).reset_index()
print(result.head())
print(f"Rows in result: {len(result):,}")
ML engineer: MNIST on GPU (30 minutes)
import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, transforms
from torch.utils.data import DataLoader
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Data
transform = transforms.Compose([
transforms.ToTensor(),
transforms.Normalize((0.1307,), (0.3081,))
])
train_data = datasets.MNIST(".", train=True, download=True, transform=transform)
loader = DataLoader(train_data, batch_size=256, shuffle=True, num_workers=2)
# Model
model = nn.Sequential(
nn.Flatten(),
nn.Linear(784, 256), nn.ReLU(), nn.Dropout(0.2),
nn.Linear(256, 128), nn.ReLU(), nn.Dropout(0.2),
nn.Linear(128, 10)
).to(device)
optimizer = optim.Adam(model.parameters())
criterion = nn.CrossEntropyLoss()
# Train 5 epochs
for epoch in range(5):
correct = 0
for x, y in loader:
x, y = x.to(device), y.to(device) # ← GPU transfer
optimizer.zero_grad()
out = model(x)
loss = criterion(out, y)
loss.backward()
optimizer.step()
correct += (out.argmax(1) == y).sum().item()
print(f"Epoch {epoch+1}: {correct/len(train_data)*100:.1f}% accuracy")
# On GPU: ~15 seconds total. On CPU: ~3 minutes.
LLM engineer: vLLM quickstart
# Requires GPU with at least 16 GB VRAM (A10G, A100, or L4)
pip install vllm
# Serve Mistral 7B
python -m vllm.entrypoints.openai.api_server \
--model mistralai/Mistral-7B-Instruct-v0.3 \
--dtype bfloat16 \
--max-model-len 4096
# Test it
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"mistralai/Mistral-7B-Instruct-v0.3",
"messages":[{"role":"user","content":"What is CUDA?"}],
"max_tokens":200}'
Cloud options for getting a GPU
| Platform | GPU | Price | Best for |
|---|---|---|---|
| Google Colab | T4 (free), A100 (Pro) | Free / $10/mo | Learning, prototyping |
| Kaggle Notebooks | T4 / P100 | Free (30 hrs/wk) | Learning, competitions |
| Lambda Labs | A10, A100, H100 | $0.50–$3.50/hr | Training, serious dev |
| Vast.ai | Wide variety | $0.20–$2/hr | Budget training |
| AWS (g4dn.xlarge) | T4 | $0.53/hr on-demand, $0.16 spot | Production inference |
| GCP (a2-highgpu-1g) | A100 40 GB | $3.67/hr on-demand, $1.10 spot | Production training |
13. GPU Memory Management
The number one source of production GPU headaches. Unlike system RAM (which can use swap to extend beyond physical limits), GPU VRAM is strictly bounded. Exceed it and you get an out-of-memory error, not a slowdown.
The error every ML engineer knows
# The most common GPU error you will encounter:
# RuntimeError: CUDA out of memory. Tried to allocate 2.00 GiB
# (GPU 0; 15.90 GiB total capacity; 13.47 GiB already allocated)
This happens because you tried to allocate more memory than the GPU has available. Common causes: batch size too large, model too large for your GPU, memory not being freed from previous iterations, or multiple processes sharing the same GPU.
How GPU memory gets used in a training job
import torch
# Check current GPU memory usage
def print_memory_stats():
allocated = torch.cuda.memory_allocated() / 1e9
reserved = torch.cuda.memory_reserved() / 1e9
print(f"Allocated: {allocated:.2f} GB | Reserved: {reserved:.2f} GB")
# Memory is consumed by:
# 1. Model parameters (fixed, based on model size)
# 2. Gradients (same size as parameters, during backward)
# 3. Optimizer states (2x parameters for Adam: m + v)
# 4. Activations (proportional to batch size × model depth)
# 5. PyTorch's own allocator overhead
# For a 7B parameter model in FP16:
# Parameters: 7B × 2 bytes = 14 GB
# Gradients: = 14 GB
# Adam optimizer: = 28 GB (2 states × 14 GB)
# Total minimum: = 56 GB ← why H100 80GB is popular
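The breakdown in those comments can be wrapped into a rough estimator. This sketch counts only parameters, gradients, and Adam states in a single dtype; activations (which scale with batch size and depth) and allocator overhead come on top:

```python
def training_vram_gb(params: float, bytes_per_param: int = 2,
                     optimizer_states: int = 2) -> float:
    """Minimum VRAM for weights + gradients + optimizer states, in GB.
    Ignores activations and allocator overhead."""
    weights = params * bytes_per_param
    grads = params * bytes_per_param
    optim = optimizer_states * params * bytes_per_param  # Adam: m and v
    return (weights + grads + optim) / 1e9

print(f"7B in FP16 + Adam:  {training_vram_gb(7e9):.0f} GB")    # 56 GB
print(f"70B in FP16 + Adam: {training_vram_gb(70e9):.0f} GB")   # 560 GB
```

The 70B figure shows why large-model training is always multi-GPU: no single card comes close.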
Strategies to reduce memory usage
import torch
from torch.amp import autocast # PyTorch 2.0+ API
# Strategy 1: Reduce batch size (most direct lever)
# batch_size = 256 → OOM
# batch_size = 32 → works. Use gradient accumulation to simulate large batch:
accumulation_steps = 8
optimizer.zero_grad()
for i, (x, y) in enumerate(loader):
x, y = x.to(device), y.to(device)
    with autocast("cuda"):  # defaults to FP16; pair with torch.amp.GradScaler in real training
output = model(x)
loss = criterion(output, y) / accumulation_steps
loss.backward()
if (i + 1) % accumulation_steps == 0:
optimizer.step()
optimizer.zero_grad()
# Strategy 2: Mixed precision (halves activation memory)
# FP32 → FP16: 2x memory reduction for activations
with autocast("cuda", dtype=torch.bfloat16):
output = model(x)
# Strategy 3: Gradient checkpointing (trade compute for memory)
# Recomputes activations during backward instead of storing them
# ~30% slower but saves 60-70% activation memory
from torch.utils.checkpoint import checkpoint
output = checkpoint(model.layers[0], x, use_reentrant=False)  # per layer
# For transformer models (HuggingFace):
model.gradient_checkpointing_enable()
# Strategy 4: Clear cache between phases
torch.cuda.empty_cache() # releases unused reserved memory (not allocated)
del tensor_you_no_longer_need # explicit deletion + Python GC
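Why dividing the loss by accumulation_steps works: the accumulated micro-batch gradients then sum to exactly the gradient of one large batch. A dependency-free check using a hand-derived least-squares gradient (d/dw of mean((w*x - t)^2) is mean(2*x*(w*x - t)); equal micro-batch sizes assumed):

```python
def grad_mse(w, xs, ts):
    """Gradient of mean squared error for the model y = w*x, derived by hand."""
    n = len(xs)
    return sum(2 * x * (w * x - t) for x, t in zip(xs, ts)) / n

w = 0.5
xs = [1.0, 2.0, 3.0, 4.0]
ts = [2.0, 4.0, 6.0, 8.0]

full = grad_mse(w, xs, ts)  # one big batch of 4

# Two micro-batches of 2; each contributes grad / accumulation_steps,
# mirroring `loss = criterion(...) / accumulation_steps` above.
accum = 0.0
for lo in range(0, 4, 2):
    accum += grad_mse(w, xs[lo:lo+2], ts[lo:lo+2]) / 2

print(full, accum)  # -22.5 -22.5
```

Same gradient, a fraction of the activation memory: only one micro-batch's activations live at a time.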
Monitoring GPU memory
# Watch GPU stats in terminal (refresh every 0.5s)
watch -n 0.5 nvidia-smi
# One-liner for memory summary
nvidia-smi --query-gpu=memory.used,memory.free,utilization.gpu \
--format=csv
# Install gpustat for a nicer view
pip install gpustat
gpustat --watch # colored, per-process breakdown
# Detailed per-tensor memory breakdown (great for debugging OOM)
print(torch.cuda.memory_summary(device=None, abbreviated=False))
# Profile memory during training to find the leak
with torch.profiler.profile(
activities=[torch.profiler.ProfilerActivity.CUDA],
profile_memory=True
) as prof:
output = model(x)
loss = criterion(output, y)
loss.backward()
print(prof.key_averages().table(sort_by="cuda_memory_usage"))
PCIe bottleneck: minimize CPU↔GPU transfers
# BAD: transferring small tensors one at a time
for item in large_list:
tensor = torch.tensor(item).to("cuda") # PCIe transfer per item
result = model(tensor)
# GOOD: batch your transfers
batch = torch.tensor(large_list) # one large CPU tensor
batch_gpu = batch.to("cuda") # one transfer: pays the fixed overhead once, not len(large_list) times
results = model(batch_gpu)
# GOOD: pin memory for faster CPU→GPU transfers
loader = DataLoader(
dataset,
batch_size=256,
pin_memory=True, # allocates page-locked memory (~1.5x faster transfers)
num_workers=4, # load CPU data in background while GPU computes
)
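The "batch your transfers" advice is really about fixed per-transfer cost. A sketch with illustrative numbers (~10 μs of setup per copy and 32 GB/s of usable PCIe bandwidth are assumptions, not measurements):

```python
def copy_time_us(num_bytes: float, per_copy_overhead_us: float = 10.0,
                 bandwidth_gb_s: float = 32.0) -> float:
    """Time for one host-to-device copy: fixed setup cost + bytes / bandwidth."""
    return per_copy_overhead_us + num_bytes / (bandwidth_gb_s * 1e9) * 1e6

item_bytes = 4_096
n_items = 10_000

one_at_a_time = n_items * copy_time_us(item_bytes)  # pays overhead 10,000 times
batched = copy_time_us(item_bytes * n_items)        # pays it once

print(f"{one_at_a_time/1000:.1f} ms vs {batched/1000:.3f} ms")
```

With these numbers the batched copy is roughly 80x faster for the same 40 MB of data, which is why per-item `.to("cuda")` calls inside a loop are the classic PCIe antipattern.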
Beyond one GPU: parallelism strategies
- Data parallelism (DDP): the same model replicated on each GPU, different data shards; gradients are averaged each step. Requires the model to fit on one GPU.
- Tensor parallelism: individual weight matrices split across GPUs. Used by Megatron-LM. Complex to implement.
- Pipeline parallelism: different layers run on different GPUs, with micro-batches flowing through the pipeline. Supported by DeepSpeed's pipeline engine and Megatron. (DeepSpeed ZeRO is a different technique: it shards optimizer state, gradients, and parameters across data-parallel workers.)
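The heart of data parallelism is an all-reduce that averages gradients so every replica takes an identical optimizer step. A dependency-free sketch of that arithmetic (NCCL performs it ring-wise over NVLink, but the result is the same):

```python
def all_reduce_mean(per_worker_grads: list[list[float]]) -> list[float]:
    """Average gradients elementwise across workers, as an all-reduce
    followed by division by world size would."""
    world = len(per_worker_grads)
    return [sum(g) / world for g in zip(*per_worker_grads)]

# 4 workers, each holding a gradient for a 3-parameter model:
grads = [
    [0.1, 0.4, -0.2],
    [0.3, 0.0, -0.4],
    [0.1, 0.2,  0.2],
    [0.1, 0.2,  0.0],
]
print([round(v, 2) for v in all_reduce_mean(grads)])  # [0.15, 0.2, -0.1]
```

Because every worker ends up with the same averaged gradient, the replicas never drift apart, which is what makes DDP conceptually simple compared with tensor or pipeline parallelism.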
14. Cost & ROI
Before renting a GPU, you need numbers. Here are real ones.
GPU cloud pricing (2026)
| GPU | VRAM | On-demand ($/hr) | Spot ($/hr) | 1-yr reserved ($/hr) | Provider |
|---|---|---|---|---|---|
| T4 | 16 GB | $0.35–0.53 | $0.10–0.16 | $0.22 | Lambda, AWS, GCP |
| A10G | 24 GB | $0.75–1.10 | $0.25–0.40 | $0.45 | AWS, Lambda |
| A100 (40 GB) | 40 GB | $2.00–3.00 | $0.60–1.00 | $1.20 | Lambda, GCP, AWS |
| A100 (80 GB) | 80 GB | $2.50–4.00 | $0.75–1.50 | $1.50 | Lambda, Azure, GCP |
| H100 (80 GB) | 80 GB | $3.50–7.00 | $1.50–3.00 | $2.00–2.50 | CoreWeave, Lambda, Azure |
| H100 (8x DGX) | 640 GB | $28–56/hr | $12–24/hr | $16–20/hr | CoreWeave, GCP, Azure |
ROI scenarios
| Workload | CPU cost | GPU cost | GPU savings | Break-even |
|---|---|---|---|---|
| LLM inference (1M tokens/day, Llama 8B) | $12.00/day (CPU cluster) | $0.50/day (T4, spot) | 96% cheaper | Immediate |
| Data processing (1 TB/day ETL) | $18/day (large CPU instance) | $3/day (8x T4 spot, 45 min) | 83% cheaper | Immediate |
| ML training (1B param model) | $240 (CPU, 4 days) | $18 (A100, 6 hrs) | 92% cheaper | Immediate |
| Batch image processing (10M images/day) | $8/day | $1.50/day | 81% cheaper | Immediate |
| Small analytics job (10 GB, twice/week) | $0.40/week (c5.2xlarge) | $0.80/week (G4dn, setup overhead) | CPU is 2x cheaper | Never |
When GPU is NOT worth it (cost perspective)
- Jobs that run <2 hours/week: operational overhead (driver versions, CUDA compatibility, OOM debugging) costs more in engineer time than the savings.
- Data under 500 MB: a fast CPU instance is often cheaper and simpler.
- Highly variable, unpredictable workloads: GPU reserved pricing is only worthwhile at high utilization (70%+).
- Team has no GPU expertise: onboarding cost. If nobody on the team has debugged a CUDA OOM error, factor in learning time.
Cost optimization tips
- Spot instances: 50–70% off on-demand. Use for training (can checkpoint), not for always-on inference.
- Right-size your GPU: a T4 at $0.35/hr often handles batch inference jobs that engineers assume need an A100. Test before paying 10x more.
- Batch requests: GPU throughput is highest with large batches. A server handling 1 req/s at batch_size=1 might use 5% GPU utilization. Batch 20 requests together and you use 90% at the same cost.
- Quantization: INT8 models run in half the VRAM at roughly twice the throughput of FP16. Use LLM.int8() via bitsandbytes or TensorRT INT8 for inference.
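The quantization arithmetic, for weights only (KV cache and runtime overhead add several GB on top):

```python
def weights_vram_gb(params: float, bits: int) -> float:
    """VRAM for model weights alone at a given precision."""
    return params * bits / 8 / 1e9

for bits, name in [(16, "FP16/BF16"), (8, "INT8"), (4, "INT4")]:
    print(f"70B @ {name}: {weights_vram_gb(70e9, bits):.0f} GB")
# 140 GB -> 70 GB -> 35 GB: halving precision halves VRAM, which is why
# a 70B model in INT4 fits on a single 48 GB card.
```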
15. The Vera Rubin Generation (GTC 2026)
NVIDIA's GTC 2026 announcements define the direction for the next 2–3 years. Here's what's coming and what it means for you.
Blackwell vs. Vera Rubin: architecture timeline
These three product names are often conflated — here's the precise breakdown:
- B200 — Blackwell GPU die (standalone data center GPU, 192 GB HBM3e). Current flagship, shipping 2025.
- GB200 — Grace Blackwell: a single package combining the B200 GPU die with a Grace ARM CPU. Eliminates PCIe bottleneck — unified memory pool at ~900 GB/s CPU↔GPU bandwidth instead of ~32 GB/s via PCIe.
- Vera Rubin — the next architecture after Blackwell (announced GTC 2026, shipping H2 2026–2027). Vera Rubin succeeds Blackwell the way Blackwell succeeded Hopper (H100). Its GPU die (~2,250 TFLOPS FP16) will appear in both standalone and Grace-paired (GR200) form factors.
The 192 GB HBM3e VRAM spec applies to the B200 (standalone) and the GB200 (Grace Blackwell) package today. Vera Rubin will bring further capacity and bandwidth increases.
BlueField-4 DPU
The BlueField-4 DPU (Data Processing Unit) offloads networking and security operations from the CPU/GPU. In a dense GPU cluster, a significant portion of CPU time is consumed by network protocol handling, encryption, and storage I/O. DPUs move this to dedicated silicon, freeing the GPU and CPU for computation. For ML clusters, this means higher effective GPU utilization at scale.
NVLink at rack scale
With NVLink Switch Systems, NVIDIA is scaling from 8 GPUs connected (HGX today) to entire racks and eventually buildings of GPUs acting as a unified compute fabric. The NVLink 5 system (announced 2026) connects up to 576 GPUs with 1.8 TB/s per-GPU bandwidth — enough to train models too large to fit on any single cluster today.
"Huang's Law"
Unlike Moore's Law (transistor density), Huang's Law describes GPU performance gains from three compounding factors: new compute architectures, increased memory bandwidth, and improved interconnects. NVIDIA has delivered roughly 1,000x AI performance improvement per decade — faster than traditional semiconductor scaling.
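As a sanity check, 1,000x per decade compounds to roughly 2x per year, comfortably ahead of classic CPU scaling:

```python
# 1,000x in 10 years -> annual growth factor is the 10th root of 1,000
annual = 1000 ** (1 / 10)
print(f"Huang's Law: {annual:.2f}x per year")   # ~2.00x per year

# Moore's-Law-era CPU scaling: ~2x every 18 months, expressed per year
moore = 2 ** (12 / 18)
print(f"Moore's Law: {moore:.2f}x per year")    # ~1.59x per year
```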
Platform integration: where GPUs are appearing
| Platform | GPU integration | For you |
|---|---|---|
| Snowflake | GPU-accelerated queries (RAPIDS inside) | SQL queries run on GPU, no config change |
| Databricks | GPU clusters for ML, Spark RAPIDS | Check "GPU enabled" when creating cluster |
| Google BigQuery | GPU-accelerated ML functions | ML.PREDICT can use GPU backend |
| AWS SageMaker | Managed GPU training + inference | Select GPU instance type in training job config |
| Azure ML | GPU compute clusters, NVIDIA AI Enterprise | Standard for enterprise ML in Azure ecosystem |
The implication for your career: GPU skills are transitioning from "ML specialist" to "data engineer and backend engineer baseline knowledge." The platforms you already use are quietly adding GPU acceleration under the hood, and understanding the model helps you use them effectively.
16. Key Terminology Glossary
Compute concepts
| Term | Definition |
|---|---|
| CUDA core | Basic FP32 arithmetic unit on NVIDIA GPU. H100 has 16,896. Not the same as a CPU core — much simpler, designed for throughput not latency. |
| Tensor Core | Specialized unit for matrix multiply-accumulate in mixed precision (FP16/BF16/INT8/FP8). Much faster than CUDA cores for deep learning. H100 has 528 Tensor Cores. |
| SM (Streaming Multiprocessor) | The major organizational unit of a GPU — like a CPU core, but with many more arithmetic units. H100 has 132 SMs, each with 128 CUDA cores. |
| Warp | Group of 32 threads that execute the same instruction in lock-step (SIMT). The fundamental scheduling unit of a GPU SM. |
| Block | Programmer-defined group of 1–1,024 threads that share shared memory and can synchronize. Scheduled to run on a single SM. |
| Grid | The full set of blocks launched for a single kernel call. Can contain millions of blocks. |
| Kernel | A function executed on the GPU by many threads simultaneously. Written in CUDA C/C++ or Triton, called from host (CPU) code. |
| Occupancy | Ratio of active warps to maximum possible warps on an SM. Higher occupancy generally means better latency hiding. Low occupancy = underutilized GPU. |
| Kernel fusion | Combining multiple GPU operations into a single kernel to reduce memory round-trips. Key optimization in TensorRT and FlashAttention. |
Memory concepts
| Term | Definition |
|---|---|
| HBM (High Bandwidth Memory) | 3D-stacked DRAM used as GPU VRAM. HBM3 achieves 3.35 TB/s bandwidth — 10x+ faster than DDR5 system RAM. |
| Shared memory | Fast on-chip SRAM shared by all threads in a block. Up to ~228 KB per SM on H100 (configurable split with L1). The programmer controls what goes here. Think of it as a manually managed L1 cache. |
| Pinned memory | CPU RAM that is page-locked and not swappable. CPU→GPU transfers from pinned memory are ~2x faster than pageable memory. |
| Memory coalescing | When 32 threads in a warp access consecutive memory addresses, they're served in a single memory transaction. Non-coalesced access (scattered addresses) requires multiple transactions — major performance penalty. |
| NVLink | Direct GPU-to-GPU interconnect, bypassing PCIe. NVLink 4.0: 900 GB/s. Used in multi-GPU training to share gradients efficiently. |
| NVSwitch | Switch chip that creates all-to-all NVLink topology in DGX systems. Every GPU can communicate with every other GPU at full NVLink speed. |
| PCIe | Standard bus connecting GPU to CPU. Gen 5: 64 GB/s. The bottleneck for CPU↔GPU data transfer: roughly 14x slower than NVLink 4.0's 900 GB/s. |
Precision formats
| Format | Bits | Range | Use case |
|---|---|---|---|
| FP64 | 64 | ±10^308 | Scientific computing where precision matters. CPUs are great; GPUs are slower. |
| FP32 | 32 | ±10^38 | Default for training. Safe, less likely to overflow or underflow. |
| TF32 | 19 (internally) | Same as FP32 | NVIDIA internal format for Tensor Cores. Auto-used in PyTorch by default. Same accuracy as FP32, 3x faster. |
| BF16 | 16 | Same range as FP32 | Preferred for LLM training. Same exponent as FP32 (no overflow), lower mantissa precision. Supported on A100+. |
| FP16 | 16 | ±65,504 | Mixed precision training on older GPUs (V100, T4). Smaller range than BF16 — requires gradient scaling to prevent underflow. |
| FP8 | 8 | Limited | Inference quantization. H100/H200 support FP8 Tensor Cores natively. ~2x speed vs FP16 with careful calibration. |
| INT8 | 8 | -128 to 127 | Post-training quantization for inference. Up to 4x throughput vs FP32. Requires calibration dataset. TensorRT INT8 is standard. |
| INT4 | 4 | -8 to 7 | Aggressive quantization (llama.cpp, GGUF format). Fits 70B model on 48 GB VRAM. ~1% accuracy loss vs FP16. |
Performance metrics
| Term | Definition |
|---|---|
| FLOPS | Floating-Point Operations Per Second. Measures compute throughput. |
| TFLOPS | Tera-FLOPS = 10^12 FLOPS. H100: 67 TFLOPS (FP32), 989 TFLOPS (TF32), 1,979 TFLOPS (FP16). |
| PFLOPS | Peta-FLOPS = 10^15 FLOPS. Top500 supercomputer benchmark uses PFLOPS. A DGX H100 (8 GPUs): ~32 PFLOPS (FP16). |
| MFU (Model FLOPs Utilization) | Percentage of theoretical peak FLOPS your training actually achieves. 40–60% MFU is good. Lower means memory-bound or poorly batched. |
| Quantization | Reducing numeric precision of model weights (FP32 → INT8 → INT4) to shrink VRAM usage and increase throughput. Slight accuracy trade-off. |
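MFU is worth computing for any training job. A sketch using the common ~6 × params FLOPs-per-token approximation for transformer training (the throughput and peak figures below are illustrative, not benchmarks):

```python
def mfu(params: float, tokens_per_s: float, peak_tflops: float,
        flops_per_token_factor: float = 6.0) -> float:
    """Model FLOPs Utilization: achieved training FLOPs / theoretical peak.
    Uses the ~6 * params FLOPs-per-token approximation for transformers."""
    achieved = flops_per_token_factor * params * tokens_per_s
    return achieved / (peak_tflops * 1e12)

# 7B model at 4,000 tokens/s on one GPU with a 989 TFLOPS peak:
print(f"MFU: {mfu(7e9, 4_000, 989):.0%}")  # 17%
```

A result well under the 40–60% "good" range from the table usually means the job is memory-bound, poorly batched, or stalling on data loading.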
17. Learning Roadmap
A structured path from "never used a GPU" to "production GPU engineer," organized by week with concrete deliverables.
Week 1: First contact — understand the model
- Set up Google Colab with GPU runtime. Run nvidia-smi, check CUDA version.
- Run the MNIST training example from Section 12. Measure GPU vs CPU time.
- Experiment: change batch size from 32 to 1024. What happens to GPU utilization? Speed?
- Read the CUDA Programming Guide's first 3 chapters (free on NVIDIA docs). Just for mental model, no coding required.
- Deliverable: A training loop that runs on both CPU and GPU, with timing comparison printout.
Week 2: Real workloads — memory and performance
- Train a real model: fine-tune distilbert-base-uncased on a sentiment dataset (HuggingFace Trainer makes this ~20 lines).
- Intentionally OOM your GPU: set batch_size very high. Read the error. Fix it with gradient accumulation.
- Enable mixed precision (torch.amp). Measure speed and memory difference vs FP32.
- Run torch.cuda.memory_summary() and understand the output.
- Deliverable: Fine-tuned sentiment model saved to disk, training script with AMP and memory profiling.
Week 3: Data and inference acceleration
- Install cuDF on Colab. Run a groupby on a 10M row DataFrame. Compare to pandas timing.
- Deploy a 7B model with vLLM on a rented GPU (Lambda Labs A10G = $0.75/hr). Benchmark tokens/sec.
- Enable weight quantization in vLLM (--quantization awq; note AWQ is 4-bit, not INT8). Compare memory usage and throughput vs BF16.
- Cross-reference: read the PyTorch Refresher for training patterns, LLMs Refresher for inference architecture.
- Deliverable: vLLM server running, OpenAI-compatible endpoint tested, performance numbers documented.
Week 4: Multi-GPU and production readiness
- Run a DDP (DistributedDataParallel) training job on 2 GPUs using torchrun. Measure scaling efficiency (how close you get to 2x).
- Explore TensorRT: export a PyTorch model, build a TRT engine, benchmark vs PyTorch eager mode.
- Cost exercise: calculate the break-even point for GPU vs CPU for your most compute-intensive work task.
- Study the "When NOT to Use" section critically: identify 3 things in your current stack where someone might cargo-cult a GPU and explain why it wouldn't help.
- Deliverable: Written analysis: "Should we use GPU for [X at your job]?" with numbers.
Resources
| Resource | Format | Best for |
|---|---|---|
| NVIDIA DLI: Fundamentals of Accelerated Computing | Interactive course (~8 hrs) | Hands-on CUDA C from scratch |
| Programming Massively Parallel Processors (Kirk & Hwu) | Textbook | Deep architecture understanding |
| fast.ai Practical Deep Learning | Free course | Applied ML with GPU, top-down teaching style |
| Andrej Karpathy: Zero to Hero | YouTube series | Build GPT from scratch, includes GPU training |
| Triton Tutorials | Code + docs | Write custom GPU kernels in Python (no CUDA C) |
| PyTorch Official Tutorials | Interactive | PyTorch fundamentals with GPU examples |
| NVIDIA GTC recordings (YouTube) | Talks (30–60 min each) | Latest techniques, architecture announcements |
For most engineers, the first GPU win is as simple as adding .to("cuda") to a PyTorch training loop. Start with the practical use cases (Sections 4–8) that apply to your current work. Come back to the architecture sections when you hit performance problems and need to diagnose why.