Accelerated Computing Refresher
GPUs, CUDA, and NVIDIA's stack — what accelerated computing is, when it helps, and how to start using it. Written for engineers who've never touched a GPU.
1. Why Accelerated Computing?
For decades, your code got faster for free. Every 18 months, the new CPU was 2x as fast. You could ship slower code and rely on hardware to bail you out. That era ended around 2005.
CPU clock speeds hit a wall at roughly 4 GHz. Two fundamental limits collide: power density (faster clocks burn more power, which melts chips) and Dennard scaling (the property that let transistors run cooler as they shrank stopped working at small node sizes). Intel's "4 GHz barrier" from 2004 is still, in 2026, a barrier.
The answer the industry landed on: do many things at once instead of one thing faster. CPUs added more cores (2, 4, 8, 64). But GPUs took this to an extreme: thousands of simpler cores working in parallel.
The core insight: latency vs. throughput
CPUs are optimized for latency — getting one task done as fast as possible. They have large caches, branch predictors, out-of-order execution units, and speculative execution — all machinery that reduces how long a single thread waits.
GPUs are optimized for throughput — doing as many tasks simultaneously as possible. They sacrifice single-thread speed for the ability to run thousands of threads at once, all operating on different data.
Think of it this way: a CPU is a few elite sprinters. A GPU is a stadium of ordinary runners — slower individually, but together they cover vastly more ground.
CPU vs GPU at a glance
| Property | Modern CPU (e.g., AMD EPYC) | Modern GPU (e.g., H100 SXM5) |
|---|---|---|
| Core count | 8–128 cores | 2,000–20,000 CUDA cores |
| Clock speed | 3–5 GHz | 1–2 GHz |
| Memory bandwidth | 50–300 GB/s (DDR5) | 500–3,350 GB/s (HBM3e) |
| Memory capacity | Up to 6 TB (server) | 24–192 GB (VRAM) |
| Peak FLOPS | ~2 TFLOPS (FP32) | ~67 TFLOPS (FP32 CUDA cores); ~494 TFLOPS dense TF32 on Tensor Cores (989 with sparsity) — H100 SXM5 |
| Best for | Sequential, branchy, latency-sensitive | Parallel, data-parallel, throughput |
| Programmed with | Any language | CUDA/HIP/SYCL or high-level frameworks |
Jensen Huang, NVIDIA's CEO, declared at GTC 2026: "All computing will be accelerated computing." Whether or not you believe the absolutism, the direction is clear. ML training, data pipelines, LLM inference, vector search — workloads that used to be CPU-bound are routinely 10–100x faster on GPU today.
2. GPU vs CPU Architecture
You know CPUs. Let's build GPU understanding from that foundation.
What makes a CPU fast for your typical code
When you call a function that does an if/else based on a database result, then calls another function based on that result — this is a control-flow-heavy, data-dependent workload. CPUs are engineered for exactly this pattern:
- Branch predictor: guesses which branch you'll take, starts executing it before the condition is known. 95%+ accurate on typical code.
- Out-of-order execution: reorders instructions to keep execution units busy while waiting on memory.
- Large L1/L2/L3 caches: a single core typically has 32–64 KB of L1 and ~1 MB of L2, backed by tens of MB of shared L3. Getting data fast for the thread that needs it now.
- Speculative execution: runs code ahead of where you logically are, rolls back if wrong.
All of this silicon is in service of making one thread run fast.
What a GPU is actually doing
A GPU is built around a different assumption: you have a regular, data-parallel problem. The same operation applied to millions of independent data points. "Add these two arrays." "Multiply these matrices." "Apply this activation function to these 65,536 values."
GPUs strip out most of the complexity that makes CPUs fast at sequential code, and put that transistor budget into more compute units instead.
Key GPU architectural concepts you need to know:
Streaming Multiprocessors (SMs)
An H100 has 132 SMs. Each SM is something like a mini-CPU with its own set of CUDA cores, registers, shared memory, and schedulers. Think of them as departments in a very large company — each runs somewhat independently.
SIMT: Single Instruction, Multiple Threads
Within an SM, threads are grouped into warps of 32. All 32 threads in a warp execute the same instruction at the same time, but each on different data. This is SIMT: Single Instruction, Multiple Threads.
If thread 0 is adding element 0 of array A to element 0 of array B, thread 1 is simultaneously adding element 1 to element 1, ..., thread 31 is adding element 31. One instruction, 32 results produced in parallel.
Warp divergence: why branching hurts on GPUs
This is the single most important thing to understand about GPU performance.
Since 32 threads share an instruction, what happens when some threads need to take an if branch and others need to take the else? Both paths execute. The threads that aren't supposed to be on a given path are masked off (their results are discarded), but the time is spent. You get worst-case performance: sequential execution of both paths, zero parallelism.
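The usual mitigation is to make the branch data-parallel: compute a select instead of jumping, so every lane executes the same instruction stream. A minimal NumPy sketch of the principle (CPU code illustrating the shape GPU-friendly code takes, not actual GPU code):

```python
import numpy as np

# Branchy version: per-element if/else. On a GPU, a warp whose 32 threads
# disagree on the condition would execute BOTH paths serially.
def relu_branchy(x):
    out = np.empty_like(x)
    for i in range(len(x)):
        if x[i] > 0:
            out[i] = x[i]
        else:
            out[i] = 0.0
    return out

# Branchless version: the "branch" becomes a data-parallel select —
# every lane runs the same instructions, results are masked, not skipped.
def relu_branchless(x):
    return np.where(x > 0, x, 0.0)

x = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])
assert np.array_equal(relu_branchy(x), relu_branchless(x))
```

This is why ML kernels are written as masked arithmetic rather than per-element control flow.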
Memory hierarchy: bandwidth over latency
GPUs have a very different memory structure than CPUs:
The key takeaway: GPU global memory (HBM) has enormous bandwidth — 3.35 TB/s on an H100. Your CPU can barely manage 200 GB/s on a high-end server. This is why matrix multiplications are so fast on GPU: they need to read a lot of data quickly, and HBM delivers it.
But the CPU-to-GPU transfer over PCIe is slow (~32 GB/s). Every time you move data from RAM to VRAM or back, you pay this cost. It's the primary reason small datasets don't benefit from GPU acceleration: the transfer overhead exceeds the compute savings.
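The transfer cost is simple division — a back-of-envelope sketch assuming ~32 GB/s effective one-way bandwidth (real transfers add latency and setup cost on top):

```python
# Back-of-envelope: time to move data over PCIe at ~32 GB/s, one direction.
PCIE_GBPS = 32  # roughly PCIe 4.0 x16, one direction

def transfer_ms(size_mb):
    """Milliseconds to move size_mb megabytes at PCIE_GBPS GB/s."""
    return size_mb / 1024 / PCIE_GBPS * 1000

for size in (10, 1024, 10 * 1024):  # 10 MB, 1 GB, 10 GB
    print(f"{size:>6} MB -> {transfer_ms(size):.2f} ms each way")
```

A 10 GB DataFrame costs ~300 ms each way — fine if the GPU then saves you minutes of compute, ruinous if it saves you milliseconds.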
| Property | CPU Core (Zen 4) | GPU SM (H100) |
|---|---|---|
| Per-core clock | ~5 GHz | ~1.8 GHz |
| L1 cache | 32–64 KB | 256 KB (shared/L1) |
| Branch prediction | Yes, very good | None (warps) |
| Out-of-order exec | Yes | No |
| Threads per unit | 1 (+ hyperthreading) | 2,048 concurrent threads per SM |
| Designed for | Low-latency, single-thread speed | High-throughput, data-parallel work |
3. The CUDA Programming Model
You almost certainly will never write raw CUDA code in your career. But understanding the mental model is what lets you reason about performance, debug memory errors, and pick the right tool for your problem. Think of this as reading a map before you drive.
The execution hierarchy: Grid → Block → Thread → Warp
When you launch a GPU computation, you're launching a kernel: a function that executes on the GPU, run by thousands of threads simultaneously.
Those threads are organized in a hierarchy:
- Thread: the basic unit of work. Each thread runs your kernel function once, with a unique ID.
- Block: a group of up to 1,024 threads. Threads in the same block can share memory and synchronize with each other.
- Grid: all the blocks launched by a single kernel call. Can be millions of blocks.
- Warp: 32 threads that physically execute together (hardware concept, not something you configure).
A concrete example: vector addition
The "hello world" of GPU programming. Add two arrays, element by element. On a CPU, you'd write a loop. On a GPU, each thread handles one addition:
// CUDA C++ kernel — each thread adds one element
__global__ void vector_add(float *a, float *b, float *result, int n) {
// Figure out which element this thread handles
int idx = blockIdx.x * blockDim.x + threadIdx.x;
// Guard: some blocks may be launched with threads beyond array end
if (idx < n) {
result[idx] = a[idx] + b[idx];
}
}
// Host code: launch the kernel
int main() {
int n = 1000000; // 1 million elements
float *d_a, *d_b, *d_result;
// ... allocate and populate h_a, h_b, h_result on host (CPU) side ...
// Allocate GPU memory
cudaMalloc(&d_a, n * sizeof(float));
cudaMalloc(&d_b, n * sizeof(float));
cudaMalloc(&d_result, n * sizeof(float));
// Copy data from CPU to GPU
cudaMemcpy(d_a, h_a, n * sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(d_b, h_b, n * sizeof(float), cudaMemcpyHostToDevice);
// Launch: 256 threads per block, enough blocks to cover all n elements
int threads = 256;
int blocks = (n + threads - 1) / threads; // = 3907 blocks
vector_add<<<blocks, threads>>>(d_a, d_b, d_result, n);
// Copy results back to CPU
cudaMemcpy(h_result, d_result, n * sizeof(float), cudaMemcpyDeviceToHost);
// 3,907 blocks × 256 threads ≈ 1M threads launched; the GPU schedules them across SMs
cudaFree(d_a); cudaFree(d_b); cudaFree(d_result);
return 0;
}
Three phases: load (copy data to GPU), launch (run kernel), read back (copy results to CPU). This pattern is universal. The transfer cost is why small datasets often don't benefit.
Thread indexing: how threads know what to work on
Every thread can read three built-in variables to figure out which piece of data it should process:
- threadIdx.x: thread's position within its block (0 to blockDim.x-1)
- blockIdx.x: which block this thread is in
- blockDim.x: how many threads are in each block
The global element index: int idx = blockIdx.x * blockDim.x + threadIdx.x. Block 0 threads handle elements 0–255, block 1 threads handle 256–511, etc.
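The formula can be sanity-checked with plain Python arithmetic — a toy model of the index computation, nothing CUDA-specific:

```python
# Pure-Python model of CUDA's global index: shows how (blockIdx, threadIdx)
# pairs tile the whole array with no gaps and no overlaps.
def global_index(block_idx, block_dim, thread_idx):
    return block_idx * block_dim + thread_idx

block_dim = 256
# Block 0 covers elements 0..255, block 1 covers 256..511, and so on.
assert global_index(0, block_dim, 0) == 0
assert global_index(0, block_dim, 255) == 255
assert global_index(1, block_dim, 0) == 256
assert global_index(15, block_dim, 3) == 15 * 256 + 3
```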
In practice you'll almost never write __global__ functions yourself in production ML or data work — the frameworks in the following sections have already written them for you.
4. "My Pandas Code Takes 20 Minutes" — GPU for Data
This is the most accessible GPU use case for most software engineers. No ML knowledge required. Your existing pandas, Spark, or Polars code can often run dramatically faster on a GPU with minimal changes.
cudf.pandas: zero code changes, 10–150x faster
NVIDIA's cuDF library implements the pandas API on the GPU. The cudf.pandas accelerator is a drop-in: add one line at the top of your script, and pandas operations run on GPU transparently.
# Before: vanilla pandas, 20+ minutes on 10GB CSV
import pandas as pd
df = pd.read_csv("transactions_10gb.csv")
result = (
df.groupby(["customer_id", "product_category"])
.agg({"amount": ["sum", "mean", "count"]})
.reset_index()
.merge(df[["customer_id", "region"]].drop_duplicates(), on="customer_id")
)
result.to_csv("output.csv", index=False)
# After: cudf.pandas accelerator — identical code, GPU execution
import cudf.pandas # ← the only change
cudf.pandas.install()
import pandas as pd # pandas is now GPU-accelerated
df = pd.read_csv("transactions_10gb.csv")
result = (
df.groupby(["customer_id", "product_category"])
.agg({"amount": ["sum", "mean", "count"]})
.reset_index()
.merge(df[["customer_id", "region"]].drop_duplicates(), on="customer_id")
)
result.to_csv("output.csv", index=False)
# Runtime: ~25 seconds instead of 23 minutes (55x faster on T4 GPU)
# Install
pip install cudf-cu12 # for CUDA 12.x
# Or on Colab: !pip install cudf-cu12 --extra-index-url=https://pypi.nvidia.com
When cudf.pandas falls back to CPU
Not every pandas operation has a GPU implementation. cudf.pandas automatically falls back to CPU for unsupported operations, so your code never breaks — it just won't always be GPU-accelerated. Check with:
# From the command line, run your script under the profiler to see
# which operations executed on GPU vs fell back to CPU:
python -m cudf.pandas --profile my_script.py
# In Jupyter, prefix a cell with the magic: %%cudf.pandas.profile
Spark RAPIDS: 5x faster Spark, no code changes
If you're using Apache Spark for large-scale data processing, RAPIDS Accelerator for Spark is a plugin that runs Spark SQL operations on GPUs. It intercepts physical plan nodes and replaces CPU executors with GPU equivalents — zero code changes required, just a config flag.
# Add to spark-submit
--conf spark.plugins=com.nvidia.spark.SQLPlugin \
--conf spark.rapids.sql.enabled=true \
--conf spark.executor.resource.gpu.amount=1
# Or in PySpark:
spark = SparkSession.builder \
.config("spark.plugins", "com.nvidia.spark.SQLPlugin") \
.config("spark.rapids.sql.enabled", "true") \
.getOrCreate()
Benchmark: a typical ETL pipeline on 1TB of Parquet — Spark on CPU: 3.5 hours. Spark RAPIDS on 8x A100: 42 minutes. At cloud spot pricing, GPU actually costs less due to shorter runtime.
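The cost claim is just runtime × price. A sketch with hypothetical placeholder prices (not real cloud quotes — substitute your own rates):

```python
# Illustrative cost comparison for the 1TB ETL benchmark above.
# These $/hr figures are HYPOTHETICAL placeholders, not vendor pricing.
cpu_cluster_per_hr = 8.00    # assumed CPU cluster rate
gpu_cluster_per_hr = 14.00   # assumed 8x A100 spot rate

cpu_hours = 3.5              # CPU run: 3.5 hours
gpu_hours = 42 / 60          # GPU run: 42 minutes

cpu_cost = cpu_cluster_per_hr * cpu_hours
gpu_cost = gpu_cluster_per_hr * gpu_hours
print(f"CPU: ${cpu_cost:.2f}  GPU: ${gpu_cost:.2f}")
```

Even at a higher hourly rate, the shorter runtime can make the GPU cluster cheaper per job.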
Polars GPU backend
# Polars 1.x with GPU backend
# Install (run in terminal or Jupyter cell with ! prefix):
# pip install polars[gpu] # installs cuDF dependency
import polars as pl
# Enable GPU engine
df = pl.scan_parquet("data/*.parquet")
result = (
df.filter(pl.col("amount") > 100)
.group_by("category")
.agg(pl.col("amount").sum())
.collect(engine="gpu") # ← run on GPU
)
Benchmarks: when GPU data processing pays off
| Operation | Data Size | CPU Time | GPU Time (T4) | Speedup |
|---|---|---|---|---|
| groupby + agg | 1 GB | 45s | 2s | 22x |
| join two tables | 5 GB + 500 MB | 3.5 min | 8s | 26x |
| sort + dedup | 2 GB | 90s | 4s | 22x |
| string ops (ILIKE) | 500 MB | 25s | 12s | 2x (strings are hard) |
| custom Python lambda | Any | baseline | baseline (fallback) | 1x (falls back to CPU) |
When GPU data processing doesn't pay off:
- Data under ~100 MB: CPU+RAM is fast enough; transfer overhead dominates
- I/O-bound pipelines: if you're waiting on network/disk, the GPU sits idle
- Custom Python lambdas: df.apply(my_python_func) can't run on GPU — it needs the Python interpreter per row
- Streaming row-by-row: GPU excels at batch operations, not one-record-at-a-time processing
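The usual fix for the lambda case is rewriting the per-row function as column expressions, which cuDF can dispatch as whole-column kernels. A small sketch (plain pandas shown; the same code runs under cudf.pandas):

```python
import pandas as pd

df = pd.DataFrame({"amount": [50.0, 250.0, 1200.0]})

# Falls back to CPU under cudf.pandas: an arbitrary Python function
# must run in the interpreter, one row at a time.
slow = df["amount"].apply(lambda x: x * 1.08 if x > 100 else x)

# GPU-friendly rewrite: a vectorized select over the whole column.
fast = df["amount"].where(df["amount"] <= 100, df["amount"] * 1.08)

assert slow.equals(fast)
```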
For more on cuDF and data transform tools, see the Data Transformation Refresher.
5. "Training Takes 3 Days" — GPU for ML
This is the bread-and-butter GPU use case. Training a neural network is fundamentally matrix multiplication repeated millions of times — exactly what GPUs are built for.
The two-line GPU upgrade in PyTorch
import torch
import torch.nn as nn
# Check if GPU is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using: {device}") # "cuda" on Colab with GPU runtime
# Define a model
model = nn.Sequential(
nn.Linear(784, 512),
nn.ReLU(),
nn.Linear(512, 256),
nn.ReLU(),
nn.Linear(256, 10)
)
# Move model to GPU — this copies all parameters to VRAM
model = model.to(device)
# Training loop
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()
for epoch in range(10):
for batch_x, batch_y in dataloader:
# Move data to same device as model
batch_x = batch_x.to(device) # ← CPU→GPU transfer happens here
batch_y = batch_y.to(device)
optimizer.zero_grad()
output = model(batch_x) # forward pass on GPU
loss = criterion(output, batch_y)
loss.backward() # backward pass on GPU
optimizer.step()
That's it for a single GPU. model.to("cuda") and data.to("cuda") — two patterns, one rule. All computation between those .to(device) calls runs on the GPU.
Mixed precision training: 2x faster, half the memory
By default, PyTorch uses FP32 (32-bit floats). Switching to FP16 or BF16 halves memory usage and roughly doubles throughput on modern GPUs (which have Tensor Cores designed for FP16/BF16 matmul). The torch.amp (automatic mixed precision) module handles this automatically:
from torch.amp import autocast, GradScaler
scaler = GradScaler("cuda") # handles FP16 gradient scaling (prevents underflow)
for batch_x, batch_y in dataloader:
batch_x, batch_y = batch_x.to(device), batch_y.to(device)
optimizer.zero_grad()
# Forward pass in FP16 where safe, FP32 where precision matters
with autocast("cuda"):
output = model(batch_x)
loss = criterion(output, batch_y)
# Backward pass with scaled gradients
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
On an A100, BF16 training is typically 2–3x faster than FP32 with zero accuracy loss. On consumer GPUs (RTX 4090), FP16 with AMP gives similar benefits.
Multi-GPU training: DistributedDataParallel
import os
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
# Launch with: torchrun --nproc_per_node=4 train.py
dist.init_process_group(backend="nccl") # NCCL is the GPU comms library
local_rank = int(os.environ["LOCAL_RANK"])
device = torch.device(f"cuda:{local_rank}")
model = MyModel().to(device)
model = DDP(model, device_ids=[local_rank]) # wraps model for multi-GPU
# Each GPU processes a different shard of data
# DDP automatically averages gradients across all GPUs after backward()
# With 4xA100: effectively 4x larger batch, ~3.5x faster training
Training time comparison: CPU vs GPU
| Workload | CPU (32 cores) | Single T4 | Single A100 | 8x A100 |
|---|---|---|---|---|
| ResNet-50, 1 epoch (ImageNet) | ~6 hours | ~22 min | ~8 min | ~1 min |
| BERT-base fine-tune (MNLI) | ~18 hours | ~45 min | ~15 min | ~2 min |
| GPT-2 (117M) from scratch | days | ~8 hours | ~2.5 hours | ~20 min |
For more on PyTorch training patterns, see the PyTorch Refresher. For foundational ML concepts (loss functions, overfitting, regularization), see the Machine Learning Refresher.
6. "Serve 1000 LLM Requests/Second" — GPU for Inference
The hottest GPU use case in 2026. Serving large language models at scale is almost impossible without GPUs. Here's why, and how to do it.
Why LLMs need GPUs
A 7B parameter model has 7 billion floating-point numbers. At FP16 (2 bytes each), that's 14 GB just to store the weights — before you handle any requests. Each inference request involves multiplying these weights by the input embeddings repeatedly across 32+ transformer layers. That's billions of multiply-add operations per request.
On a CPU, generating one token for one user takes ~500ms. On a single A10G GPU, the same computation takes ~5ms — and the GPU can batch 50 concurrent users with minimal throughput penalty, making the effective throughput 100x higher than CPU.
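The 14 GB figure is plain arithmetic — parameter count times bytes per parameter:

```python
# Weight memory for an LLM: parameters x bytes per parameter.
def model_vram_gb(n_params, bytes_per_param=2):  # 2 bytes = FP16/BF16
    return n_params * bytes_per_param / 1e9

assert model_vram_gb(7e9) == 14.0     # 7B @ FP16 -> 14 GB of weights alone
assert model_vram_gb(7e9, 4) == 28.0  # same model @ FP32 -> 28 GB
assert model_vram_gb(70e9) == 140.0   # 70B @ FP16 -> needs multiple GPUs
```

And this is only weights — KV cache for in-flight requests comes on top, which is exactly the memory vLLM's PagedAttention manages.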
vLLM: the standard for LLM serving
vLLM is the dominant open-source LLM inference server. Its key innovation is PagedAttention: KV-cache memory management inspired by OS virtual-memory paging, which lets it batch many requests without wasting VRAM on fixed-size allocations.
# Install and start serving Llama 3.1 8B
pip install vllm
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Meta-Llama-3.1-8B-Instruct \
--gpu-memory-utilization 0.9 \
--max-model-len 8192 \
--dtype bfloat16
# Server starts on http://localhost:8000
# Compatible with OpenAI API format
# Call the vLLM server (OpenAI-compatible)
import openai
client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="none")
response = client.chat.completions.create(
model="meta-llama/Meta-Llama-3.1-8B-Instruct",
messages=[{"role": "user", "content": "Explain CUDA in one paragraph."}],
max_tokens=200,
)
print(response.choices[0].message.content)
# vLLM Python API (in-process, for batch jobs)
from vllm import LLM, SamplingParams
llm = LLM(model="meta-llama/Meta-Llama-3.1-8B-Instruct", dtype="bfloat16")
params = SamplingParams(temperature=0.7, max_tokens=200)
# Send 1000 prompts at once — vLLM batches them automatically
prompts = [f"Summarize: {text}" for text in texts]
outputs = llm.generate(prompts, params)
for output in outputs:
print(output.outputs[0].text)
TensorRT-LLM: maximum throughput
NVIDIA TensorRT-LLM compiles your model into an optimized inference engine using kernel fusion, quantization, and custom CUDA kernels. It's more complex to set up than vLLM but delivers higher peak throughput — 2–5x over naive PyTorch inference.
# TensorRT-LLM via Docker (recommended setup)
docker pull nvcr.io/nvidia/tritonserver:24.12-trtllm-python-py3
# Convert Llama 3.1 8B to TRT engine
python examples/llama/convert_checkpoint.py \
--model_dir /models/llama-3.1-8b \
--output_dir /tmp/trt-llama-ckpt \
--dtype bfloat16
trtllm-build \
--checkpoint_dir /tmp/trt-llama-ckpt \
--output_dir /tmp/trt-llama-engine \
--gemm_plugin bfloat16 \
--max_batch_size 64 \
--max_input_len 2048 \
--max_output_len 512
Cost math: GPU vs CPU for LLM inference
| Setup | Throughput (tokens/sec) | Cost/hr | Cost per 1M tokens |
|---|---|---|---|
| CPU (64-core server, Llama 3.1 8B) | ~50 tok/s | $2.40 | $13.33 |
| T4 GPU (16 GB), vLLM | ~1,200 tok/s | $0.53 | $0.12 |
| A10G GPU (24 GB), vLLM | ~3,500 tok/s | $1.06 | $0.08 |
| A100 GPU (80 GB), TRT-LLM | ~8,000 tok/s | $3.20 | $0.11 |
CPU inference for LLMs is roughly 25–100x more expensive per token than GPU inference. For any production serving load above a few requests per day, GPU pays for itself quickly.
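The cost-per-million-tokens column follows directly from the first two columns:

```python
# Derive cost per 1M tokens from hourly price and sustained throughput.
def cost_per_million_tokens(cost_per_hr, tokens_per_sec):
    tokens_per_hr = tokens_per_sec * 3600
    return cost_per_hr / tokens_per_hr * 1_000_000

assert round(cost_per_million_tokens(2.40, 50), 2) == 13.33    # CPU server
assert round(cost_per_million_tokens(0.53, 1200), 2) == 0.12   # T4 + vLLM
assert round(cost_per_million_tokens(1.06, 3500), 2) == 0.08   # A10G + vLLM
assert round(cost_per_million_tokens(3.20, 8000), 2) == 0.11   # A100 + TRT-LLM
```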
NVIDIA NIM: pre-packaged inference containers
NVIDIA NIM (NVIDIA Inference Microservices) packages optimized models into ready-to-deploy containers. You pull a container, it downloads the model and starts serving. No manual TensorRT-LLM setup required.
# Pull and run a NIM container for Llama 3.1 8B
export NGC_API_KEY=your_key_here
docker run -it --rm --gpus all \
-v ~/.cache/nim:/opt/nim/.cache \
-p 8000:8000 \
-e NGC_API_KEY \
nvcr.io/nim/meta/llama-3.1-8b-instruct:1.3.3
# Server is now running, OpenAI-compatible at port 8000
curl http://localhost:8000/v1/models
For more on LLMs and transformers, see the LLMs Refresher.
7. "Search 1M Vectors in 50ms" — GPU for Vector Search
If you're building RAG systems, semantic search, or recommendation engines, you're doing approximate nearest-neighbor (ANN) search over large vector collections. GPUs accelerate both index building and search dramatically.
CPU Faiss vs GPU Faiss
Meta's Faiss library is the standard for vector search. It has a GPU backend that can be plugged in with minimal code changes:
import numpy as np
import faiss
# Generate 1M vectors, 768 dimensions (typical embedding size)
d = 768
n = 1_000_000
vectors = np.random.randn(n, d).astype("float32")
query_vectors = np.random.randn(1000, d).astype("float32")  # 1,000 queries
# --- CPU version ---
index_cpu = faiss.IndexFlatL2(d)
index_cpu.add(vectors) # ~45 seconds
distances, indices = index_cpu.search( # ~8 seconds for 1000 queries
query_vectors, k=10
)
# --- GPU version ---
res = faiss.StandardGpuResources()
# Copies the already-populated CPU index to GPU — vectors are already added
index_gpu = faiss.index_cpu_to_gpu(res, 0, index_cpu) # moves to GPU 0
distances, indices = index_gpu.search( # ~1.0 seconds (8x faster than CPU)
query_vectors, k=10
)
cuVS: NVIDIA's next-gen vector search
cuVS (CUDA Vector Search) is NVIDIA's dedicated library for GPU-accelerated approximate nearest-neighbor search, now the recommended backend for large-scale deployments:
from cuvs.neighbors import cagra
import cupy as cp
# Move vectors to GPU
gpu_vectors = cp.asarray(vectors)
query_gpu = cp.asarray(query_vectors)  # queries must live on GPU too
# Build CAGRA index (graph-based ANN, ~10x faster than CPU HNSW)
index_params = cagra.IndexParams(metric="sqeuclidean")
index = cagra.build(index_params, gpu_vectors)
# Search
search_params = cagra.SearchParams(itopk_size=64)
distances, neighbors = cagra.search(
search_params, index, query_gpu, k=10
)
# Results back as cupy arrays — stay on GPU for downstream processing
When GPU vector search makes sense
| Scenario | Recommendation |
|---|---|
| Under 100K vectors | CPU Faiss or pgvector is fast enough. Don't bother. |
| 100K–10M vectors, latency <100ms | GPU Faiss or cuVS pays off. Significant speedup. |
| Batch indexing (millions of new vectors/day) | GPU indexing is 5–10x faster. Reduces index rebuild time. |
| Real-time RAG (<50ms p99) | GPU search can achieve this; CPU struggles at scale. |
| Already have GPU for inference | Run vector search on same GPU — zero extra cost. |
A practical pattern: if you're already running vLLM on a GPU for LLM inference, run your vector search on the same GPU during idle inference time. GPU utilization becomes near-100% and you get both for the price of one.
8. "Real-Time Object Detection at 30 FPS" — GPU for Vision
Computer vision workloads — object detection, image classification, video analysis — were the original GPU killer app and remain a dominant use case.
YOLOv8 on GPU: 30 FPS vs 2 FPS
from ultralytics import YOLO
import cv2
# Load YOLOv8 nano model (fastest)
model = YOLO("yolov8n.pt")
# Run on GPU
cap = cv2.VideoCapture("video.mp4")
while True:
ret, frame = cap.read()
if not ret:
break
# model auto-detects GPU and uses it
results = model(frame, device="cuda:0", verbose=False)
# Draw bounding boxes
annotated = results[0].plot()
cv2.imshow("Detection", annotated)
# GPU: ~30 FPS at 1080p with yolov8n
# CPU: ~2 FPS at 1080p with yolov8n (15x slower)
if cv2.waitKey(1) == ord("q"):
break
cap.release()
# Install
pip install ultralytics # includes PyTorch + CUDA dependencies
# Benchmark
yolo benchmark model=yolov8n.pt imgsz=640 device=0 # GPU
yolo benchmark model=yolov8n.pt imgsz=640 device=cpu # CPU
Video pipeline: full GPU processing
For production video pipelines, you want everything on the GPU — decode, preprocess, inference, and encode — to avoid the PCIe bottleneck between steps:
import torch
import torchvision.transforms as T
from torchvision.io import read_video # CPU video decode, then transfer to GPU
# Decode on CPU, then move to GPU tensor
# Requires: pip install torchvision[video]
frames, _, _ = read_video("clip.mp4", output_format="TCHW")
frames = frames.to("cuda").float() / 255.0 # (T, C, H, W) on GPU
# Batch preprocess on GPU
transform = T.Compose([
T.Resize((640, 640)),
T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
# Process all frames as a batch — no loop needed
frames_preprocessed = transform(frames) # GPU, vectorized
# Run inference on all frames at once
with torch.no_grad():
with torch.amp.autocast("cuda"): # PyTorch 2.0+ API
predictions = model(frames_preprocessed)
GPU vision use cases by industry
| Industry | Workload | Why GPU |
|---|---|---|
| Autonomous vehicles | Multi-camera real-time detection | 30 FPS × 8 cameras × 10 models simultaneously |
| Medical imaging | 3D MRI/CT segmentation | Large 3D volumes, batch processing overnight |
| Retail analytics | Store camera foot-traffic | 50+ camera streams, on-prem edge GPU |
| Satellite imagery | Change detection over regions | Terabytes of images, batch classification |
| Manufacturing QA | Defect detection on assembly line | Sub-10ms latency requirement, GPU achieves 2ms |
9. "Monte Carlo on a Million Paths" — GPU for Scientific Computing
Scientific and quantitative workloads — simulations, numerical methods, financial modeling — often map cleanly to GPU architecture because they involve the same operation applied to many independent data points.
CuPy: NumPy on GPU, drop-in replacement
import numpy as np
import cupy as cp # pip install cupy-cuda12x
import time
# CPU version
x_cpu = np.random.randn(10_000_000).astype(np.float32)
start = time.time()
result_cpu = np.fft.fft(x_cpu)
print(f"CPU FFT: {time.time() - start:.3f}s") # ~0.8s
# GPU version — identical code, just cp instead of np
x_gpu = cp.asarray(x_cpu) # copy to GPU once
start = time.time()
result_gpu = cp.fft.fft(x_gpu) # GPU computation
cp.cuda.Stream.null.synchronize()
print(f"GPU FFT: {time.time() - start:.3f}s") # ~0.012s (66x faster)
# CuPy array behaves like numpy array
print(result_gpu.shape) # (10000000,)
print(type(result_gpu)) # cupy.ndarray
result_back = cp.asnumpy(result_gpu) # back to CPU if needed
Monte Carlo option pricing: 100x faster
Monte Carlo simulation is a textbook GPU workload: simulate millions of independent random paths, then aggregate. Perfect SIMT fit — no dependencies between paths.
import cupy as cp
def monte_carlo_option_gpu(S0, K, r, sigma, T, n_paths=1_000_000, n_steps=252):
"""
European call option pricing via Monte Carlo.
S0: initial stock price
K: strike price
r: risk-free rate
sigma: volatility
T: time to expiry (years)
n_paths: number of simulation paths
n_steps: daily steps
"""
dt = T / n_steps
# Generate all random numbers at once on GPU
# Shape: (n_paths, n_steps) — each row is one simulation path
Z = cp.random.standard_normal((n_paths, n_steps), dtype=cp.float32)
# Compute all paths simultaneously (vectorized, no Python loop)
log_returns = (r - 0.5 * sigma**2) * dt + sigma * cp.sqrt(dt) * Z
# Cumulative sum along time dimension = log price path
log_prices = cp.log(S0) + cp.cumsum(log_returns, axis=1)
final_prices = cp.exp(log_prices[:, -1])
# Payoff for each path
payoffs = cp.maximum(final_prices - K, 0.0)
# Discount and average
option_price = cp.exp(-r * T) * cp.mean(payoffs)
return float(option_price)
# Price an option: S=100, K=105, r=5%, sigma=20%, T=1yr
price = monte_carlo_option_gpu(100, 105, 0.05, 0.20, 1.0)
print(f"Option price: ${price:.4f}")
# Timing (A100):
# GPU (1M paths): ~0.03 seconds
# CPU (NumPy, 1M paths): ~3.1 seconds (~100x faster on GPU)
Other scientific domains benefiting from GPU
| Domain | Workload | GPU library | Typical speedup |
|---|---|---|---|
| Molecular dynamics | Protein folding simulations | GROMACS, OpenMM | 50–100x |
| Weather modeling | Atmospheric simulation | CUDA Fortran, cuSPARSE | 10–30x |
| Computational fluid dynamics | Flow simulation, FEM | CUDA, AmgX | 20–50x |
| Seismic processing | Subsurface imaging | cuFFT, custom CUDA | 30–100x |
| Financial risk | VaR, CVA, Monte Carlo | CuPy, custom CUDA | 50–200x |
| Genomics | Sequence alignment | NVIDIA Clara Parabricks | 50x (hours → minutes) |
10. NVIDIA's Full Stack
NVIDIA is unusual in that it sells not just hardware but a tightly integrated hardware-to-application stack. Understanding the layers helps you know where your code sits and which NVIDIA tools are relevant to you.
Hardware: the GPU lineup
| GPU | Tier | VRAM | FP16 TFLOPS | Memory BW | Best use case |
|---|---|---|---|---|---|
| H100 SXM5 (current flagship) | Data center | 80 GB HBM3 | 989 | 3.35 TB/s | LLM training/inference, large-scale ML |
| B200 (2025) | Data center | 192 GB HBM3e | 2,250 | 8 TB/s | Next-gen LLM, future workloads |
| A100 | Data center | 40/80 GB HBM2e | 312 * | 2 TB/s | Most production ML today |
| L4 | Data center (inference) | 24 GB GDDR6 | 242 | 300 GB/s | Efficient inference, video transcoding |
| RTX 4090 | Desktop/workstation | 24 GB GDDR6X | 330 | 1 TB/s | Research, local LLM dev, gaming |
| T4 | Data center (budget) | 16 GB GDDR6 | 130 | 320 GB/s | Free Colab tier, affordable inference |
| Jetson Orin | Edge | 64 GB unified | 275 (INT8) | 204 GB/s | Autonomous vehicles, robotics, IoT |
* A100 312 TFLOPS FP16 figure is with 2:4 structured sparsity enabled. Dense FP16 throughput is ~77 TFLOPS. Sparsity requires the model's weight matrices to be pruned to the 2:4 sparse pattern before benefiting.
NVLink and NVSwitch: multi-GPU fabric
PCIe (the slot that connects GPU to CPU) runs at ~64 GB/s bidirectional. That's too slow when 8 GPUs need to share gradients during training. NVLink is a direct GPU-to-GPU interconnect; NVLink 4.0 provides 900 GB/s total per GPU (18 links × 50 GB/s bidirectional each). NVSwitch is a chip that creates an all-to-all mesh — any GPU can talk to any GPU at full NVLink bandwidth. An HGX H100 system (8 GPUs + NVSwitches) has 7.2 TB/s of GPU-to-GPU bandwidth, making it effectively one logical compute unit.
CUDA: the foundation
CUDA (Compute Unified Device Architecture) is NVIDIA's parallel computing platform and API, first released in 2007. It's what makes GPUs programmable for general computation (not just graphics). Everything else in the stack runs on top of CUDA.
Key insight: you almost never write CUDA directly. PyTorch, cuDF, and TensorRT have already written thousands of highly optimized CUDA kernels. You use their Python/C++ APIs and get GPU acceleration automatically.
Key libraries you'll encounter
| Library | What it does | Used by |
|---|---|---|
| cuDNN | Optimized primitives for deep learning (conv, pool, norm) | PyTorch, TensorFlow |
| cuBLAS | GPU-accelerated BLAS (matrix multiply at its core) | Everything that does linear algebra |
| cuFFT | Fast Fourier Transform on GPU | Signal processing, audio ML |
| NCCL | GPU collective comms (all-reduce, broadcast) | Multi-GPU/multi-node training |
| cuSPARSE | Sparse matrix operations on GPU | GNNs, sparse models, scientific sim |
| Thrust | GPU parallel algorithms (sort, reduce, scan) | cuDF under the hood |
| Triton (OpenAI) | Python-based custom GPU kernel language | Flash Attention, custom ops |
11. When NOT to Use a GPU
GPU cargo-culting is real. I've seen engineers reach for GPU acceleration for web scraping, small CSV processing, and microservices with 10 req/s — and make their system slower, more expensive, and harder to operate. Here's when to stay on CPU.
1. I/O-bound workloads
If your bottleneck is network latency, database queries, or disk reads, the GPU sits idle 95% of the time. You're paying for a Ferrari to wait at red lights.
Example: A web scraper that fetches 100 URLs, extracts text, calls an external API, and saves results. Bottleneck: network I/O (100–500ms per URL). GPU utilization: ~0%.
Test: Run nvidia-smi while your job is running. If GPU utilization is below 50% most of the time, you're I/O-bound.
2. Small datasets
The CPU-to-GPU PCIe transfer costs real time. For a 10 MB DataFrame, the transfer itself takes ~0.3ms, and the GPU computation might take 1ms. Loading the same 10 MB from L3 cache into CPU cores takes ~0.01ms, and the computation might take 5ms. GPU total: 1.3ms. CPU total: 5ms. On paper the GPU still wins, but the margin is thin and assumes the data is already warm; add the transfer back, kernel launch overhead, and allocator churn, and the advantage at this scale often disappears.
As a rough rule: if your dataset fits comfortably in CPU L3 cache (tens of MB), or if transfer time is more than 30% of your total compute time, GPU probably doesn't help.
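That 30% rule can be written down directly. A hypothetical helper (`worth_gpu` is a name invented here; the thresholds and timings are this section's rough figures, not measurements):

```python
def worth_gpu(transfer_ms: float, gpu_compute_ms: float,
              cpu_compute_ms: float, max_transfer_frac: float = 0.30) -> bool:
    """Rough heuristic: GPU is worth it only if it beats the CPU end-to-end
    AND PCIe transfer doesn't dominate the GPU-side time."""
    gpu_total = transfer_ms + gpu_compute_ms
    transfer_frac = transfer_ms / gpu_total
    return gpu_total < cpu_compute_ms and transfer_frac <= max_transfer_frac

# The 10 MB DataFrame example from above: marginal, but passes.
print(worth_gpu(transfer_ms=0.3, gpu_compute_ms=1.0, cpu_compute_ms=5.0))  # True

# Same transfer, trivial compute: transfer dominates, stay on CPU.
print(worth_gpu(transfer_ms=0.3, gpu_compute_ms=0.1, cpu_compute_ms=5.0))  # False
```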
3. Branch-heavy, control-flow-heavy code
Code where different data items take different paths through many if/else branches is a poor fit. Warp divergence means a warp that splits across a branch executes each taken path serially, with the threads not on that path masked off.
Example: Decision tree inference (many branches based on feature values). GPU implementations exist but are complex; CPU often wins for single-tree inference.
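A toy cost model makes the divergence penalty concrete. Treat each branch path as costing some number of cycles; under SIMT, a warp pays for every distinct path that at least one of its 32 threads takes (a simplification: real SMs interleave the masked execution, but the serialization is real):

```python
def warp_cycles(paths_taken: list[str], path_cost: dict[str, int]) -> int:
    """Cycles for one warp under SIMT: the costs of every distinct path
    taken by any thread in the warp add up (divergent paths run serially)."""
    return sum(path_cost[p] for p in set(paths_taken))

path_cost = {"then": 100, "else": 100}

uniform = ["then"] * 32                     # all 32 threads agree
divergent = ["then"] * 16 + ["else"] * 16   # warp splits 50/50

print(warp_cycles(uniform, path_cost))    # 100
print(warp_cycles(divergent, path_cost))  # 200: both paths, serially
```

With a deep decision tree, each level can split the warp again, which is why tree inference maps so poorly.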
4. Sequential algorithms
Algorithms where step N depends on the output of step N-1 resist parallelization. A linked list traversal, a recursive depth-first search, a sequential state machine: these run on a single thread and gain nothing from GPU parallelism. (A few recurrences, such as prefix sums, have parallel reformulations, but they are the exception.)
Rule of thumb: if your algorithm can't be expressed as "apply this function to all elements independently," it probably doesn't GPU-accelerate well.
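The rule of thumb is really a dependence test: does element N need element N-1's output? A dependency-free sketch contrasting the two shapes (helper names are invented for illustration):

```python
# Parallelizable: each output depends only on its own input (the "map" shape).
def embarrassingly_parallel(xs):
    return [x * x for x in xs]  # every element could go to a different GPU thread

# Sequential: each output depends on the previous output (a recurrence).
def sequential_recurrence(xs):
    state = 0
    out = []
    for x in xs:
        state = state * 2 + x  # step N needs step N-1's result
        out.append(state)
    return out

print(embarrassingly_parallel([1, 2, 3]))  # [1, 4, 9]
print(sequential_recurrence([1, 2, 3]))    # [1, 4, 11]
```

If you can rewrite your inner loop as the first shape, the GPU can help; if it's irreducibly the second shape, it can't.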
5. Microsecond-latency requirements
CUDA kernel launch has overhead: ~5–10 microseconds just to start a kernel, even a trivial one. For high-frequency trading, real-time control systems, or anything where sub-100μs response is required, this overhead is unacceptable.
HFT, for example: stays almost entirely on CPU + FPGA. The determinism and sub-microsecond latency of CPU cache-resident code is irreplaceable.
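The launch-overhead arithmetic from above: with a fixed ~5 μs cost per kernel, the fraction of time wasted depends only on how much work each kernel does.

```python
def launch_overhead_fraction(work_us: float, launch_us: float = 5.0) -> float:
    """Fraction of wall time spent launching the kernel rather than computing."""
    return launch_us / (launch_us + work_us)

for work in (5, 50, 5000):  # microseconds of GPU work per kernel
    frac = launch_overhead_fraction(work)
    print(f"{work:>5} us of work -> {frac:.0%} launch overhead")
# Trivial kernels are ~50% launch cost; big kernels amortize it to nothing.
```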
6. Sporadic, low-frequency jobs
If your batch job runs once per day and takes 15 minutes, renting a GPU for 15 minutes ($0.05) vs CPU for 2 hours ($0.20) saves $0.15/day — $55/year. Not worth the operational complexity of managing GPU instances unless you have many such jobs.
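The arithmetic behind that $55, wrapped in a reusable sketch (the rates are the illustrative ones from this paragraph, not quotes):

```python
def yearly_savings(cpu_hours: float, cpu_rate: float,
                   gpu_hours: float, gpu_rate: float,
                   runs_per_year: int = 365) -> float:
    """Dollars saved per year by moving a recurring job from CPU to GPU."""
    per_run = cpu_hours * cpu_rate - gpu_hours * gpu_rate
    return per_run * runs_per_year

# 15-minute GPU run at $0.20/hr vs 2-hour CPU run at $0.10/hr, daily:
print(f"${yearly_savings(2.0, 0.10, 0.25, 0.20):.2f}/year")  # $54.75/year
```

Weigh that number against the engineer hours spent on drivers, CUDA versions, and OOM debugging.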
Decision tree: should you use a GPU?
12. Your First GPU Program
The fastest path from zero to running GPU code, depending on your background.
Easiest: Google Colab (free, no setup)
# In Google Colab:
# Runtime → Change runtime type → T4 GPU → Save
# Verify you have a GPU
import torch
print(torch.cuda.is_available()) # True
print(torch.cuda.get_device_name(0)) # NVIDIA T4
import subprocess
subprocess.run(["nvidia-smi"]) # Shows GPU memory and utilization
Data engineer: cudf.pandas in 5 minutes
# On Colab with GPU runtime:
!pip install cudf-cu12 --extra-index-url=https://pypi.nvidia.com -q
import cudf.pandas
cudf.pandas.install()
import pandas as pd
import numpy as np
# Generate 10M row dataset
n = 10_000_000
df = pd.DataFrame({
"user_id": np.random.randint(0, 100000, n),
"amount": np.random.randn(n) * 100,
"category": np.random.choice(["A", "B", "C", "D"], n),
})
# This now runs on GPU — same pandas API
result = df.groupby(["user_id", "category"]).agg(
total=("amount", "sum"),
avg=("amount", "mean"),
count=("amount", "count"),
).reset_index()
print(result.head())
print(f"Rows in result: {len(result):,}")
ML engineer: MNIST on GPU (30 minutes)
import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, transforms
from torch.utils.data import DataLoader
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Data
transform = transforms.Compose([
transforms.ToTensor(),
transforms.Normalize((0.1307,), (0.3081,))
])
train_data = datasets.MNIST(".", train=True, download=True, transform=transform)
loader = DataLoader(train_data, batch_size=256, shuffle=True, num_workers=2)
# Model
model = nn.Sequential(
nn.Flatten(),
nn.Linear(784, 256), nn.ReLU(), nn.Dropout(0.2),
nn.Linear(256, 128), nn.ReLU(), nn.Dropout(0.2),
nn.Linear(128, 10)
).to(device)
optimizer = optim.Adam(model.parameters())
criterion = nn.CrossEntropyLoss()
# Train 5 epochs
for epoch in range(5):
correct = 0
for x, y in loader:
x, y = x.to(device), y.to(device) # ← GPU transfer
optimizer.zero_grad()
out = model(x)
loss = criterion(out, y)
loss.backward()
optimizer.step()
correct += (out.argmax(1) == y).sum().item()
print(f"Epoch {epoch+1}: {correct/len(train_data)*100:.1f}% accuracy")
# On GPU: ~15 seconds total. On CPU: ~3 minutes.
LLM engineer: vLLM quickstart
# Requires GPU with at least 16 GB VRAM (A10G, A100, or L4)
pip install vllm
# Serve Mistral 7B
python -m vllm.entrypoints.openai.api_server \
--model mistralai/Mistral-7B-Instruct-v0.3 \
--dtype bfloat16 \
--max-model-len 4096
# Test it
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"mistralai/Mistral-7B-Instruct-v0.3",
"messages":[{"role":"user","content":"What is CUDA?"}],
"max_tokens":200}'
Cloud options for getting a GPU
| Platform | GPU | Price | Best for |
|---|---|---|---|
| Google Colab | T4 (free), A100 (Pro) | Free / $10/mo | Learning, prototyping |
| Kaggle Notebooks | T4 / P100 | Free (30 hrs/wk) | Learning, competitions |
| Lambda Labs | A10, A100, H100 | $0.50–$3.50/hr | Training, serious dev |
| Vast.ai | Wide variety | $0.20–$2/hr | Budget training |
| AWS (g4dn.xlarge) | T4 | $0.53/hr on-demand, $0.16 spot | Production inference |
| GCP (a2-highgpu-1g) | A100 40 GB | $3.67/hr on-demand, $1.10 spot | Production training |
13. GPU Memory Management
The number one source of production GPU headaches. Unlike system RAM (which can use swap to extend beyond physical limits), GPU VRAM is strictly bounded. Exceed it and you get an out-of-memory error, not a slowdown.
The error every ML engineer knows
# The most common GPU error you will encounter:
# RuntimeError: CUDA out of memory. Tried to allocate 2.00 GiB
# (GPU 0; 15.90 GiB total capacity; 13.47 GiB already allocated)
This happens because you tried to allocate more memory than the GPU has available. Common causes: batch size too large, model too large for your GPU, memory not being freed from previous iterations, or multiple processes sharing the same GPU.
How GPU memory gets used in a training job
import torch
# Check current GPU memory usage
def print_memory_stats():
allocated = torch.cuda.memory_allocated() / 1e9
reserved = torch.cuda.memory_reserved() / 1e9
print(f"Allocated: {allocated:.2f} GB | Reserved: {reserved:.2f} GB")
# Memory is consumed by:
# 1. Model parameters (fixed, based on model size)
# 2. Gradients (same size as parameters, during backward)
# 3. Optimizer states (2x parameters for Adam: m + v)
# 4. Activations (proportional to batch size × model depth)
# 5. PyTorch's own allocator overhead
# For a 7B parameter model in FP16:
# Parameters: 7B × 2 bytes = 14 GB
# Gradients: = 14 GB
# Adam optimizer: = 28 GB (2 states × 14 GB)
# Total minimum: = 56 GB ← why H100 80GB is popular
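The breakdown in those comments can be wrapped into a rough estimator. This sketch counts only parameters, gradients, and Adam states in a single dtype; activations (which scale with batch size and depth) and allocator overhead come on top:

```python
def training_vram_gb(params: float, bytes_per_param: int = 2,
                     optimizer_states: int = 2) -> float:
    """Minimum VRAM for weights + gradients + optimizer states, in GB.
    Ignores activations and allocator overhead."""
    weights = params * bytes_per_param
    grads = params * bytes_per_param
    optim = optimizer_states * params * bytes_per_param  # Adam: m and v
    return (weights + grads + optim) / 1e9

print(f"7B in FP16 + Adam:  {training_vram_gb(7e9):.0f} GB")    # 56 GB
print(f"70B in FP16 + Adam: {training_vram_gb(70e9):.0f} GB")   # 560 GB
```

The 70B figure shows why large-model training is always multi-GPU: no single card comes close.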
Strategies to reduce memory usage
import torch
from torch.amp import autocast # PyTorch 2.0+ API
# Strategy 1: Reduce batch size (most direct lever)
# batch_size = 256 → OOM
# batch_size = 32 → works. Use gradient accumulation to simulate large batch:
accumulation_steps = 8
optimizer.zero_grad()
for i, (x, y) in enumerate(loader):
x, y = x.to(device), y.to(device)
    with autocast("cuda"):  # defaults to FP16; pair with torch.amp.GradScaler in real training
output = model(x)
loss = criterion(output, y) / accumulation_steps
loss.backward()
if (i + 1) % accumulation_steps == 0:
optimizer.step()
optimizer.zero_grad()
# Strategy 2: Mixed precision (halves activation memory)
# FP32 → FP16: 2x memory reduction for activations
with autocast("cuda", dtype=torch.bfloat16):
output = model(x)
# Strategy 3: Gradient checkpointing (trade compute for memory)
# Recomputes activations during backward instead of storing them
# ~30% slower but saves 60-70% activation memory
from torch.utils.checkpoint import checkpoint
output = checkpoint(model.layers[0], x, use_reentrant=False)  # per layer
# For transformer models (HuggingFace):
model.gradient_checkpointing_enable()
# Strategy 4: Clear cache between phases
torch.cuda.empty_cache() # releases unused reserved memory (not allocated)
del tensor_you_no_longer_need # explicit deletion + Python GC
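Why dividing the loss by accumulation_steps works: the accumulated micro-batch gradients then sum to exactly the gradient of one large batch. A dependency-free check using a hand-derived least-squares gradient (d/dw of mean((w*x - t)^2) is mean(2*x*(w*x - t)); equal micro-batch sizes assumed):

```python
def grad_mse(w, xs, ts):
    """Gradient of mean squared error for the model y = w*x, derived by hand."""
    n = len(xs)
    return sum(2 * x * (w * x - t) for x, t in zip(xs, ts)) / n

w = 0.5
xs = [1.0, 2.0, 3.0, 4.0]
ts = [2.0, 4.0, 6.0, 8.0]

full = grad_mse(w, xs, ts)  # one big batch of 4

# Two micro-batches of 2; each contributes grad / accumulation_steps,
# mirroring `loss = criterion(...) / accumulation_steps` above.
accum = 0.0
for lo in range(0, 4, 2):
    accum += grad_mse(w, xs[lo:lo+2], ts[lo:lo+2]) / 2

print(full, accum)  # -22.5 -22.5
```

Same gradient, a fraction of the activation memory: only one micro-batch's activations live at a time.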
Monitoring GPU memory
# Watch GPU stats in terminal (refresh every 0.5s)
watch -n 0.5 nvidia-smi
# One-liner for memory summary
nvidia-smi --query-gpu=memory.used,memory.free,utilization.gpu \
--format=csv
# Install gpustat for a nicer view
pip install gpustat
gpustat --watch # colored, per-process breakdown
# Detailed per-tensor memory breakdown (great for debugging OOM)
print(torch.cuda.memory_summary(device=None, abbreviated=False))
# Profile memory during training to find the leak
with torch.profiler.profile(
activities=[torch.profiler.ProfilerActivity.CUDA],
profile_memory=True
) as prof:
output = model(x)
loss = criterion(output, y)
loss.backward()
print(prof.key_averages().table(sort_by="cuda_memory_usage"))
PCIe bottleneck: minimize CPU↔GPU transfers
# BAD: transferring small tensors one at a time
for item in large_list:
tensor = torch.tensor(item).to("cuda") # PCIe transfer per item
result = model(tensor)
# GOOD: batch your transfers
batch = torch.tensor(large_list) # one large CPU tensor
batch_gpu = batch.to("cuda") # one transfer: pays the fixed overhead once, not len(large_list) times
results = model(batch_gpu)
# GOOD: pin memory for faster CPU→GPU transfers
loader = DataLoader(
dataset,
batch_size=256,
pin_memory=True, # allocates page-locked memory (~1.5x faster transfers)
num_workers=4, # load CPU data in background while GPU computes
)
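The "batch your transfers" advice is really about fixed per-transfer cost. A sketch with illustrative numbers (~10 μs of setup per copy and 32 GB/s of usable PCIe bandwidth are assumptions, not measurements):

```python
def copy_time_us(num_bytes: float, per_copy_overhead_us: float = 10.0,
                 bandwidth_gb_s: float = 32.0) -> float:
    """Time for one host-to-device copy: fixed setup cost + bytes / bandwidth."""
    return per_copy_overhead_us + num_bytes / (bandwidth_gb_s * 1e9) * 1e6

item_bytes = 4_096
n_items = 10_000

one_at_a_time = n_items * copy_time_us(item_bytes)  # pays overhead 10,000 times
batched = copy_time_us(item_bytes * n_items)        # pays it once

print(f"{one_at_a_time/1000:.1f} ms vs {batched/1000:.3f} ms")
```

With these numbers the batched copy is roughly 80x faster for the same 40 MB of data, which is why per-item `.to("cuda")` calls inside a loop are the classic PCIe antipattern.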
Beyond one GPU: parallelism strategies
- Data parallelism (DDP): the same model replicated on each GPU, different data shards; gradients are averaged each step. Requires the model to fit on one GPU.
- Tensor parallelism: individual weight matrices split across GPUs. Used by Megatron-LM. Complex to implement.
- Pipeline parallelism: different layers run on different GPUs, with micro-batches flowing through the pipeline. Supported by DeepSpeed's pipeline engine and Megatron. (DeepSpeed ZeRO is a different technique: it shards optimizer state, gradients, and parameters across data-parallel workers.)
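The heart of data parallelism is an all-reduce that averages gradients so every replica takes an identical optimizer step. A dependency-free sketch of that arithmetic (NCCL performs it ring-wise over NVLink, but the result is the same):

```python
def all_reduce_mean(per_worker_grads: list[list[float]]) -> list[float]:
    """Average gradients elementwise across workers, as an all-reduce
    followed by division by world size would."""
    world = len(per_worker_grads)
    return [sum(g) / world for g in zip(*per_worker_grads)]

# 4 workers, each holding a gradient for a 3-parameter model:
grads = [
    [0.1, 0.4, -0.2],
    [0.3, 0.0, -0.4],
    [0.1, 0.2,  0.2],
    [0.1, 0.2,  0.0],
]
print([round(v, 2) for v in all_reduce_mean(grads)])  # [0.15, 0.2, -0.1]
```

Because every worker ends up with the same averaged gradient, the replicas never drift apart, which is what makes DDP conceptually simple compared with tensor or pipeline parallelism.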
14. Cost & ROI
Before renting a GPU, you need numbers. Here are real ones.
GPU cloud pricing (2026)
| GPU | VRAM | On-demand ($/hr) | Spot ($/hr) | 1-yr reserved ($/hr) | Provider |
|---|---|---|---|---|---|
| T4 | 16 GB | $0.35–0.53 | $0.10–0.16 | $0.22 | Lambda, AWS, GCP |
| A10G | 24 GB | $0.75–1.10 | $0.25–0.40 | $0.45 | AWS, Lambda |
| A100 (40 GB) | 40 GB | $2.00–3.00 | $0.60–1.00 | $1.20 | Lambda, GCP, AWS |
| A100 (80 GB) | 80 GB | $2.50–4.00 | $0.75–1.50 | $1.50 | Lambda, Azure, GCP |
| H100 (80 GB) | 80 GB | $3.50–7.00 | $1.50–3.00 | $2.00–2.50 | CoreWeave, Lambda, Azure |
| H100 (8x DGX) | 640 GB | $28–56/hr | $12–24/hr | $16–20/hr | CoreWeave, GCP, Azure |
ROI scenarios
| Workload | CPU cost | GPU cost | GPU savings | Break-even |
|---|---|---|---|---|
| LLM inference (1M tokens/day, Llama 8B) | $12.00/day (CPU cluster) | $0.50/day (T4, spot) | 96% cheaper | Immediate |
| Data processing (1 TB/day ETL) | $18/day (large CPU instance) | $3/day (8x T4 spot, 45 min) | 83% cheaper | Immediate |
| ML training (1B param model) | $240 (CPU, 4 days) | $18 (A100, 6 hrs) | 92% cheaper | Immediate |
| Batch image processing (10M images/day) | $8/day | $1.50/day | 81% cheaper | Immediate |
| Small analytics job (10 GB, twice/week) | $0.40/week (c5.2xlarge) | $0.80/week (G4dn, setup overhead) | CPU is 2x cheaper | Never |
When GPU is NOT worth it (cost perspective)
- Jobs that run <2 hours/week: operational overhead (driver versions, CUDA compatibility, OOM debugging) costs more in engineer time than the savings.
- Data under 500 MB: a fast CPU instance is often cheaper and simpler.
- Highly variable, unpredictable workloads: GPU reserved pricing is only worthwhile at high utilization (70%+).
- Team has no GPU expertise: onboarding cost. If nobody on the team has debugged a CUDA OOM error, factor in learning time.
Cost optimization tips
- Spot instances: 50–70% off on-demand. Use for training (can checkpoint), not for always-on inference.
- Right-size your GPU: a T4 at $0.35/hr often handles batch inference jobs that engineers assume need an A100. Test before paying 10x more.
- Batch requests: GPU throughput is highest with large batches. A server handling 1 req/s at batch_size=1 might use 5% GPU utilization. Batch 20 requests together and you use 90% at the same cost.
- Quantization: INT8 models run in half the VRAM at roughly twice the throughput of FP16. Use LLM.int8() via bitsandbytes or TensorRT INT8 for inference.
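The quantization arithmetic, for weights only (KV cache and runtime overhead add several GB on top):

```python
def weights_vram_gb(params: float, bits: int) -> float:
    """VRAM for model weights alone at a given precision."""
    return params * bits / 8 / 1e9

for bits, name in [(16, "FP16/BF16"), (8, "INT8"), (4, "INT4")]:
    print(f"70B @ {name}: {weights_vram_gb(70e9, bits):.0f} GB")
# 140 GB -> 70 GB -> 35 GB: halving precision halves VRAM, which is why
# a 70B model in INT4 fits on a single 48 GB card.
```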
15. The Vera Rubin Generation (GTC 2026)
NVIDIA's GTC 2026 announcements define the direction for the next 2–3 years. Here's what's coming and what it means for you.
Blackwell vs. Vera Rubin: architecture timeline
These three product names are often conflated — here's the precise breakdown:
- B200 — Blackwell GPU die (standalone data center GPU, 192 GB HBM3e). Current flagship, shipping 2025.
- GB200 — Grace Blackwell: a single package combining the B200 GPU die with a Grace ARM CPU. Eliminates PCIe bottleneck — unified memory pool at ~900 GB/s CPU↔GPU bandwidth instead of ~32 GB/s via PCIe.
- Vera Rubin — the next architecture after Blackwell (announced GTC 2026, shipping H2 2026–2027). Vera Rubin succeeds Blackwell the way Blackwell succeeded Hopper (H100). Its GPU die (~2,250 TFLOPS FP16) will appear in both standalone and Grace-paired (GR200) form factors.
The 192 GB HBM3e VRAM spec applies to the B200 (standalone) and the GB200 (Grace Blackwell) package today. Vera Rubin will bring further capacity and bandwidth increases.
BlueField-4 DPU
The BlueField-4 DPU (Data Processing Unit) offloads networking and security operations from the CPU/GPU. In a dense GPU cluster, a significant portion of CPU time is consumed by network protocol handling, encryption, and storage I/O. DPUs move this to dedicated silicon, freeing the GPU and CPU for computation. For ML clusters, this means higher effective GPU utilization at scale.
NVLink at rack scale
With NVLink Switch Systems, NVIDIA is scaling from 8 GPUs connected (HGX today) to entire racks and eventually buildings of GPUs acting as a unified compute fabric. The NVLink 5 system (announced 2026) connects up to 576 GPUs with 1.8 TB/s per-GPU bandwidth — enough to train models too large to fit on any single cluster today.
"Huang's Law"
Unlike Moore's Law (transistor density), Huang's Law describes GPU performance gains from three compounding factors: new compute architectures, increased memory bandwidth, and improved interconnects. NVIDIA has delivered roughly 1,000x AI performance improvement per decade — faster than traditional semiconductor scaling.
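As a sanity check, 1,000x per decade compounds to roughly 2x per year, comfortably ahead of classic CPU scaling:

```python
# 1,000x in 10 years -> annual growth factor is the 10th root of 1,000
annual = 1000 ** (1 / 10)
print(f"Huang's Law: {annual:.2f}x per year")   # ~2.00x per year

# Moore's-Law-era CPU scaling: ~2x every 18 months, expressed per year
moore = 2 ** (12 / 18)
print(f"Moore's Law: {moore:.2f}x per year")    # ~1.59x per year
```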
Platform integration: where GPUs are appearing
| Platform | GPU integration | For you |
|---|---|---|
| Snowflake | GPU-accelerated queries (RAPIDS inside) | SQL queries run on GPU, no config change |
| Databricks | GPU clusters for ML, Spark RAPIDS | Check "GPU enabled" when creating cluster |
| Google BigQuery | GPU-accelerated ML functions | ML.PREDICT can use GPU backend |
| AWS SageMaker | Managed GPU training + inference | Select GPU instance type in training job config |
| Azure ML | GPU compute clusters, NVIDIA AI Enterprise | Standard for enterprise ML in Azure ecosystem |
The implication for your career: GPU skills are transitioning from "ML specialist" to "data engineer and backend engineer baseline knowledge." The platforms you already use are quietly adding GPU acceleration under the hood, and understanding the model helps you use them effectively.
16. Key Terminology Glossary
Compute concepts
| Term | Definition |
|---|---|
| CUDA core | Basic FP32 arithmetic unit on NVIDIA GPU. H100 has 16,896. Not the same as a CPU core — much simpler, designed for throughput not latency. |
| Tensor Core | Specialized unit for matrix multiply-accumulate in mixed precision (FP16/BF16/INT8/FP8). Much faster than CUDA cores for deep learning. H100 has 528 Tensor Cores. |
| SM (Streaming Multiprocessor) | The major organizational unit of a GPU — like a CPU core, but with many more arithmetic units. H100 has 132 SMs, each with 128 CUDA cores. |
| Warp | Group of 32 threads that execute the same instruction in lock-step (SIMT). The fundamental scheduling unit of a GPU SM. |
| Block | Programmer-defined group of 1–1,024 threads that share shared memory and can synchronize. Scheduled to run on a single SM. |
| Grid | The full set of blocks launched for a single kernel call. Can contain millions of blocks. |
| Kernel | A function executed on the GPU by many threads simultaneously. Written in CUDA C/C++ or Triton, called from host (CPU) code. |
| Occupancy | Ratio of active warps to maximum possible warps on an SM. Higher occupancy generally means better latency hiding. Low occupancy = underutilized GPU. |
| Kernel fusion | Combining multiple GPU operations into a single kernel to reduce memory round-trips. Key optimization in TensorRT and FlashAttention. |
Memory concepts
| Term | Definition |
|---|---|
| HBM (High Bandwidth Memory) | 3D-stacked DRAM used as GPU VRAM. HBM3 achieves 3.35 TB/s bandwidth — 10x+ faster than DDR5 system RAM. |
| Shared memory | Fast on-chip SRAM shared by all threads in a block. Up to ~228 KB per SM on H100 (configurable split with L1). The programmer controls what goes here. Think of it as a manually managed L1 cache. |
| Pinned memory | CPU RAM that is page-locked and not swappable. CPU→GPU transfers from pinned memory are ~2x faster than pageable memory. |
| Memory coalescing | When 32 threads in a warp access consecutive memory addresses, they're served in a single memory transaction. Non-coalesced access (scattered addresses) requires multiple transactions — major performance penalty. |
| NVLink | Direct GPU-to-GPU interconnect, bypassing PCIe. NVLink 4.0: 900 GB/s. Used in multi-GPU training to share gradients efficiently. |
| NVSwitch | Switch chip that creates all-to-all NVLink topology in DGX systems. Every GPU can communicate with every other GPU at full NVLink speed. |
| PCIe | Standard bus connecting GPU to CPU. Gen 5: 64 GB/s. The bottleneck for CPU↔GPU data transfer: roughly 14x slower than NVLink 4.0's 900 GB/s. |
Precision formats
| Format | Bits | Range | Use case |
|---|---|---|---|
| FP64 | 64 | ±10^308 | Scientific computing where precision matters. CPUs are great; GPUs are slower. |
| FP32 | 32 | ±10^38 | Default for training. Safe, less likely to overflow or underflow. |
| TF32 | 19 (internally) | Same as FP32 | NVIDIA internal format for Tensor Cores. Auto-used in PyTorch by default. Same accuracy as FP32, 3x faster. |
| BF16 | 16 | Same range as FP32 | Preferred for LLM training. Same exponent as FP32 (no overflow), lower mantissa precision. Supported on A100+. |
| FP16 | 16 | ±65,504 | Mixed precision training on older GPUs (V100, T4). Smaller range than BF16 — requires gradient scaling to prevent underflow. |
| FP8 | 8 | Limited | Inference quantization. H100/H200 support FP8 Tensor Cores natively. ~2x speed vs FP16 with careful calibration. |
| INT8 | 8 | -128 to 127 | Post-training quantization for inference. Up to 4x throughput vs FP32. Requires calibration dataset. TensorRT INT8 is standard. |
| INT4 | 4 | -8 to 7 | Aggressive quantization (llama.cpp, GGUF format). Fits 70B model on 48 GB VRAM. ~1% accuracy loss vs FP16. |
Performance metrics
| Term | Definition |
|---|---|
| FLOPS | Floating-Point Operations Per Second. Measures compute throughput. |
| TFLOPS | Tera-FLOPS = 10^12 FLOPS. H100: 67 TFLOPS (FP32), 989 TFLOPS (TF32), 1,979 TFLOPS (FP16). |
| PFLOPS | Peta-FLOPS = 10^15 FLOPS. Top500 supercomputer benchmark uses PFLOPS. A DGX H100 (8 GPUs): ~32 PFLOPS (FP16). |
| MFU (Model FLOPs Utilization) | Percentage of theoretical peak FLOPS your training actually achieves. 40–60% MFU is good. Lower means memory-bound or poorly batched. |
| Quantization | Reducing numeric precision of model weights (FP32 → INT8 → INT4) to shrink VRAM usage and increase throughput. Slight accuracy trade-off. |
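MFU is worth computing for any training job. A sketch using the common ~6 × params FLOPs-per-token approximation for transformer training (the throughput and peak figures below are illustrative, not benchmarks):

```python
def mfu(params: float, tokens_per_s: float, peak_tflops: float,
        flops_per_token_factor: float = 6.0) -> float:
    """Model FLOPs Utilization: achieved training FLOPs / theoretical peak.
    Uses the ~6 * params FLOPs-per-token approximation for transformers."""
    achieved = flops_per_token_factor * params * tokens_per_s
    return achieved / (peak_tflops * 1e12)

# 7B model at 4,000 tokens/s on one GPU with a 989 TFLOPS peak:
print(f"MFU: {mfu(7e9, 4_000, 989):.0%}")  # 17%
```

A result well under the 40–60% "good" range from the table usually means the job is memory-bound, poorly batched, or stalling on data loading.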
17. Learning Roadmap
A structured path from "never used a GPU" to "production GPU engineer," organized by week with concrete deliverables.
Week 1: First contact — understand the model
- Set up Google Colab with GPU runtime. Run nvidia-smi, check CUDA version.
- Run the MNIST training example from Section 12. Measure GPU vs CPU time.
- Experiment: change batch size from 32 to 1024. What happens to GPU utilization? Speed?
- Read the CUDA Programming Guide's first 3 chapters (free on NVIDIA docs). Just for mental model, no coding required.
- Deliverable: A training loop that runs on both CPU and GPU, with timing comparison printout.
Week 2: Real workloads — memory and performance
- Train a real model: fine-tune distilbert-base-uncased on a sentiment dataset (HuggingFace Trainer makes this ~20 lines).
- Intentionally OOM your GPU: set batch_size very high. Read the error. Fix it with gradient accumulation.
- Enable mixed precision (torch.amp). Measure speed and memory difference vs FP32.
- Run torch.cuda.memory_summary() and understand the output.
- Deliverable: Fine-tuned sentiment model saved to disk, training script with AMP and memory profiling.
Week 3: Data and inference acceleration
- Install cuDF on Colab. Run a groupby on a 10M row DataFrame. Compare to pandas timing.
- Deploy a 7B model with vLLM on a rented GPU (Lambda Labs A10G = $0.75/hr). Benchmark tokens/sec.
- Enable weight quantization in vLLM (--quantization awq; note AWQ is 4-bit, not INT8). Compare memory usage and throughput vs BF16.
- Cross-reference: read the PyTorch Refresher for training patterns, LLMs Refresher for inference architecture.
- Deliverable: vLLM server running, OpenAI-compatible endpoint tested, performance numbers documented.
Week 4: Multi-GPU and production readiness
- Run a DDP (DistributedDataParallel) training job on 2 GPUs using torchrun. Measure scaling efficiency (how close you get to 2x).
- Explore TensorRT: export a PyTorch model, build a TRT engine, benchmark vs PyTorch eager mode.
- Cost exercise: calculate the break-even point for GPU vs CPU for your most compute-intensive work task.
- Study the "When NOT to Use" section critically: identify 3 things in your current stack where someone might cargo-cult a GPU and explain why it wouldn't help.
- Deliverable: Written analysis: "Should we use GPU for [X at your job]?" with numbers.
Resources
| Resource | Format | Best for |
|---|---|---|
| NVIDIA DLI: Fundamentals of Accelerated Computing | Interactive course (~8 hrs) | Hands-on CUDA C from scratch |
| Programming Massively Parallel Processors (Kirk & Hwu) | Textbook | Deep architecture understanding |
| fast.ai Practical Deep Learning | Free course | Applied ML with GPU, top-down teaching style |
| Andrej Karpathy: Zero to Hero | YouTube series | Build GPT from scratch, includes GPU training |
| Triton Tutorials | Code + docs | Write custom GPU kernels in Python (no CUDA C) |
| PyTorch Official Tutorials | Interactive | PyTorch fundamentals with GPU examples |
| NVIDIA GTC recordings (YouTube) | Talks (30–60 min each) | Latest techniques, architecture announcements |
For most engineers, the first GPU win is as simple as adding .to("cuda") to a PyTorch training loop. Start with the practical use cases (Sections 4–8) that apply to your current work. Come back to the architecture sections when you hit performance problems and need to diagnose why.