1. Why Accelerated Computing?

For decades, your code got faster for free. Every 18 months, the new CPU was 2x as fast. You could ship slower code and rely on hardware to bail you out. That era ended around 2005.

CPU clock speeds hit a wall at roughly 4 GHz. Two fundamental limits collided: power density (faster clocks burn more power, which melts chips) and the end of Dennard scaling (the property that transistors used less power as they shrank stopped holding at small node sizes). Intel abandoned its 4 GHz Pentium 4 plans in 2004, and two decades later sustained clock speeds have only crept into the 5 GHz range.

The answer the industry landed on: do many things at once instead of one thing faster. CPUs added more cores (2, 4, 8, 64). But GPUs took this to an extreme: thousands of simpler cores working in parallel.

The core insight: latency vs. throughput

CPUs are optimized for latency — getting one task done as fast as possible. They have large caches, branch predictors, out-of-order execution units, and speculative execution — all machinery that reduces how long a single thread waits.

GPUs are optimized for throughput — doing as many tasks simultaneously as possible. They sacrifice single-thread speed for the ability to run thousands of threads at once, all operating on different data.

Think of it this way: a CPU is a few elite sprinters. A GPU is a stadium of ordinary runners — slower individually, but together they cover vastly more ground.

CPU vs GPU: Die Area Usage

  CPU (e.g., Intel Core i9)          GPU (e.g., H100 SXM5)

  ┌──────┐ ┌──────┐                  ┌──────────────────┐
  │Core 0│ │Core 1│                  │ ████ ████ ████   │
  │ L1/L2│ │ L1/L2│                  │ ████ ████ ████   │
  └──────┘ └──────┘                  │ ████ ████ ████   │
  ┌──────┐ ┌──────┐                  │ ████ ████ ████   │
  │Core 2│ │Core 3│   Large L3 →     │ ████ ████ ████   │
  │ L1/L2│ │ L1/L2│   ┌────────┐     │ ████ ████ ████   │
  └──────┘ └──────┘   │ Cache  │     │ ████ ████ ████   │
  ┌──────┐ ┌──────┐   │ (64MB) │     │ ████ ████ ████   │
  │Core 4│ │Core 5│   └────────┘     │ (16,896 cores)   │
  └──────┘ └──────┘                  └──────────────────┘

  Most die area: cache + logic       Most die area: COMPUTE
  8–64 fast, complex cores           Thousands of simple cores
  Branch predict, out-of-order       In-order, SIMT

CPU vs GPU at a glance

Property | Modern CPU (e.g., AMD EPYC) | Modern GPU (e.g., H100 SXM5)
Core count | 8–128 cores | 2,000–20,000 CUDA cores
Clock speed | 3–5 GHz | 1–2 GHz
Memory bandwidth | 50–300 GB/s (DDR5) | 500–3,350 GB/s (HBM2e/HBM3)
Memory capacity | up to 6 TB (server) | 24–192 GB (VRAM)
Peak FLOPS | ~2 TFLOPS (FP32) | ~67 TFLOPS (FP32 CUDA cores); 989 TFLOPS (TF32 Tensor Cores, with sparsity) on H100 SXM5
Best for | Sequential, branchy, latency-sensitive | Parallel, data-parallel, throughput
Programmed with | Any language | CUDA/HIP/SYCL or high-level frameworks

Jensen Huang, NVIDIA's CEO, declared at GTC 2026: "All computing will be accelerated computing." Whether or not you believe the absolutism, the direction is clear. ML training, data pipelines, LLM inference, vector search — workloads that used to be CPU-bound are routinely 10–100x faster on GPU today.

The practical implication
If your job involves data processing, machine learning, LLM serving, scientific simulation, or real-time video — GPU skills are becoming table-stakes. This page explains what you actually need to know to start using them.

2. GPU vs CPU Architecture

You know CPUs. Let's build GPU understanding from that foundation.

What makes a CPU fast for your typical code

When you call a function that does an if/else based on a database result, then calls another function based on that result — this is a control-flow-heavy, data-dependent workload. CPUs are engineered for exactly this pattern:

  • Branch predictors that guess which way the if will go before the data arrives
  • Out-of-order and speculative execution that keep working past stalls
  • Large multi-level caches that hide memory latency

All of this silicon is in service of making one thread run fast.

What a GPU is actually doing

A GPU is built around a different assumption: you have a regular, data-parallel problem. The same operation applied to millions of independent data points. "Add these two arrays." "Multiply these matrices." "Apply this activation function to these 65,536 values."

GPUs strip out most of the complexity that makes CPUs fast at sequential code, and put that transistor budget into more compute units instead.

Key GPU architectural concepts you need to know:

Streaming Multiprocessors (SMs)

An H100 has 132 SMs. Each SM is something like a mini-CPU with its own set of CUDA cores, registers, shared memory, and schedulers. Think of them as departments in a very large company — each runs somewhat independently.

SIMT: Single Instruction, Multiple Threads

Within an SM, threads are grouped into warps of 32. All 32 threads in a warp execute the same instruction at the same time, but each on different data. This is SIMT: Single Instruction, Multiple Threads.

If thread 0 is adding element 0 of array A to element 0 of array B, thread 1 is simultaneously adding element 1 to element 1, ..., thread 31 is adding element 31. One instruction, 32 results produced in parallel.

Warp divergence: why branching hurts on GPUs

This is the single most important thing to understand about GPU performance.

Since 32 threads share an instruction, what happens when some threads need to take an if branch and others need to take the else? Both paths execute. The threads that aren't supposed to be on a given path are masked off (their results are discarded), but the time is spent. You get worst-case performance: sequential execution of both paths, zero parallelism.

Warp divergence — the GPU performance killer
Code with many if/else branches where different threads take different paths is a bad fit for GPUs. The 32 threads in a warp must execute the same instruction. Divergence means serialization.
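The standard fix is branch-free (predicated) code: compute both outcomes for every element and select with a mask, so all 32 threads execute the same instructions. A minimal sketch of the pattern using NumPy's where (CuPy exposes the identical API on GPU); the leaky-ReLU function here is just an illustrative element-wise operation:

```python
import numpy as np  # with CuPy installed, `import cupy as np` runs this on GPU

x = np.array([-2.0, -1.0, 0.5, 3.0], dtype=np.float32)

# Divergent style: a per-element if/else. On a GPU, threads in a warp
# would take different paths and serialize.
def leaky_relu_branchy(v, alpha=0.1):
    return np.array([e if e > 0 else alpha * e for e in v], dtype=np.float32)

# Branch-free style: both results are computed for every element and a
# mask selects one — every thread executes the same instruction stream.
def leaky_relu_branchfree(v, alpha=0.1):
    return np.where(v > 0, v, alpha * v).astype(np.float32)
```

Vectorized library calls (torch.where, cp.where, masked pandas operations) follow the branch-free pattern internally, which is a big part of why "stay vectorized" is the first rule of GPU-friendly code.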

Memory hierarchy: bandwidth over latency

GPUs have a very different memory structure than CPUs:

GPU Memory Hierarchy (H100 example)              Latency       Bandwidth

  Registers (per-thread)                         ~1 cycle      ~20 TB/s
  ┌───────────────────────────────┐
  │ Thread 0 │ Thread 1 │ ...     │   64K 32-bit registers per SM
  └───────────────────────────────┘

  Shared Memory / L1 Cache (per-SM)              ~5 cycles     ~12 TB/s
  ┌───────────────────────────────┐
  │ Shared across warp/block      │   up to 228 KB configurable
  └───────────────────────────────┘   (your most important knob)

  L2 Cache (chip-wide)                           ~40 cycles    ~12 TB/s
  ┌───────────────────────────────┐
  │ All SMs share this            │   50–60 MB
  └───────────────────────────────┘

  HBM (Global Memory / VRAM)                     ~600 cycles   3.35 TB/s
  ┌───────────────────────────────┐
  │ Main GPU memory, all threads  │   80 GB (H100 SXM5)
  └───────────────────────────────┘

  System RAM (host memory, via PCIe)                           ~32 GB/s
  ┌───────────────────────────────┐
  │ CPU's memory, must transfer   │   ~1 TB (server)
  └───────────────────────────────┘   ← the bottleneck to watch

The key takeaway: GPU global memory (HBM) has enormous bandwidth — 3.35 TB/s on an H100. Your CPU can barely manage 200 GB/s on a high-end server. This is why matrix multiplications are so fast on GPU: they need to read a lot of data quickly, and HBM delivers it.

But the CPU-to-GPU transfer over PCIe is slow (~32 GB/s). Every time you move data from RAM to VRAM or back, you pay this cost. It's the primary reason small datasets don't benefit from GPU acceleration: the transfer overhead exceeds the compute savings.
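A back-of-envelope calculation makes the break-even concrete, using the ~32 GB/s figure above (illustrative arithmetic, not a benchmark):

```python
def pcie_transfer_seconds(n_bytes, pcie_bytes_per_sec=32e9):
    """Time to move data one way between host RAM and VRAM over PCIe."""
    return n_bytes / pcie_bytes_per_sec

# Moving a 1 GB array to the GPU and the result back costs ~62 ms
# before a single FLOP of useful work happens:
round_trip = 2 * pcie_transfer_seconds(1e9)   # 0.0625 s
# If GPU compute saves less than that versus the CPU, acceleration loses.
print(f"1 GB round trip: {round_trip * 1000:.1f} ms")
```

This is why the practical advice is always "move data once, then do as much work as possible on the device."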

Property | CPU Core (Zen 4) | GPU SM (H100)
Per-core clock | ~5 GHz | ~1.8 GHz
L1 cache | 32–64 KB | 256 KB (shared/L1)
Branch prediction | Yes, very good | None (warps)
Out-of-order exec | Yes | No
Threads per unit | 1 (+ hyperthreading) | 2,048 concurrent threads per SM
Designed for | Low-latency, single-thread speed | High-throughput, data-parallel work

3. The CUDA Programming Model

You almost certainly will never write raw CUDA code in your career. But understanding the mental model is what lets you reason about performance, debug memory errors, and pick the right tool for your problem. Think of this as reading a map before you drive.

The execution hierarchy: Grid → Block → Thread → Warp

When you launch a GPU computation, you're launching a kernel: a function that executes on the GPU, run by thousands of threads simultaneously.

Those threads are organized in a hierarchy:

CUDA Execution Hierarchy

┌─────────────────────────────────────────┐
│ GRID (one kernel launch)                │
│                                         │
│  ┌──────────────┐  ┌──────────────┐     │
│  │ Block (0,0)  │  │ Block (1,0)  │ ... │
│  │ ┌──────────┐ │  │ ┌──────────┐ │     │
│  │ │ Thread 0 │ │  │ │ Thread 0 │ │     │
│  │ │ Thread 1 │ │  │ │ Thread 1 │ │     │
│  │ │ Thread 2 │ │  │ │ Thread 2 │ │     │
│  │ │ ...      │ │  │ │ ...      │ │     │
│  │ │ Thread 31│ │  │ │ Thread 31│ │     │
│  │ └──────────┘ │  │ └──────────┘ │     │
│  │  (= 1 warp)  │  │  (= 1 warp)  │     │
│  └──────────────┘  └──────────────┘     │
│                                         │
│ Blocks map to SMs (GPU departments)     │
│ Warps (32 threads) execute together     │
└─────────────────────────────────────────┘

A concrete example: vector addition

The "hello world" of GPU programming. Add two arrays, element by element. On a CPU, you'd write a loop. On a GPU, each thread handles one addition:

// CUDA C++ kernel — each thread adds one element
__global__ void vector_add(float *a, float *b, float *result, int n) {
    // Figure out which element this thread handles
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    // Guard: some blocks may be launched with threads beyond array end
    if (idx < n) {
        result[idx] = a[idx] + b[idx];
    }
}

// Host code: launch the kernel
int main() {
    int n = 1000000;  // 1 million elements
    float *d_a, *d_b, *d_result;
    // ... allocate and populate h_a, h_b, h_result on host (CPU) side ...

    // Allocate GPU memory
    cudaMalloc(&d_a, n * sizeof(float));
    cudaMalloc(&d_b, n * sizeof(float));
    cudaMalloc(&d_result, n * sizeof(float));

    // Copy data from CPU to GPU
    cudaMemcpy(d_a, h_a, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, n * sizeof(float), cudaMemcpyHostToDevice);

    // Launch: 256 threads per block, enough blocks to cover all n elements
    int threads = 256;
    int blocks = (n + threads - 1) / threads;  // = 3907 blocks
    vector_add<<<blocks, threads>>>(d_a, d_b, d_result, n);

    // Copy results back to CPU
    cudaMemcpy(h_result, d_result, n * sizeof(float), cudaMemcpyDeviceToHost);

    // 3,907 blocks × 256 threads = ~1M threads ran simultaneously
    return 0;
}

Three phases: load (copy data to GPU), launch (run kernel), read back (copy results to CPU). This pattern is universal. The transfer cost is why small datasets often don't benefit.
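The same three phases show up in high-level Python libraries too. A sketch of vector addition with CuPy's array API, written with a NumPy fallback so the example runs on machines without a GPU (`to_host` is a small helper defined here, not a library function):

```python
try:
    import cupy as xp              # GPU arrays when CuPy + CUDA are available
    to_host = xp.asnumpy           # device → host copy
except ImportError:
    import numpy as xp             # CPU fallback with the same array API
    def to_host(a):                # no copy needed on CPU
        return a

import numpy as np

h_a = np.arange(1_000_000, dtype=np.float32)   # host ("h_") arrays
h_b = np.ones(1_000_000, dtype=np.float32)

d_a = xp.asarray(h_a)              # phase 1: load (host → device copy)
d_b = xp.asarray(h_b)
d_result = d_a + d_b               # phase 2: launch (an add kernel under the hood)
h_result = to_host(d_result)       # phase 3: read back (device → host copy)
```

With CuPy, the `+` on line "phase 2" launches a prewritten elementwise CUDA kernel; you never see the `__global__` function.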

Thread indexing: how threads know what to work on

Every thread can read three built-in variables to figure out which piece of data it should process:

They are threadIdx.x (this thread's position within its block), blockIdx.x (the block's position within the grid), and blockDim.x (the number of threads per block). The global element index: int idx = blockIdx.x * blockDim.x + threadIdx.x. Block 0 threads handle elements 0–255, block 1 threads handle 256–511, and so on (with 256-thread blocks).
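The indexing math is plain arithmetic, so it can be sanity-checked on the host. A small Python model of it (illustrative only; `global_indices` is not a CUDA API):

```python
def global_indices(block_idx, block_dim):
    """Elements covered by one block: blockIdx.x * blockDim.x + threadIdx.x."""
    return [block_idx * block_dim + t for t in range(block_dim)]

# Block 0 covers elements 0–255, block 1 covers 256–511, ...
assert global_indices(0, 256)[:3] == [0, 1, 2]
assert global_indices(1, 256)[0] == 256

# Why the `if (idx < n)` guard matters: the last block overhangs the array.
n, threads = 1_000_000, 256
blocks = (n + threads - 1) // threads    # 3,907 blocks
overhang = blocks * threads - n          # 192 threads compute nothing
```

The guard turns those 192 overhanging threads into no-ops instead of out-of-bounds writes.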

You don't write CUDA to use GPUs in practice
Libraries like PyTorch, cuDF, and CuPy have already written the CUDA kernels for you. Understanding the model helps you reason about performance, but you won't be writing __global__ functions in production ML or data work.

4. "My Pandas Code Takes 20 Minutes" — GPU for Data

This is the most accessible GPU use case for most software engineers. No ML knowledge required. Your existing pandas, Spark, or Polars code can often run dramatically faster on a GPU with minimal changes.

cudf.pandas: zero code changes, 10–150x faster

NVIDIA's cuDF library implements the pandas API on the GPU. The cudf.pandas accelerator is a drop-in: add one line at the top of your script, and pandas operations run on GPU transparently.

# Before: vanilla pandas, 20+ minutes on 10GB CSV
import pandas as pd

df = pd.read_csv("transactions_10gb.csv")
result = (
    df.groupby(["customer_id", "product_category"])
    .agg({"amount": ["sum", "mean", "count"]})
    .reset_index()
    .merge(df[["customer_id", "region"]].drop_duplicates(), on="customer_id")
)
result.to_csv("output.csv", index=False)
# After: cudf.pandas accelerator — identical code, GPU execution
import cudf.pandas  # ← the only change
cudf.pandas.install()

import pandas as pd  # pandas is now GPU-accelerated

df = pd.read_csv("transactions_10gb.csv")
result = (
    df.groupby(["customer_id", "product_category"])
    .agg({"amount": ["sum", "mean", "count"]})
    .reset_index()
    .merge(df[["customer_id", "region"]].drop_duplicates(), on="customer_id")
)
result.to_csv("output.csv", index=False)
# Runtime: ~25 seconds instead of 23 minutes (55x faster on T4 GPU)
# Install
pip install cudf-cu12  # for CUDA 12.x
# Or on Colab: !pip install cudf-cu12 --extra-index-url=https://pypi.nvidia.com

When cudf.pandas falls back to CPU

Not every pandas operation has a GPU implementation. cudf.pandas automatically falls back to CPU for unsupported operations, so your code never breaks — it just won't always be GPU-accelerated. Check with:

import cudf.pandas
cudf.pandas.install()
# ... your pandas code ...

# See what ran on GPU vs CPU by running the script under the profiler:
#   python -m cudf.pandas --profile my_script.py
# In Jupyter, use the %%cudf.pandas.profile cell magic instead.

Spark RAPIDS: 5x faster Spark, no code changes

If you're using Apache Spark for large-scale data processing, RAPIDS Accelerator for Spark is a plugin that runs Spark SQL operations on GPUs. It intercepts physical plan nodes and replaces CPU executors with GPU equivalents — zero code changes required, just a config flag.

# Add to spark-submit
--conf spark.plugins=com.nvidia.spark.SQLPlugin \
--conf spark.rapids.sql.enabled=true \
--conf spark.executor.resource.gpu.amount=1

# Or in PySpark:
spark = SparkSession.builder \
    .config("spark.plugins", "com.nvidia.spark.SQLPlugin") \
    .config("spark.rapids.sql.enabled", "true") \
    .getOrCreate()

Benchmark: a typical ETL pipeline on 1TB of Parquet — Spark on CPU: 3.5 hours. Spark RAPIDS on 8x A100: 42 minutes. At cloud spot pricing, GPU actually costs less due to shorter runtime.

Polars GPU backend

# Polars 1.x with GPU backend
# Install (run in terminal or Jupyter cell with ! prefix):
#   pip install polars[gpu]   # installs cuDF dependency

import polars as pl

# Enable GPU engine
df = pl.scan_parquet("data/*.parquet")
result = (
    df.filter(pl.col("amount") > 100)
    .group_by("category")
    .agg(pl.col("amount").sum())
    .collect(engine="gpu")  # ← run on GPU
)

Benchmarks: when GPU data processing pays off

Operation | Data size | CPU time | GPU time (T4) | Speedup
groupby + agg | 1 GB | 45 s | 2 s | 22x
join two tables | 5 GB + 500 MB | 3.5 min | 8 s | 26x
sort + dedup | 2 GB | 90 s | 4 s | 22x
string ops (ILIKE) | 500 MB | 25 s | 12 s | 2x (strings are hard)
custom Python lambda | any | baseline | baseline (CPU fallback) | 1x
When GPU data processing doesn't help
  • Data under ~100 MB: CPU+RAM is fast enough; transfer overhead dominates
  • I/O-bound pipelines: if you're waiting on network/disk, the GPU sits idle
  • Custom Python lambdas: df.apply(my_python_func) can't run on GPU — it needs the Python interpreter for every row
  • Streaming row-by-row: GPU excels at batch operations, not one-record-at-a-time processing

For more on cuDF and data transform tools, see the Data Transformation Refresher.

5. "Training Takes 3 Days" — GPU for ML

This is the bread-and-butter GPU use case. Training a neural network is fundamentally matrix multiplication repeated millions of times — exactly what GPUs are built for.

The two-line GPU upgrade in PyTorch

import torch
import torch.nn as nn

# Check if GPU is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using: {device}")  # "cuda" on Colab with GPU runtime

# Define a model
model = nn.Sequential(
    nn.Linear(784, 512),
    nn.ReLU(),
    nn.Linear(512, 256),
    nn.ReLU(),
    nn.Linear(256, 10)
)

# Move model to GPU — this copies all parameters to VRAM
model = model.to(device)

# Training loop
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

for epoch in range(10):
    for batch_x, batch_y in dataloader:
        # Move data to same device as model
        batch_x = batch_x.to(device)  # ← CPU→GPU transfer happens here
        batch_y = batch_y.to(device)

        optimizer.zero_grad()
        output = model(batch_x)       # forward pass on GPU
        loss = criterion(output, batch_y)
        loss.backward()               # backward pass on GPU
        optimizer.step()

That's it for a single GPU. model.to("cuda") and data.to("cuda") — two calls, one rule: the model and its data must live on the same device. All computation on tensors that live on the GPU runs on the GPU.

Mixed precision training: 2x faster, half the memory

By default, PyTorch uses FP32 (32-bit floats). Switching to FP16 or BF16 halves memory usage and roughly doubles throughput on modern GPUs (which have Tensor Cores designed for FP16/BF16 matmul). The torch.amp (automatic mixed precision) module handles this automatically:

from torch.amp import autocast, GradScaler

scaler = GradScaler("cuda")  # handles FP16 gradient scaling (prevents underflow)

for batch_x, batch_y in dataloader:
    batch_x, batch_y = batch_x.to(device), batch_y.to(device)
    optimizer.zero_grad()

    # Forward pass in FP16 where safe, FP32 where precision matters
    with autocast("cuda"):
        output = model(batch_x)
        loss = criterion(output, batch_y)

    # Backward pass with scaled gradients
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()

On an A100, BF16 training is typically 2–3x faster than FP32 with zero accuracy loss. On consumer GPUs (RTX 4090), FP16 with AMP gives similar benefits.

Multi-GPU training: DistributedDataParallel

import os
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Launch with: torchrun --nproc_per_node=4 train.py
dist.init_process_group(backend="nccl")  # NCCL is the GPU comms library
local_rank = int(os.environ["LOCAL_RANK"])
device = torch.device(f"cuda:{local_rank}")

model = MyModel().to(device)
model = DDP(model, device_ids=[local_rank])  # wraps model for multi-GPU

# Each GPU processes a different shard of data
# DDP automatically averages gradients across all GPUs after backward()
# With 4xA100: effectively 4x larger batch, ~3.5x faster training

Training time comparison: CPU vs GPU

Workload | CPU (32 cores) | Single T4 | Single A100 | 8x A100
ResNet-50, 1 epoch (ImageNet) | ~6 hours | ~22 min | ~8 min | ~1 min
BERT-base fine-tune (MNLI) | ~18 hours | ~45 min | ~15 min | ~2 min
GPT-2 (117M) from scratch | days | ~8 hours | ~2.5 hours | ~20 min

For more on PyTorch training patterns, see the PyTorch Refresher. For foundational ML concepts (loss functions, overfitting, regularization), see the Machine Learning Refresher.

6. "Serve 1000 LLM Requests/Second" — GPU for Inference

The hottest GPU use case in 2026. Serving large language models at scale is almost impossible without GPUs. Here's why, and how to do it.

Why LLMs need GPUs

A 7B parameter model has 7 billion floating-point numbers. At FP16 (2 bytes each), that's 14 GB just to store the weights — before you handle any requests. Each inference request involves multiplying these weights by the input embeddings repeatedly across 32+ transformer layers. That's billions of multiply-add operations per request.
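The memory arithmetic is worth internalizing, because it decides which GPU a given model even fits on. A tiny illustrative helper:

```python
def weight_memory_gb(n_params, bytes_per_param=2):
    """VRAM needed just for weights (FP16/BF16 = 2 bytes per parameter)."""
    return n_params * bytes_per_param / 1e9

print(weight_memory_gb(7e9))      # 7B model in FP16  → 14.0 GB
print(weight_memory_gb(7e9, 4))   # same model in FP32 → 28.0 GB
print(weight_memory_gb(70e9))     # 70B in FP16 → 140 GB: needs multiple GPUs
# KV cache, activations, and batching overhead come on top of this.
```

This is also the intuition behind quantization: an 8-bit 7B model fits in roughly 7 GB, which brings consumer GPUs into play.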

On a CPU, generating one token for one user takes ~500ms. On a single A10G GPU, the same computation takes ~5ms — and the GPU can batch 50 concurrent users with minimal throughput penalty, making the effective throughput 100x higher than CPU.

vLLM: the standard for LLM serving

vLLM is the dominant open-source LLM inference server. Its key innovation is PagedAttention: virtual-memory-style paging of the KV cache that lets it pack many concurrent requests into VRAM without wasting space on fixed-size allocations.

# Install and start serving Llama 3.1 8B
pip install vllm

python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Meta-Llama-3.1-8B-Instruct \
    --gpu-memory-utilization 0.9 \
    --max-model-len 8192 \
    --dtype bfloat16
# Server starts on http://localhost:8000
# Compatible with OpenAI API format
# Call the vLLM server (OpenAI-compatible)
import openai

client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="none")

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Explain CUDA in one paragraph."}],
    max_tokens=200,
)
print(response.choices[0].message.content)
# vLLM Python API (in-process, for batch jobs)
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Meta-Llama-3.1-8B-Instruct", dtype="bfloat16")
params = SamplingParams(temperature=0.7, max_tokens=200)

# Send 1000 prompts at once — vLLM batches them automatically
prompts = [f"Summarize: {text}" for text in texts]
outputs = llm.generate(prompts, params)

for output in outputs:
    print(output.outputs[0].text)

TensorRT-LLM: maximum throughput

NVIDIA TensorRT-LLM compiles your model into an optimized inference engine using kernel fusion, quantization, and custom CUDA kernels. It's more complex to set up than vLLM but delivers higher peak throughput — 2–5x over naive PyTorch inference.

# TensorRT-LLM via Docker (recommended setup)
docker pull nvcr.io/nvidia/tritonserver:24.12-trtllm-python-py3

# Convert Llama 3.1 8B to TRT engine
python examples/llama/convert_checkpoint.py \
    --model_dir /models/llama-3.1-8b \
    --output_dir /tmp/trt-llama-ckpt \
    --dtype bfloat16

trtllm-build \
    --checkpoint_dir /tmp/trt-llama-ckpt \
    --output_dir /tmp/trt-llama-engine \
    --gemm_plugin bfloat16 \
    --max_batch_size 64 \
    --max_input_len 2048 \
    --max_output_len 512

Cost math: GPU vs CPU for LLM inference

Setup | Throughput | Cost/hr | Cost per 1M tokens
CPU (64-core server, Llama 3.1 8B) | ~50 tok/s | $2.40 | $13.33
T4 GPU (16 GB), vLLM | ~1,200 tok/s | $0.53 | $0.12
A10G GPU (24 GB), vLLM | ~3,500 tok/s | $1.06 | $0.08
A100 GPU (80 GB), TRT-LLM | ~8,000 tok/s | $3.20 | $0.11

Per the table, CPU inference for LLMs is more than 100x as expensive per token as GPU inference. For any production serving load beyond a trickle of requests, the GPU pays for itself quickly.
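The last column of the table is just throughput-and-price arithmetic; a sketch you can adapt to your own cloud pricing:

```python
def cost_per_million_tokens(cost_per_hour, tokens_per_sec):
    """Dollars to generate 1M tokens at a given hourly price and throughput."""
    hours = 1_000_000 / tokens_per_sec / 3600
    return cost_per_hour * hours

# Reproduces the table rows above:
print(round(cost_per_million_tokens(2.40, 50), 2))     # CPU server → 13.33
print(round(cost_per_million_tokens(0.53, 1200), 2))   # T4 + vLLM  → 0.12
```

Note the pattern: the A100 row costs more per hour than the T4 but its throughput rises faster, so hourly price alone is a poor guide.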

NVIDIA NIM: pre-packaged inference containers

NVIDIA NIM (NVIDIA Inference Microservices) packages optimized models into ready-to-deploy containers. You pull a container, it downloads the model and starts serving. No manual TensorRT-LLM setup required.

# Pull and run a NIM container for Llama 3.1 8B
export NGC_API_KEY=your_key_here
docker run -it --rm --gpus all \
    -v ~/.cache/nim:/opt/nim/.cache \
    -p 8000:8000 \
    -e NGC_API_KEY \
    nvcr.io/nim/meta/llama-3.1-8b-instruct:1.3.3

# Server is now running, OpenAI-compatible at port 8000
curl http://localhost:8000/v1/models

For more on LLMs and transformers, see the LLMs Refresher.

7. "Find Similar Vectors in Milliseconds" — GPU for Vector Search

If you're building RAG systems, semantic search, or recommendation engines, you're doing approximate nearest-neighbor (ANN) search over large vector collections. GPUs accelerate both index building and search dramatically.

CPU Faiss vs GPU Faiss

Meta's Faiss library is the standard for vector search. It has a GPU backend that can be plugged in with minimal code changes:

import numpy as np
import faiss

# Generate 1M vectors, 768 dimensions (typical embedding size)
d = 768
n = 1_000_000
vectors = np.random.randn(n, d).astype("float32")
query_vectors = np.random.randn(1000, d).astype("float32")  # 1,000 queries

# --- CPU version ---
index_cpu = faiss.IndexFlatL2(d)
index_cpu.add(vectors)                  # ~45 seconds
distances, indices = index_cpu.search(  # ~8 seconds for 1000 queries
    query_vectors, k=10
)

# --- GPU version ---
res = faiss.StandardGpuResources()
# Copies the already-populated CPU index to GPU — vectors are already added
index_gpu = faiss.index_cpu_to_gpu(res, 0, index_cpu)  # moves to GPU 0
distances, indices = index_gpu.search(  # ~1.0 seconds (8x faster than CPU)
    query_vectors, k=10
)

cuVS: NVIDIA's next-gen vector search

cuVS (CUDA Vector Search) is NVIDIA's dedicated library for GPU-accelerated approximate nearest-neighbor search, now the recommended backend for large-scale deployments:

from cuvs.neighbors import cagra
import cupy as cp

# Move vectors to GPU
gpu_vectors = cp.asarray(vectors)
query_gpu = cp.asarray(query_vectors)  # queries must also live on the GPU

# Build CAGRA index (graph-based ANN, ~10x faster than CPU HNSW)
index_params = cagra.IndexParams(metric="sqeuclidean")
index = cagra.build(index_params, gpu_vectors)

# Search
search_params = cagra.SearchParams(itopk_size=64)
distances, neighbors = cagra.search(
    search_params, index, query_gpu, k=10
)
# Results back as cupy arrays — stay on GPU for downstream processing

When GPU vector search makes sense

Scenario | Recommendation
Under 100K vectors | CPU Faiss or pgvector is fast enough. Don't bother.
100K–10M vectors, latency <100 ms | GPU Faiss or cuVS pays off. Significant speedup.
Batch indexing (millions of new vectors/day) | GPU indexing is 5–10x faster. Reduces index rebuild time.
Real-time RAG (<50 ms p99) | GPU search can achieve this; CPU struggles at scale.
Already have GPU for inference | Run vector search on the same GPU — zero extra cost.

A practical pattern: if you're already running vLLM on a GPU for LLM inference, run your vector search on the same GPU during idle inference time. GPU utilization becomes near-100% and you get both for the price of one.

8. "Real-Time Object Detection at 30 FPS" — GPU for Vision

Computer vision workloads — object detection, image classification, video analysis — were the original GPU killer app and remain a dominant use case.

YOLOv8 on GPU: 30 FPS vs 2 FPS

from ultralytics import YOLO
import cv2

# Load YOLOv8 nano model (fastest)
model = YOLO("yolov8n.pt")

# Run on GPU
cap = cv2.VideoCapture("video.mp4")

while True:
    ret, frame = cap.read()
    if not ret:
        break

    # model auto-detects GPU and uses it
    results = model(frame, device="cuda:0", verbose=False)

    # Draw bounding boxes
    annotated = results[0].plot()
    cv2.imshow("Detection", annotated)

    # GPU: ~30 FPS at 1080p with yolov8n
    # CPU: ~2 FPS at 1080p with yolov8n (15x slower)
    if cv2.waitKey(1) == ord("q"):
        break

cap.release()
# Install
pip install ultralytics  # includes PyTorch + CUDA dependencies

# Benchmark
yolo benchmark model=yolov8n.pt imgsz=640 device=0  # GPU
yolo benchmark model=yolov8n.pt imgsz=640 device=cpu  # CPU

Video pipeline: full GPU processing

For production video pipelines, you want everything on the GPU — decode, preprocess, inference, and encode — to avoid the PCIe bottleneck between steps:

import torch
import torchvision.transforms as T
from torchvision.io import read_video  # CPU video decode, then transfer to GPU

# Decode on CPU, then move to GPU tensor
# Requires a video decoding backend: pip install av
frames, _, _ = read_video("clip.mp4", output_format="TCHW")
frames = frames.to("cuda").float() / 255.0  # (T, C, H, W) on GPU

# Batch preprocess on GPU
transform = T.Compose([
    T.Resize((640, 640)),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
# Process all frames as a batch — no loop needed
frames_preprocessed = transform(frames)  # GPU, vectorized

# Run inference on all frames at once
with torch.no_grad():
    with torch.amp.autocast("cuda"):  # PyTorch 2.0+ API
        predictions = model(frames_preprocessed)

GPU vision use cases by industry

Industry | Workload | Why GPU
Autonomous vehicles | Multi-camera real-time detection | 30 FPS × 8 cameras × 10 models simultaneously
Medical imaging | 3D MRI/CT segmentation | Large 3D volumes, batch processing overnight
Retail analytics | Store camera foot-traffic | 50+ camera streams, on-prem edge GPU
Satellite imagery | Change detection over regions | Terabytes of images, batch classification
Manufacturing QA | Defect detection on assembly line | Sub-10 ms latency requirement; GPU achieves 2 ms

9. "Monte Carlo on a Million Paths" — GPU for Scientific Computing

Scientific and quantitative workloads — simulations, numerical methods, financial modeling — often map cleanly to GPU architecture because they involve the same operation applied to many independent data points.

CuPy: NumPy on GPU, drop-in replacement

import numpy as np
import cupy as cp  # pip install cupy-cuda12x
import time

# CPU version
x_cpu = np.random.randn(10_000_000).astype(np.float32)
start = time.time()
result_cpu = np.fft.fft(x_cpu)
print(f"CPU FFT: {time.time() - start:.3f}s")  # ~0.8s

# GPU version — identical code, just cp instead of np
x_gpu = cp.asarray(x_cpu)  # copy to GPU once
start = time.time()
result_gpu = cp.fft.fft(x_gpu)  # GPU computation
cp.cuda.Stream.null.synchronize()
print(f"GPU FFT: {time.time() - start:.3f}s")  # ~0.012s (66x faster)

# CuPy array behaves like numpy array
print(result_gpu.shape)  # (10000000,)
print(type(result_gpu))  # cupy.ndarray
result_back = cp.asnumpy(result_gpu)  # back to CPU if needed

Monte Carlo option pricing: 100x faster

Monte Carlo simulation is a textbook GPU workload: simulate millions of independent random paths, then aggregate. Perfect SIMT fit — no dependencies between paths.

import cupy as cp

def monte_carlo_option_gpu(S0, K, r, sigma, T, n_paths=1_000_000, n_steps=252):
    """
    European call option pricing via Monte Carlo.
    S0: initial stock price
    K: strike price
    r: risk-free rate
    sigma: volatility
    T: time to expiry (years)
    n_paths: number of simulation paths
    n_steps: daily steps
    """
    dt = T / n_steps

    # Generate all random numbers at once on GPU
    # Shape: (n_paths, n_steps) — each row is one simulation path
    Z = cp.random.standard_normal((n_paths, n_steps), dtype=cp.float32)

    # Compute all paths simultaneously (vectorized, no Python loop)
    log_returns = (r - 0.5 * sigma**2) * dt + sigma * cp.sqrt(dt) * Z
    # Cumulative sum along time dimension = log price path
    log_prices = cp.log(S0) + cp.cumsum(log_returns, axis=1)
    final_prices = cp.exp(log_prices[:, -1])

    # Payoff for each path
    payoffs = cp.maximum(final_prices - K, 0.0)

    # Discount and average
    option_price = cp.exp(-r * T) * cp.mean(payoffs)

    return float(option_price)

# Price an option: S=100, K=105, r=5%, sigma=20%, T=1yr
price = monte_carlo_option_gpu(100, 105, 0.05, 0.20, 1.0)
print(f"Option price: ${price:.4f}")

# Timing (A100):
# GPU (1M paths): ~0.03 seconds
# CPU (NumPy, 1M paths): ~3.1 seconds (~100x faster on GPU)
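Because a European call has a Black–Scholes closed form, the simulation can be sanity-checked. A hedged NumPy CPU port for comparison (a single exact GBM step to expiry suffices here, since the payoff is not path-dependent; `monte_carlo_option_cpu` is an illustrative helper, not part of CuPy):

```python
import math
import numpy as np

def black_scholes_call(S0, K, r, sigma, T):
    """Closed-form European call price."""
    N = lambda x: 0.5 * (1 + math.erf(x / math.sqrt(2)))  # standard normal CDF
    d1 = (math.log(S0 / K) + (r + 0.5 * sigma**2) * T) / (sigma * math.sqrt(T))
    d2 = d1 - sigma * math.sqrt(T)
    return S0 * N(d1) - K * math.exp(-r * T) * N(d2)

def monte_carlo_option_cpu(S0, K, r, sigma, T, n_paths=200_000, seed=0):
    rng = np.random.default_rng(seed)
    # One step is exact for geometric Brownian motion at time T
    Z = rng.standard_normal(n_paths)
    ST = S0 * np.exp((r - 0.5 * sigma**2) * T + sigma * np.sqrt(T) * Z)
    return math.exp(-r * T) * float(np.maximum(ST - K, 0.0).mean())

exact = black_scholes_call(100, 105, 0.05, 0.20, 1.0)   # ≈ 8.02
mc = monte_carlo_option_cpu(100, 105, 0.05, 0.20, 1.0)
print(f"closed form: {exact:.4f}, Monte Carlo: {mc:.4f}")
```

The GPU version above uses 252 daily steps because that structure generalizes to path-dependent payoffs (Asian or barrier options), where no closed form exists and Monte Carlo earns its keep.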

Other scientific domains benefiting from GPU

Domain | Workload | GPU library | Typical speedup
Molecular dynamics | Protein folding simulations | GROMACS, OpenMM | 50–100x
Weather modeling | Atmospheric simulation | CUDA Fortran, cuSPARSE | 10–30x
Computational fluid dynamics | Flow simulation, FEM | CUDA, AmgX | 20–50x
Seismic processing | Subsurface imaging | cuFFT, custom CUDA | 30–100x
Financial risk | VaR, CVA, Monte Carlo | CuPy, custom CUDA | 50–200x
Genomics | Sequence alignment | NVIDIA Clara Parabricks | 50x (hours → minutes)
When CPU wins for scientific computing
Sequential simulations (where step N depends on step N-1), small-scale problems (<100K elements), or I/O-bound workflows (reading from disk between every step) won't benefit. A climate model that's one big coupled ODE system is harder to GPU-accelerate than a Monte Carlo with independent paths.

10. NVIDIA's Full Stack

NVIDIA is unusual in that it sells not just hardware but a tightly integrated hardware-to-application stack. Understanding the layers helps you know where your code sits and which NVIDIA tools are relevant to you.

NVIDIA Stack: Hardware to Application

APPLICATIONS
  ┌────────────┐ ┌────────────┐ ┌────────────┐ ┌──────────────────┐
  │ Your ML    │ │ vLLM /     │ │ RAPIDS     │ │ Omniverse /      │
  │ training   │ │ NIM        │ │ pipeline   │ │ simulation       │
  └────────────┘ └────────────┘ └────────────┘ └──────────────────┘

FRAMEWORKS
  ┌────────────┐ ┌────────────┐ ┌────────────┐ ┌──────────────────┐
  │ PyTorch    │ │ TensorRT   │ │ RAPIDS     │ │ Triton (lang)    │
  │ (cuDNN)    │ │ -LLM       │ │ (cuDF…)    │ │ OpenAI kernel    │
  └────────────┘ └────────────┘ └────────────┘ └──────────────────┘

LIBRARIES (CUDA ecosystem, 20+ years of optimized kernels)
  ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐ ┌────────┐
  │cuDNN │ │cuBLAS│ │cuFFT │ │ NCCL │ │cuSPAR│ │cuRand│ │cuSolver│
  └──────┘ └──────┘ └──────┘ └──────┘ └──────┘ └──────┘ └────────┘

CUDA (the foundation — C/C++/Python, CUDA C, PTX)
  ┌──────────────────────────────────────────────────────────────┐
  │ CUDA Runtime + Driver API + nvcc compiler + tooling          │
  └──────────────────────────────────────────────────────────────┘

HARDWARE
  ┌────────────────┐ ┌─────────────┐ ┌───────────┐ ┌─────────────┐
  │ H100/H200/B200 │ │ RTX 4090/   │ │ Jetson    │ │ DGX / HGX   │
  │ (data center)  │ │ 5090        │ │ (edge)    │ │ (clusters)  │
  └────────────────┘ └─────────────┘ └───────────┘ └─────────────┘

Hardware: the GPU lineup

| GPU | Tier | VRAM | FP16 TFLOPS | Memory BW | Best use case |
|---|---|---|---|---|---|
| H100 SXM5 | Data center (current flagship) | 80 GB HBM3 | 989 | 3.35 TB/s | LLM training/inference, large-scale ML |
| B200 | Data center (2025) | 192 GB HBM3e | 2,250 | 8 TB/s | Next-gen LLM, future workloads |
| A100 | Data center | 40/80 GB HBM2e | 312* | 2 TB/s | Most production ML today |
| L4 | Data center (inference) | 24 GB GDDR6 | 242 | 300 GB/s | Efficient inference, video transcoding |
| RTX 4090 | Desktop/workstation | 24 GB GDDR6X | 330 | 1 TB/s | Research, local LLM dev, gaming |
| T4 | Data center (budget) | 16 GB GDDR6 | 130 | 320 GB/s | Free Colab tier, affordable inference |
| Jetson Orin | Edge | 64 GB unified | 275 (INT8) | 204 GB/s | Autonomous vehicles, robotics, IoT |

* A100 312 TFLOPS FP16 figure is with 2:4 structured sparsity enabled. Dense FP16 throughput is ~77 TFLOPS. Sparsity requires the model's weight matrices to be pruned to the 2:4 sparse pattern before benefiting.

NVLink and NVSwitch: multi-GPU fabric

PCIe (the slot that connects GPU to CPU) runs at ~64 GB/s bidirectional. That's too slow when 8 GPUs need to share gradients during training. NVLink is a direct GPU-to-GPU interconnect; NVLink 4.0 provides 900 GB/s total per GPU (18 links × 50 GB/s bidirectional each). NVSwitch is a chip that creates an all-to-all mesh — any GPU can talk to any GPU at full NVLink bandwidth. An HGX H100 system (8 GPUs + NVSwitches) has 7.2 TB/s of GPU-to-GPU bandwidth, making it effectively one logical compute unit.
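To make the gap concrete, here is a back-of-envelope sketch (not a benchmark) of how long sharing one step's gradients takes at the bandwidths quoted above. `allreduce_seconds` is a made-up helper; the 2(N−1)/N traffic factor is the standard per-GPU cost of a ring all-reduce.

```python
# Back-of-envelope only: time to all-reduce 14 GB of FP16 gradients
# (a 7B-parameter model) across 8 GPUs. A ring all-reduce moves about
# 2*(N-1)/N of the buffer per GPU; real jobs see somewhat less than
# the peak bandwidths quoted in the text.

def allreduce_seconds(grad_gb: float, n_gpus: int, link_gb_per_s: float) -> float:
    traffic_gb = 2 * (n_gpus - 1) / n_gpus * grad_gb  # bytes moved per GPU
    return traffic_gb / link_gb_per_s

grads_gb = 14.0
print(f"PCIe 5.0 (64 GB/s):    {allreduce_seconds(grads_gb, 8, 64):.3f} s/step")
print(f"NVLink 4.0 (900 GB/s): {allreduce_seconds(grads_gb, 8, 900):.3f} s/step")
```

Roughly 0.38 s per step over PCIe versus 0.03 s over NVLink: at one optimizer step per second, that is the difference between the interconnect dominating your step time and being invisible.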

CUDA: the foundation

CUDA (Compute Unified Device Architecture) is NVIDIA's parallel computing platform and API, first released in 2007. It's what makes GPUs programmable for general computation (not just graphics). Everything else in the stack runs on top of CUDA.

Key insight: you almost never write CUDA directly. PyTorch, cuDF, and TensorRT have already written thousands of highly optimized CUDA kernels. You use their Python/C++ APIs and get GPU acceleration automatically.

Key libraries you'll encounter

| Library | What it does | Used by |
|---|---|---|
| cuDNN | Optimized primitives for deep learning (conv, pool, norm) | PyTorch, TensorFlow |
| cuBLAS | GPU-accelerated BLAS (matrix multiply at its core) | Everything that does linear algebra |
| cuFFT | Fast Fourier Transform on GPU | Signal processing, audio ML |
| NCCL | GPU collective comms (all-reduce, broadcast) | Multi-GPU/multi-node training |
| cuSPARSE | Sparse matrix operations on GPU | GNNs, sparse models, scientific sim |
| Thrust | GPU parallel algorithms (sort, reduce, scan) | cuDF under the hood |
| Triton (OpenAI) | Python-based custom GPU kernel language | Flash Attention, custom ops |

11. When NOT to Use a GPU

GPU cargo-culting is real. I've seen engineers reach for GPU acceleration for web scraping, small CSV processing, and microservices with 10 req/s — and make their system slower, more expensive, and harder to operate. Here's when to stay on CPU.

Not every problem needs a GPU
GPUs introduce latency (kernel launch overhead), complexity (VRAM management, driver versions), and cost. Only use them when the throughput gain justifies these trade-offs.

1. I/O-bound workloads

If your bottleneck is network latency, database queries, or disk reads, the GPU sits idle 95% of the time. You're paying for a Ferrari to wait at red lights.

Example: A web scraper that fetches 100 URLs, extracts text, calls an external API, and saves results. Bottleneck: network I/O (100–500ms per URL). GPU utilization: ~0%.

Test: Run nvidia-smi while your job is running. If GPU utilization is below 50% most of the time, you're I/O-bound.

2. Small datasets

The CPU-to-GPU PCIe transfer costs real time. For a 10 MB DataFrame, the transfer itself takes ~0.3 ms and the GPU computation might take 1 ms, while a warm CPU spends ~0.01 ms pulling the same data from L3 cache and perhaps 5 ms computing. GPU total: ~1.3 ms. CPU total: ~5 ms. On paper the GPU still wins, but the margin is small, and kernel launch overhead, allocation, and copying results back to the CPU routinely erase it in practice.

As a rough rule: if your dataset fits comfortably in CPU L3 cache (tens of MB), or if transfer time is more than 30% of your total compute time, GPU probably doesn't help.
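A minimal sketch of that rule of thumb. `transfer_dominates` is a hypothetical helper, and 32 GB/s is an assumed effective host-to-device rate (which conveniently equals 32 MB/ms), not a measured figure:

```python
# Sketch of the 30% rule above: is the PCIe transfer too large a share
# of the GPU compute time to be worth it? 32 GB/s == 32 MB/ms.

PCIE_MB_PER_MS = 32.0  # assumed effective host-to-device bandwidth

def transfer_dominates(data_mb: float, gpu_compute_ms: float) -> bool:
    """True if the PCIe transfer would exceed 30% of the GPU compute time."""
    transfer_ms = data_mb / PCIE_MB_PER_MS
    return transfer_ms > 0.30 * gpu_compute_ms

print(transfer_dominates(10, 1.0))    # the 10 MB DataFrame case: borderline
print(transfer_dominates(100, 20.0))  # 10x the data, 20x the compute: fine
```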

3. Branch-heavy, control-flow-heavy code

Code where different data items take different paths through many if/else branches is a poor fit. Warp divergence forces the warp to execute each branch path serially, with the threads not on that path masked off.

Example: Decision tree inference (many branches based on feature values). GPU implementations exist but are complex; CPU often wins for single-tree inference.

4. Sequential algorithms

Algorithms where step N depends on the output of step N-1 can't be parallelized. A linked list traversal, a recursive depth-first search, a sequential state machine — these run on a single thread and gain nothing from GPU parallelism.

Rule of thumb: if your algorithm can't be expressed as "apply this function to all elements independently," it probably doesn't GPU-accelerate well.
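In code, the distinction looks like this: a pure-Python sketch, with an exponential moving average standing in for any step-N-depends-on-step-N−1 recurrence.

```python
# The map parallelizes because each element is independent; the
# moving average does not, because step N consumes step N-1's result.

def gpu_friendly(xs):
    return [x * x for x in xs]  # any element could be computed first

def gpu_hostile(xs, alpha=0.5):
    out, state = [], 0.0
    for x in xs:
        state = alpha * x + (1 - alpha) * state  # depends on previous state
        out.append(state)
    return out

print(gpu_friendly([1, 2, 3]))  # [1, 4, 9]
print(gpu_hostile([1, 2, 3]))   # [0.5, 1.25, 2.125]
```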

5. Microsecond-latency requirements

CUDA kernel launch has overhead: ~5–10 microseconds just to start a kernel, even a trivial one. For high-frequency trading, real-time control systems, or anything where sub-100μs response is required, this overhead is unacceptable.

HFT, for example, stays almost entirely on CPU + FPGA: the determinism and sub-microsecond latency of CPU cache-resident code is irreplaceable.
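The budget math is unforgiving. A sketch using the ~5 μs launch figure above (`overhead_fraction` is a made-up helper):

```python
# Fraction of a latency budget consumed by kernel launches alone,
# assuming ~5 microseconds per launch (the figure quoted in the text).

LAUNCH_US = 5.0

def overhead_fraction(budget_us: float, kernels: int = 1) -> float:
    """Share of the budget spent just starting kernels."""
    return kernels * LAUNCH_US / budget_us

print(f"100 us budget, 1 kernel:   {overhead_fraction(100):.0%}")
print(f"100 us budget, 10 kernels: {overhead_fraction(100, 10):.0%}")
print(f"10 us budget, 1 kernel:    {overhead_fraction(10):.0%}")
```

Even one kernel eats half of a 10 μs budget before it does any work, which is why this class of system never leaves the CPU.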

6. Sporadic, low-frequency jobs

If your batch job runs once per day and takes 15 minutes, renting a GPU for 15 minutes ($0.05) vs CPU for 2 hours ($0.20) saves $0.15/day — $55/year. Not worth the operational complexity of managing GPU instances unless you have many such jobs.
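The arithmetic, spelled out. The hourly rates are back-solved from the figures above ($0.20/hr GPU, $0.10/hr CPU) and purely illustrative:

```python
# Daily and yearly savings for the once-per-day batch job above.
# Rates are illustrative, not quotes from any provider.

gpu_cost_per_day = (15 / 60) * 0.20  # 15 min at an assumed $0.20/hr
cpu_cost_per_day = 2 * 0.10          # 2 hrs at an assumed $0.10/hr
daily_saving = cpu_cost_per_day - gpu_cost_per_day
print(f"${daily_saving:.2f}/day, ${daily_saving * 365:.0f}/year")
```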

Decision tree: should you use a GPU?

Should I use a GPU?

Is your workload data-parallel? (same op on many independent items)
  No  → Stay on CPU. GPU won't help.
  Yes ↓

Is your data large enough? (>100 MB, or >100K vectors)
  No  → CPU is fast enough. Transfer overhead not worth it.
  Yes ↓

Is your bottleneck I/O (network/disk), not compute?
  Yes → Fix I/O first. GPU won't help.
  No  ↓

Do you need sub-millisecond latency?
  Yes → CPU only. GPU kernel launch overhead ~5–10 μs.
  No  ↓

Is the algorithm sequential (step N depends on step N-1)?
  Yes → Hard to parallelize. Stay on CPU.
  No  ↓

✓ GPU will likely help. Start with a high-level library (cudf.pandas, PyTorch, vLLM) before writing any CUDA.

12. Your First GPU Program

The fastest path from zero to running GPU code, depending on your background.

Easiest: Google Colab (free, no setup)

# In Google Colab:
# Runtime → Change runtime type → T4 GPU → Save

# Verify you have a GPU
import torch
print(torch.cuda.is_available())        # True
print(torch.cuda.get_device_name(0))    # e.g. "Tesla T4"

import subprocess
subprocess.run(["nvidia-smi"])           # Shows GPU memory and utilization

Data engineer: cudf.pandas in 5 minutes

# On Colab with GPU runtime:
!pip install cudf-cu12 --extra-index-url=https://pypi.nvidia.com -q
import cudf.pandas
cudf.pandas.install()

import pandas as pd
import numpy as np

# Generate 10M row dataset
n = 10_000_000
df = pd.DataFrame({
    "user_id": np.random.randint(0, 100000, n),
    "amount": np.random.randn(n) * 100,
    "category": np.random.choice(["A", "B", "C", "D"], n),
})

# This now runs on GPU — same pandas API
result = df.groupby(["user_id", "category"]).agg(
    total=("amount", "sum"),
    avg=("amount", "mean"),
    count=("amount", "count"),
).reset_index()

print(result.head())
print(f"Rows in result: {len(result):,}")

ML engineer: MNIST on GPU (30 minutes)

import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, transforms
from torch.utils.data import DataLoader

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Data
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,))
])
train_data = datasets.MNIST(".", train=True, download=True, transform=transform)
loader = DataLoader(train_data, batch_size=256, shuffle=True, num_workers=2)

# Model
model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(784, 256), nn.ReLU(), nn.Dropout(0.2),
    nn.Linear(256, 128), nn.ReLU(), nn.Dropout(0.2),
    nn.Linear(128, 10)
).to(device)

optimizer = optim.Adam(model.parameters())
criterion = nn.CrossEntropyLoss()

# Train 5 epochs
for epoch in range(5):
    correct = 0
    for x, y in loader:
        x, y = x.to(device), y.to(device)  # ← GPU transfer
        optimizer.zero_grad()
        out = model(x)
        loss = criterion(out, y)
        loss.backward()
        optimizer.step()
        correct += (out.argmax(1) == y).sum().item()
    print(f"Epoch {epoch+1}: {correct/len(train_data)*100:.1f}% accuracy")

# On GPU: ~15 seconds total. On CPU: ~3 minutes.

LLM engineer: vLLM quickstart

# Requires GPU with at least 16 GB VRAM (A10G, A100, or L4)
pip install vllm

# Serve Mistral 7B
python -m vllm.entrypoints.openai.api_server \
    --model mistralai/Mistral-7B-Instruct-v0.3 \
    --dtype bfloat16 \
    --max-model-len 4096

# Test it
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model":"mistralai/Mistral-7B-Instruct-v0.3",
         "messages":[{"role":"user","content":"What is CUDA?"}],
         "max_tokens":200}'

Cloud options for getting a GPU

| Platform | GPU | Price | Best for |
|---|---|---|---|
| Google Colab | T4 (free), A100 (Pro) | Free / $10/mo | Learning, prototyping |
| Kaggle Notebooks | T4 / P100 | Free (30 hrs/wk) | Learning, competitions |
| Lambda Labs | A10, A100, H100 | $0.50–$3.50/hr | Training, serious dev |
| Vast.ai | Wide variety | $0.20–$2/hr | Budget training |
| AWS (g4dn.xlarge) | T4 | $0.53/hr on-demand, $0.16 spot | Production inference |
| GCP (a2-highgpu-1g) | A100 40 GB | $3.67/hr on-demand, $1.10 spot | Production training |

13. GPU Memory Management

The number one source of production GPU headaches. Unlike system RAM (which can use swap to extend beyond physical limits), GPU VRAM is strictly bounded. Exceed it and you get an out-of-memory error, not a slowdown.

The error every ML engineer knows

# The most common GPU error you will encounter:
# RuntimeError: CUDA out of memory. Tried to allocate 2.00 GiB
# (GPU 0; 15.90 GiB total capacity; 13.47 GiB already allocated)

This happens because you tried to allocate more memory than the GPU has available. Common causes: batch size too large, model too large for your GPU, memory not being freed from previous iterations, or multiple processes sharing the same GPU.

How GPU memory gets used in a training job

import torch

# Check current GPU memory usage
def print_memory_stats():
    allocated = torch.cuda.memory_allocated() / 1e9
    reserved = torch.cuda.memory_reserved() / 1e9
    print(f"Allocated: {allocated:.2f} GB | Reserved: {reserved:.2f} GB")

# Memory is consumed by:
# 1. Model parameters (fixed, based on model size)
# 2. Gradients (same size as parameters, during backward)
# 3. Optimizer states (2x parameters for Adam: m + v)
# 4. Activations (proportional to batch size × model depth)
# 5. PyTorch's own allocator overhead

# For a 7B parameter model in FP16:
# Parameters:     7B × 2 bytes = 14 GB
# Gradients:                   = 14 GB
# Adam optimizer:              = 28 GB  (2 states × 14 GB, if kept in FP16)
# Total minimum:               = 56 GB  ← why the 80 GB H100 is popular
# Note: standard mixed-precision recipes keep FP32 master weights and
# FP32 Adam states, adding roughly 12 bytes/param on top of this
# simplified figure.
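The same arithmetic as a reusable sketch. `training_gb` is a made-up helper and uses the simplified assumption above (gradients and both Adam states stored in the weight dtype):

```python
# Rough training-memory estimate: params + grads + 2 Adam states,
# all in the same dtype. Activations and allocator overhead not included.

def training_gb(n_params: float, bytes_per_param: int = 2) -> dict:
    p = n_params * bytes_per_param / 1e9  # parameter memory in GB
    return {"params": p, "grads": p, "adam": 2 * p, "total": 4 * p}

print(training_gb(7e9))   # the 7B FP16 case: total 56 GB
print(training_gb(70e9))  # a 70B model: why single-GPU training is out
```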

Strategies to reduce memory usage

import torch
from torch.amp import autocast  # PyTorch 2.0+ API

# Strategy 1: Reduce batch size (most direct lever)
# batch_size = 256 → OOM
# batch_size = 32 → works. Use gradient accumulation to simulate large batch:
accumulation_steps = 8
optimizer.zero_grad()
for i, (x, y) in enumerate(loader):
    x, y = x.to(device), y.to(device)
    with autocast("cuda"):
        output = model(x)
        loss = criterion(output, y) / accumulation_steps
    loss.backward()
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()

# Strategy 2: Mixed precision (halves activation memory)
# FP32 → FP16: 2x memory reduction for activations
with autocast("cuda", dtype=torch.bfloat16):
    output = model(x)

# Strategy 3: Gradient checkpointing (trade compute for memory)
# Recomputes activations during backward instead of storing them
# ~30% slower but saves 60-70% activation memory
from torch.utils.checkpoint import checkpoint
output = checkpoint(model.layers[0], x, use_reentrant=False)  # per layer

# For transformer models (HuggingFace):
model.gradient_checkpointing_enable()

# Strategy 4: Clear cache between phases
torch.cuda.empty_cache()  # releases unused reserved memory (not allocated)
del tensor_you_no_longer_need  # explicit deletion + Python GC

Monitoring GPU memory

# Watch GPU stats in terminal (refresh every 0.5s)
watch -n 0.5 nvidia-smi

# One-liner for memory summary
nvidia-smi --query-gpu=memory.used,memory.free,utilization.gpu \
           --format=csv

# Install gpustat for a nicer view
pip install gpustat
gpustat --watch  # colored, per-process breakdown

# PyTorch allocator breakdown (great for debugging OOM)
print(torch.cuda.memory_summary(device=None, abbreviated=False))

# Profile memory during training to find the leak
with torch.profiler.profile(
    activities=[torch.profiler.ProfilerActivity.CUDA],
    profile_memory=True
) as prof:
    output = model(x)
    loss = criterion(output, y)
    loss.backward()
print(prof.key_averages().table(sort_by="cuda_memory_usage"))

PCIe bottleneck: minimize CPU↔GPU transfers

# BAD: transferring small tensors one at a time
for item in large_list:
    tensor = torch.tensor(item).to("cuda")  # PCIe transfer per item
    result = model(tensor)

# GOOD: batch your transfers
batch = torch.tensor(large_list)  # one large CPU tensor
batch_gpu = batch.to("cuda")      # one transfer amortizes per-copy setup cost
results = model(batch_gpu)

# GOOD: pin memory for faster CPU→GPU transfers
loader = DataLoader(
    dataset,
    batch_size=256,
    pin_memory=True,      # allocates page-locked memory (~1.5x faster transfers)
    num_workers=4,        # load CPU data in background while GPU computes
)

Multi-GPU memory strategies

When a model is too large for one GPU's VRAM, you have three options:
  • Data parallelism (DDP): same model on each GPU, different data shards. Gradients averaged. Requires the model to fit on one GPU.
  • Tensor parallelism: split individual weight matrices across GPUs. Used by Megatron-LM. Complex to implement.
  • Pipeline parallelism: different transformer layers on different GPUs. Supported by DeepSpeed and Megatron-LM. A middle ground. Note that DeepSpeed's ZeRO (like PyTorch FSDP) is a different technique: it shards parameters, gradients, and optimizer states across data-parallel GPUs rather than splitting the model itself.
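Back-of-envelope per-GPU numbers for the 7B FP16 training example from Section 13, at the two extremes (full replica vs. fully sharded); this ignores activations, buffers, and the communication that sharding adds:

```python
# Per-GPU training state for a 7B FP16 model on an 8-GPU node.
# 56 GB = params + grads + 2 Adam states (the Section 13 estimate).

STATE_GB = 56.0
N_GPUS = 8

print(f"DDP (full replica per GPU):  {STATE_GB:.0f} GB each")
print(f"Fully sharded (ZeRO-3/FSDP): {STATE_GB / N_GPUS:.0f} GB each")
```

The sharded figure is why a model that OOMs under plain DDP can train comfortably on the same hardware with FSDP or ZeRO, at the price of extra gather/scatter traffic.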

14. Cost & ROI

Before renting a GPU, you need numbers. Here are real ones.

GPU cloud pricing (2026)

| GPU | VRAM | On-demand ($/hr) | Spot ($/hr) | 1-yr reserved ($/hr) | Provider |
|---|---|---|---|---|---|
| T4 | 16 GB | $0.35–0.53 | $0.10–0.16 | $0.22 | Lambda, AWS, GCP |
| A10G | 24 GB | $0.75–1.10 | $0.25–0.40 | $0.45 | AWS, Lambda |
| A100 (40 GB) | 40 GB | $2.00–3.00 | $0.60–1.00 | $1.20 | Lambda, GCP, AWS |
| A100 (80 GB) | 80 GB | $2.50–4.00 | $0.75–1.50 | $1.50 | Lambda, Azure, GCP |
| H100 (80 GB) | 80 GB | $3.50–7.00 | $1.50–3.00 | $2.00–2.50 | CoreWeave, Lambda, Azure |
| H100 (8x DGX) | 640 GB | $28–56 | $12–24 | $16–20 | CoreWeave, GCP, Azure |

ROI scenarios

| Workload | CPU cost | GPU cost | GPU savings | Break-even |
|---|---|---|---|---|
| LLM inference (1M tokens/day, Llama 8B) | $12.00/day (CPU cluster) | $0.50/day (T4, spot) | 96% cheaper | Immediate |
| Data processing (1 TB/day ETL) | $18/day (large CPU instance) | $3/day (8x T4 spot, 45 min) | 83% cheaper | Immediate |
| ML training (1B param model) | $240 (CPU, 4 days) | $18 (A100, 6 hrs) | 92% cheaper | Immediate |
| Batch image processing (10M images/day) | $8/day | $1.50/day | 81% cheaper | Immediate |
| Small analytics job (10 GB, twice/week) | $0.40/week (c5.2xlarge) | $0.80/week (g4dn, setup overhead) | CPU is 2x cheaper | Never |
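The "Break-even" column can be sketched as a formula. This is a hedged illustration: `jobs_to_break_even` and the $50/month operational-overhead figure (driver upkeep, custom images, on-call familiarity) are hypothetical, not from any provider:

```python
# How many jobs per month before per-job GPU savings cover a fixed
# monthly operational overhead? All dollar figures are placeholders.

def jobs_to_break_even(cpu_per_job: float, gpu_per_job: float,
                       monthly_overhead: float) -> float:
    saving = cpu_per_job - gpu_per_job
    return monthly_overhead / saving if saving > 0 else float("inf")

print(jobs_to_break_even(12.00, 0.50, 50.0))  # LLM-inference-like job
print(jobs_to_break_even(0.40, 0.80, 50.0))   # small analytics job: never
```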

When GPU is NOT worth it (cost perspective)

Cost optimization tips

15. The Vera Rubin Generation (GTC 2026)

NVIDIA's GTC 2026 announcements define the direction for the next 2–3 years. Here's what's coming and what it means for you.

Blackwell vs. Vera Rubin: architecture timeline

These product names are often conflated, so to be precise: Blackwell (the B200 GPU and the Grace Blackwell GB200 superchip) is the generation shipping today, and Vera Rubin, pairing the Vera CPU with the Rubin GPU, is its announced successor.

The 192 GB HBM3e VRAM spec applies to the B200 (standalone) and the GB200 (Grace Blackwell) package today. Vera Rubin will bring further capacity and bandwidth increases.

BlueField-4 DPU

The BlueField-4 DPU (Data Processing Unit) offloads networking and security operations from the CPU/GPU. In a dense GPU cluster, a significant portion of CPU time is consumed by network protocol handling, encryption, and storage I/O. DPUs move this to dedicated silicon, freeing the GPU and CPU for computation. For ML clusters, this means higher effective GPU utilization at scale.

NVLink at rack scale

With NVLink Switch Systems, NVIDIA is scaling from 8 GPUs connected (HGX today) to entire racks and eventually buildings of GPUs acting as a unified compute fabric. The NVLink 5 system (announced 2026) connects up to 576 GPUs with 1.8 TB/s per-GPU bandwidth — enough to train models too large to fit on any single cluster today.

"Huang's Law"

Unlike Moore's Law (transistor density), Huang's Law describes GPU performance gains from three compounding factors: new compute architectures, increased memory bandwidth, and improved interconnects. NVIDIA has delivered roughly 1,000x AI performance improvement per decade — faster than traditional semiconductor scaling.
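The compounding math behind that claim, compared with classic Moore's-law doubling every two years:

```python
# Per-year growth factors implied by each "law".

huang = 1000 ** (1 / 10)  # 1,000x per decade
moore = 2 ** (1 / 2)      # 2x every 2 years

print(f"Huang's Law: {huang:.2f}x per year")  # ~2x
print(f"Moore's Law: {moore:.2f}x per year")  # ~1.41x
```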

Platform integration: where GPUs are appearing

| Platform | GPU integration | For you |
|---|---|---|
| Snowflake | GPU-accelerated queries (RAPIDS inside) | SQL queries run on GPU, no config change |
| Databricks | GPU clusters for ML, Spark RAPIDS | Check "GPU enabled" when creating a cluster |
| Google BigQuery | GPU-accelerated ML functions | ML.PREDICT can use a GPU backend |
| AWS SageMaker | Managed GPU training + inference | Select a GPU instance type in the training job config |
| Azure ML | GPU compute clusters, NVIDIA AI Enterprise | Standard for enterprise ML in the Azure ecosystem |

The implication for your career: GPU skills are transitioning from "ML specialist" to "data engineer and backend engineer baseline knowledge." The platforms you already use are quietly adding GPU acceleration under the hood, and understanding the model helps you use them effectively.

16. Key Terminology Glossary

Compute concepts

| Term | Definition |
|---|---|
| CUDA core | Basic FP32 arithmetic unit on an NVIDIA GPU. H100 has 16,896. Not the same as a CPU core — much simpler, designed for throughput, not latency. |
| Tensor Core | Specialized unit for matrix multiply-accumulate in mixed precision (FP16/BF16/INT8/FP8). Much faster than CUDA cores for deep learning. H100 has 528 Tensor Cores. |
| SM (Streaming Multiprocessor) | The major organizational unit of a GPU — like a CPU core, but with many more arithmetic units. H100 has 132 SMs, each with 128 CUDA cores. |
| Warp | Group of 32 threads that execute the same instruction in lock-step (SIMT). The fundamental scheduling unit of a GPU SM. |
| Block | Programmer-defined group of 1–1,024 threads that share shared memory and can synchronize. Scheduled to run on a single SM. |
| Grid | The full set of blocks launched for a single kernel call. Can contain millions of blocks. |
| Kernel | A function executed on the GPU by many threads simultaneously. Written in CUDA C/C++ or Triton, called from host (CPU) code. |
| Occupancy | Ratio of active warps to maximum possible warps on an SM. Higher occupancy generally means better latency hiding. Low occupancy = underutilized GPU. |
| Kernel fusion | Combining multiple GPU operations into a single kernel to reduce memory round-trips. Key optimization in TensorRT and FlashAttention. |

Memory concepts

| Term | Definition |
|---|---|
| HBM (High Bandwidth Memory) | 3D-stacked DRAM used as GPU VRAM. HBM3 achieves 3.35 TB/s bandwidth — 10x+ faster than DDR5 system RAM. |
| Shared memory | Fast on-chip SRAM shared by all threads in a block (up to 228 KB of combined L1/shared memory per SM on H100). The programmer controls what goes here. Think of it as a manually managed L1 cache. |
| Pinned memory | CPU RAM that is page-locked and not swappable. CPU→GPU transfers from pinned memory are ~2x faster than from pageable memory. |
| Memory coalescing | When 32 threads in a warp access consecutive memory addresses, they're served in a single memory transaction. Non-coalesced access (scattered addresses) requires multiple transactions — a major performance penalty. |
| NVLink | Direct GPU-to-GPU interconnect, bypassing PCIe. NVLink 4.0: 900 GB/s. Used in multi-GPU training to share gradients efficiently. |
| NVSwitch | Switch chip that creates an all-to-all NVLink topology in DGX systems. Every GPU can communicate with every other GPU at full NVLink speed. |
| PCIe | Standard bus connecting GPU to CPU. Gen 5: 64 GB/s. The bottleneck for CPU↔GPU data transfer — roughly 14x slower than NVLink 4.0's 900 GB/s. |

Precision formats

| Format | Bits | Range | Use case |
|---|---|---|---|
| FP64 | 64 | ±10^308 | Scientific computing where precision matters. CPUs are great; GPUs are slower. |
| FP32 | 32 | ±10^38 | Default for training. Safe, less likely to overflow or underflow. |
| TF32 | 19 (internally) | Same as FP32 | NVIDIA internal format for Tensor Cores. Used automatically in PyTorch by default. Near-FP32 accuracy, up to 3x faster. |
| BF16 | 16 | Same range as FP32 | Preferred for LLM training. Same exponent as FP32 (no overflow), lower mantissa precision. Supported on A100+. |
| FP16 | 16 | ±65,504 | Mixed-precision training on older GPUs (V100, T4). Smaller range than BF16 — requires gradient scaling to prevent underflow. |
| FP8 | 8 | Limited | Inference quantization. H100/H200 support FP8 Tensor Cores natively. ~2x speed vs FP16 with careful calibration. |
| INT8 | 8 | −128 to 127 | Post-training quantization for inference. Up to 4x throughput vs FP32. Requires a calibration dataset. TensorRT INT8 is standard. |
| INT4 | 4 | −8 to 7 | Aggressive quantization (llama.cpp, GGUF format). Fits a 70B model in 48 GB VRAM. ~1% accuracy loss vs FP16. |
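The weights-only memory arithmetic behind that table (KV cache, activations, and runtime overhead come on top), which is also why a 70B INT4 model fits in 48 GB:

```python
# Weight memory for a 70B-parameter model at each precision.

def weights_gb(n_params: float, bits: int) -> float:
    return n_params * bits / 8 / 1e9  # params x bytes-per-param, in GB

for name, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{name}: {weights_gb(70e9, bits):.0f} GB")
```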

Performance metrics

| Term | Definition |
|---|---|
| FLOPS | Floating-Point Operations Per Second. Measures compute throughput. |
| TFLOPS | Tera-FLOPS = 10^12 FLOPS. H100: 67 TFLOPS (FP32), 494 TFLOPS (TF32), 989 TFLOPS (FP16), dense — roughly double each with 2:4 sparsity. |
| PFLOPS | Peta-FLOPS = 10^15 FLOPS. The Top500 supercomputer benchmark uses PFLOPS. A DGX H100 (8 GPUs): ~32 PFLOPS (FP8). |
| MFU (Model FLOPs Utilization) | Percentage of theoretical peak FLOPS your training actually achieves. 40–60% MFU is good. Lower means memory-bound or poorly batched. |
| Quantization | Reducing the numeric precision of model weights (FP32 → INT8 → INT4) to shrink VRAM usage and increase throughput. Slight accuracy trade-off. |

17. Learning Roadmap

A structured path from "never used a GPU" to "production GPU engineer," organized by week with concrete deliverables.

Week 1: First contact — understand the model

Week 2: Real workloads — memory and performance

Week 3: Data and inference acceleration

Week 4: Multi-GPU and production readiness

Resources

| Resource | Format | Best for |
|---|---|---|
| NVIDIA DLI: Fundamentals of Accelerated Computing | Interactive course (~8 hrs) | Hands-on CUDA C from scratch |
| Programming Massively Parallel Processors (Kirk & Hwu) | Textbook | Deep architecture understanding |
| fast.ai Practical Deep Learning | Free course | Applied ML with GPU, top-down teaching style |
| Andrej Karpathy: Zero to Hero | YouTube series | Build GPT from scratch, includes GPU training |
| Triton Tutorials | Code + docs | Write custom GPU kernels in Python (no CUDA C) |
| PyTorch Official Tutorials | Interactive | PyTorch fundamentals with GPU examples |
| NVIDIA GTC recordings (YouTube) | Talks (30–60 min each) | Latest techniques, architecture announcements |
The fastest path: use before understanding
You don't need to understand SIMT architecture to add .to("cuda") to a PyTorch training loop. Start with the practical use cases (Sections 4–8) that apply to your current work. Come back to the architecture sections when you hit performance problems and need to diagnose why.