Deep Learning Refresher
Neural networks from scratch to CNNs, RNNs, and beyond — the Andrew Ng way
From Logistic Regression to Neural Networks
If you understand logistic regression, you already understand a single neuron. Logistic regression takes inputs, multiplies by weights, adds a bias, and passes through a sigmoid to output a probability. A neuron does exactly that.
So why do we need more? Because a single neuron can only draw a straight line (or hyperplane) to separate classes. Consider the XOR problem: points at (0,0) and (1,1) are class 0, points at (0,1) and (1,0) are class 1. No single straight line can separate them. But stack two neurons together and it's trivial.
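The XOR claim is easy to check numerically. Here is a minimal sketch with hand-picked weights (the step activation and these specific weights are illustrative choices, not from the text): one hidden neuron computes OR, another computes AND, and the output neuron subtracts them.

```python
import numpy as np

def step(z):
    # Hard threshold, just for illustration (a trained network would use sigmoid/ReLU)
    return (z > 0).astype(float)

# The four XOR inputs as columns: (0,0), (0,1), (1,0), (1,1)
X = np.array([[0, 0, 1, 1],
              [0, 1, 0, 1]], dtype=float)

# Hidden layer: neuron 1 fires on OR(x1, x2), neuron 2 fires on AND(x1, x2)
W1 = np.array([[1.0, 1.0],
               [1.0, 1.0]])
b1 = np.array([[-0.5], [-1.5]])
# Output neuron: OR minus AND = XOR
W2 = np.array([[1.0, -1.0]])
b2 = np.array([[-0.5]])

H = step(W1 @ X + b1)
Y = step(W2 @ H + b2)
print(Y.flatten())  # [0. 1. 1. 0.]
```

No single line through the 2D input space produces that output pattern, but two stacked neurons do it with room to spare.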
Network Architecture
Think of a neural network as an assembly line. Raw materials (pixels, words, numbers) enter at one end. Each station (layer) transforms them, extracting increasingly abstract features. At the other end, out comes a prediction.
- Input layer: your raw features. For a 64x64 image, that's 12,288 numbers (64 × 64 × 3 color channels).
- Hidden layers: where the learning happens. Layer 1 might detect edges, layer 2 combines edges into shapes, layer 3 recognizes objects.
- Output layer: the final answer. One sigmoid neuron for binary classification, or a softmax layer for multi-class.
Notation
Following Ng's convention (superscript in brackets means layer):
| Symbol | Meaning | Shape |
|---|---|---|
| W[l] | Weight matrix for layer l | (n[l], n[l-1]) |
| b[l] | Bias vector for layer l | (n[l], 1) |
| z[l] | Linear output: W[l]a[l-1] + b[l] | (n[l], 1) |
| a[l] | Activation output: g(z[l]) | (n[l], 1) |
| a[0] | Input features (same as x) | (n[0], 1) |
Forward Propagation
Forward prop is "run the network": push inputs through each layer, left to right, until you get a prediction. Let's trace through a concrete example.
Step by Step
A 2-layer network: 3 inputs → 4 hidden neurons (ReLU) → 1 output (sigmoid).
import numpy as np
# --- Layer by layer ---
# Input: x = [1.0, 0.5, -0.3]
x = np.array([[1.0], [0.5], [-0.3]]) # shape (3, 1)
# Layer 1: Linear + ReLU
W1 = np.array([[ 0.2, -0.1, 0.4],
[ 0.5, 0.3, -0.2],
[-0.3, 0.1, 0.6],
[ 0.1, 0.4, 0.2]]) # shape (4, 3)
b1 = np.array([[0.1], [-0.1], [0.0], [0.2]]) # shape (4, 1)
z1 = W1 @ x + b1 # Linear: z = Wx + b
# z1 = [[0.13], [0.61], [-0.43], [0.44]]
a1 = np.maximum(0, z1) # ReLU: max(0, z)
# a1 = [[0.13], [0.61], [0.00], [0.44]] (negative value clipped to 0)
# Layer 2: Linear + Sigmoid
W2 = np.array([[0.3, -0.2, 0.5, 0.1]]) # shape (1, 4)
b2 = np.array([[-0.1]]) # shape (1, 1)
z2 = W2 @ a1 + b2
# z2 = 0.3*0.13 + (-0.2)*0.61 + 0.5*0.00 + 0.1*0.44 + (-0.1) = -0.139
a2 = 1 / (1 + np.exp(-z2)) # Sigmoid
# a2 ≈ 0.465 → prediction: probability of class 1
print(f"z1 = {z1.flatten()}")
print(f"a1 = {a1.flatten()}")
print(f"z2 = {z2.flatten()}")
print(f"a2 = {a2.item():.4f}") # ≈ 0.4653
Vectorization Across Examples
In practice you don't process one example at a time. Stack all m examples into a matrix X of shape (n, m), and the same code processes them all simultaneously:
def forward_propagation(X, parameters):
"""
X: shape (n_features, m_examples)
parameters: dict with W1, b1, W2, b2, ...
Returns: predictions and cache for backprop
"""
cache = {'A0': X}
A = X
L = len(parameters) // 2 # number of layers
for l in range(1, L + 1):
W = parameters[f'W{l}']
b = parameters[f'b{l}']
Z = W @ A + b # broadcasting handles (n, 1) + (n, m)
if l == L:
A = 1 / (1 + np.exp(-Z)) # sigmoid for output layer
else:
A = np.maximum(0, Z) # ReLU for hidden layers
cache[f'Z{l}'] = Z
cache[f'A{l}'] = A
return A, cache
@ uses optimized BLAS routines. Always think in matrices: stack examples as columns, and one matrix multiply processes the entire batch.
Backpropagation
Backprop answers: "if I wiggle this weight slightly, how much does the loss change?" That's the gradient — and once you have it, you can nudge weights to reduce the loss.
The Chain Rule Intuition
Think of the network as a chain of functions: input → layer 1 → layer 2 → ... → loss. The chain rule says: to find how a change in an early weight affects the final loss, multiply the "sensitivity" at each link in the chain.
Analogy: imagine a row of dominoes. Pushing the first one harder (changing a weight) affects how hard the last one falls (the loss). The chain rule computes exactly how much, by multiplying the transfer at each step.
The Key Equations
For a network with L layers, working backward from the output:
def backward_propagation(Y, cache, parameters):
"""
Y: true labels, shape (1, m)
cache: from forward_propagation
parameters: W, b for each layer
Returns: gradients dict
"""
m = Y.shape[1]
L = len(parameters) // 2
gradients = {}
# Output layer gradient (binary cross-entropy + sigmoid)
AL = cache[f'A{L}']
dZ = AL - Y # This elegant simplification comes from calculus
for l in range(L, 0, -1):
A_prev = cache[f'A{l-1}']
# Gradients for weights and biases
gradients[f'dW{l}'] = (1 / m) * (dZ @ A_prev.T)
gradients[f'db{l}'] = (1 / m) * np.sum(dZ, axis=1, keepdims=True)
if l > 1:
W = parameters[f'W{l}']
dA_prev = W.T @ dZ
# ReLU derivative: 1 if z > 0, else 0
Z_prev = cache[f'Z{l-1}']
dZ = dA_prev * (Z_prev > 0).astype(float)
return gradients
Full derivation of backprop for a 2-layer network
For binary cross-entropy loss: L = -(1/m) Σ [y log(a) + (1-y) log(1-a)]
Starting from the output (layer 2, sigmoid activation):
- dZ[2] = A[2] - Y — derivative of loss w.r.t. z[2]
- dW[2] = (1/m) dZ[2] · A[1]ᵀ — how loss changes w.r.t. W[2]
- db[2] = (1/m) Σ dZ[2] — average over examples
- dA[1] = W[2]ᵀ · dZ[2] — propagate gradient backward through the linear layer
- dZ[1] = dA[1] * g'(Z[1]) — chain rule through activation. For ReLU, g'(z) = 1 if z > 0, else 0
- dW[1] = (1/m) dZ[1] · Xᵀ
- db[1] = (1/m) Σ dZ[1]
Notice the pattern: at each layer, compute dZ, then dW, db, and dA for the previous layer. It's the same three equations repeated.
Activation Functions
Without activation functions, a deep network is just a stack of linear transformations — which collapses into one big linear transformation. Activations introduce non-linearity, giving the network the ability to learn curves, not just lines.
Sigmoid
σ(z) = 1 / (1 + e⁻ᶻ) — squashes to (0, 1).
Good for: output layer of binary classification (gives a probability). Bad for: hidden layers of deep networks. Why?
- Vanishing gradients: when z is very large or very small, σ'(z) ≈ 0. Gradients die as they propagate backward through many sigmoid layers.
- Not zero-centered: outputs are always positive, causing zig-zag updates in gradient descent.
Tanh
tanh(z) = (eᶻ - e⁻ᶻ) / (eᶻ + e⁻ᶻ) — squashes to (-1, 1).
Better than sigmoid for hidden layers because it's zero-centered. But still suffers from vanishing gradients at the extremes.
ReLU
ReLU(z) = max(0, z) — the "if positive, pass through; if negative, kill it" function.
This seemingly trivial function revolutionized deep learning:
- No vanishing gradient (for positive values): the derivative is just 1
- Computationally cheap: just a comparison, no exponentials
- Sparse activation: typically ~50% of neurons output zero, making the network naturally sparse and efficient
Leaky ReLU & Variants
Leaky ReLU(z) = max(αz, z) where α is small (e.g., 0.01). Negative inputs get a small gradient instead of zero — neurons can't die.
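As a one-line sketch (the `alpha` default follows the 0.01 mentioned above):

```python
import numpy as np

def leaky_relu(z, alpha=0.01):
    # Positives pass through unchanged; negatives are scaled by a small slope alpha
    return np.where(z > 0, z, alpha * z)

print(leaky_relu(np.array([-2.0, 3.0])))  # -2 -> -0.02, 3 -> 3.0
```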
Softmax
For multi-class classification: converts a vector of raw scores into probabilities that sum to 1.
softmax(zᵢ) = eᶻⁱ / Σⱼ eᶻʲ
def softmax(z):
# Subtract max for numerical stability (prevents overflow)
exp_z = np.exp(z - np.max(z, axis=0, keepdims=True))
return exp_z / np.sum(exp_z, axis=0, keepdims=True)
# Example: 3-class output
scores = np.array([[2.0], [1.0], [0.1]])
probs = softmax(scores)
print(probs.flatten()) # [0.659, 0.242, 0.099] — sums to 1.0
Comparison Table
| Activation | Range | Pros | Cons | Use For |
|---|---|---|---|---|
| Sigmoid | (0, 1) | Probability output | Vanishing gradient | Binary output layer |
| Tanh | (-1, 1) | Zero-centered | Vanishing gradient | RNN hidden states |
| ReLU | [0, ∞) | Fast, no vanishing grad | Dying neurons | Default for hidden layers |
| Leaky ReLU | (-∞, ∞) | No dying neurons | Extra hyperparameter α | When ReLU neurons die |
| GELU | (-0.17, ∞) | Smooth, used in transformers | Slower to compute | Transformer models |
| Softmax | (0, 1), sum=1 | Multi-class probabilities | Only for output layer | Multi-class output |
Weight Initialization
Initialization is like choosing your starting position on a mountain before hiking to the valley. A bad start means a longer hike or getting stuck on a plateau.
Why Not Zero?
If all weights are zero (or the same value), every neuron in a layer computes the exact same thing. Their gradients are identical, so they update identically. They stay identical forever. This is the symmetry problem — 100 neurons that all do the same thing is just 1 neuron.
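The symmetry problem can be demonstrated in a few lines — a sketch (the constant 0.5 and the input values are made up for illustration): initialize every weight to the same constant, run one forward/backward pass, and observe that every row of the gradient is identical, so the rows of W can never differentiate.

```python
import numpy as np

# Every weight set to the same constant -> all hidden neurons are clones
W1 = np.full((4, 3), 0.5); b1 = np.zeros((4, 1))
W2 = np.full((1, 4), 0.5); b2 = np.zeros((1, 1))
x = np.array([[1.0], [0.5], [-0.3]])
y = 1.0

z1 = W1 @ x + b1
a1 = np.maximum(0, z1)                      # all four activations are equal
a2 = 1 / (1 + np.exp(-(W2 @ a1 + b2)))

# One backprop step (binary cross-entropy + sigmoid)
dz2 = a2 - y
dz1 = (W2.T * dz2) * (z1 > 0)
dW1 = dz1 @ x.T

# Every row of dW1 is identical, so the rows of W1 update identically
print(np.allclose(dW1, dW1[0]))  # True
```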
Random Initialization
Random breaks symmetry, but the scale matters enormously:
- Too large: activations explode through layers. With sigmoid/tanh, outputs saturate at extremes where gradients are zero.
- Too small: activations shrink toward zero through layers. Gradients vanish.
Xavier / Glorot Initialization
For sigmoid and tanh activations. The key insight: to keep the variance of activations stable across layers, initialize with:
W ~ N(0, 1/n[l-1]) — or a uniform version with the same variance, W ~ Uniform(-√(3/n[l-1]), √(3/n[l-1]))
where n[l-1] is the number of inputs to the layer.
He Initialization
For ReLU activations. Since ReLU zeros out half the outputs, you need double the variance:
W ~ N(0, 2/n[l-1])
import numpy as np
def initialize_parameters(layer_dims):
"""
layer_dims: list like [784, 128, 64, 10]
Returns: dict with W1, b1, W2, b2, ...
"""
parameters = {}
for l in range(1, len(layer_dims)):
n_in = layer_dims[l - 1]
n_out = layer_dims[l]
# He initialization (for ReLU)
parameters[f'W{l}'] = np.random.randn(n_out, n_in) * np.sqrt(2 / n_in)
parameters[f'b{l}'] = np.zeros((n_out, 1)) # biases can be zero
return parameters
# Example: 784 → 128 → 64 → 10 network
params = initialize_parameters([784, 128, 64, 10])
print(f"W1 std: {params['W1'].std():.4f}") # ≈ sqrt(2/784) ≈ 0.0505
print(f"W2 std: {params['W2'].std():.4f}") # ≈ sqrt(2/128) ≈ 0.1250
Optimization
Gradient descent finds the bottom of the loss landscape. But how you descend matters. Plain gradient descent works but is slow — like walking straight downhill on every step without considering momentum or terrain.
Mini-Batch Gradient Descent
Three flavors:
| Method | Batch Size | Pros | Cons |
|---|---|---|---|
| Batch GD | Entire dataset (m) | Stable convergence, low noise | Slow per step, needs all data in memory |
| Stochastic GD | 1 example | Fast updates, can escape local minima | Very noisy, no vectorization benefit |
| Mini-batch GD | 32–512 examples | Best of both: fast + stable | Need to choose batch size |
In practice, everyone uses mini-batch. Common sizes: 32, 64, 128, 256. Powers of 2 are slightly faster on GPUs due to memory alignment.
Exponentially Weighted Moving Averages
Before understanding momentum and Adam, you need this building block. An exponentially weighted moving average (EWMA) smooths a noisy sequence:
v_t = β · v_{t-1} + (1 - β) · θ_t
β controls how much "memory" the average has. β = 0.9 roughly averages the last 10 values. β = 0.99 averages the last 100. Higher β = smoother but slower to react.
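A quick numerical sketch (the noisy signal here is synthetic) shows the smoothing effect of β = 0.9:

```python
import numpy as np

np.random.seed(0)
# A sine wave corrupted by noise, standing in for a noisy gradient sequence
noisy = np.sin(np.linspace(0, 3, 50)) + 0.3 * np.random.randn(50)

def ewma(values, beta=0.9):
    v, out = 0.0, []
    for theta in values:
        v = beta * v + (1 - beta) * theta   # the EWMA recurrence
        out.append(v)
    return np.array(out)

smooth = ewma(noisy, beta=0.9)
# Step-to-step jumps are much smaller after smoothing
print(np.std(np.diff(smooth)) < np.std(np.diff(noisy)))  # True
```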
Momentum
Standard gradient descent with no memory: each step depends only on the current gradient. This causes oscillations in dimensions with high curvature — the path zig-zags like a drunk person walking downhill.
Momentum adds "inertia" — a rolling ball analogy. The ball accumulates velocity in the consistent downhill direction and dampens the oscillating directions:
def gradient_descent_with_momentum(parameters, gradients, v, learning_rate, beta=0.9):
    """
    v: velocity (exponentially weighted average of gradients)
    """
    L = len(parameters) // 2  # number of layers
    for l in range(1, L + 1):
        # Update velocity: keep rolling + new gradient
        v[f'dW{l}'] = beta * v[f'dW{l}'] + (1 - beta) * gradients[f'dW{l}']
        v[f'db{l}'] = beta * v[f'db{l}'] + (1 - beta) * gradients[f'db{l}']
        # Update parameters using velocity (not raw gradient)
        parameters[f'W{l}'] -= learning_rate * v[f'dW{l}']
        parameters[f'b{l}'] -= learning_rate * v[f'db{l}']
    return parameters, v
RMSProp
Different idea: adapt the learning rate per parameter. Parameters with large gradients get smaller learning rates (to prevent overshooting), and parameters with small gradients get larger learning rates (to speed up).
def rmsprop(parameters, gradients, s, learning_rate, beta=0.9, epsilon=1e-8):
    """
    s: cache of squared gradients (exponentially weighted)
    """
    L = len(parameters) // 2  # number of layers
    for l in range(1, L + 1):
        # Accumulate squared gradients
        s[f'dW{l}'] = beta * s[f'dW{l}'] + (1 - beta) * gradients[f'dW{l}'] ** 2
        s[f'db{l}'] = beta * s[f'db{l}'] + (1 - beta) * gradients[f'db{l}'] ** 2
        # Update: divide by sqrt of accumulated squared gradient
        parameters[f'W{l}'] -= learning_rate * gradients[f'dW{l}'] / (np.sqrt(s[f'dW{l}']) + epsilon)
        parameters[f'b{l}'] -= learning_rate * gradients[f'db{l}'] / (np.sqrt(s[f'db{l}']) + epsilon)
    return parameters, s
Adam (Adaptive Moment Estimation)
Momentum + RMSProp combined. Maintains both a velocity (first moment, like momentum) and a cache of squared gradients (second moment, like RMSProp). It's the default optimizer for most deep learning.
def adam(parameters, gradients, v, s, t, learning_rate=0.001,
         beta1=0.9, beta2=0.999, epsilon=1e-8):
    """
    v: first moment (momentum)
    s: second moment (RMSProp)
    t: iteration count (for bias correction)
    """
    L = len(parameters) // 2  # number of layers
    for l in range(1, L + 1):
        # Momentum update
        v[f'dW{l}'] = beta1 * v[f'dW{l}'] + (1 - beta1) * gradients[f'dW{l}']
        v[f'db{l}'] = beta1 * v[f'db{l}'] + (1 - beta1) * gradients[f'db{l}']
        # RMSProp update
        s[f'dW{l}'] = beta2 * s[f'dW{l}'] + (1 - beta2) * gradients[f'dW{l}'] ** 2
        s[f'db{l}'] = beta2 * s[f'db{l}'] + (1 - beta2) * gradients[f'db{l}'] ** 2
        # Bias correction (important in early iterations)
        v_corrected_W = v[f'dW{l}'] / (1 - beta1 ** t)
        v_corrected_b = v[f'db{l}'] / (1 - beta1 ** t)
        s_corrected_W = s[f'dW{l}'] / (1 - beta2 ** t)
        s_corrected_b = s[f'db{l}'] / (1 - beta2 ** t)
        # Update parameters
        parameters[f'W{l}'] -= learning_rate * v_corrected_W / (np.sqrt(s_corrected_W) + epsilon)
        parameters[f'b{l}'] -= learning_rate * v_corrected_b / (np.sqrt(s_corrected_b) + epsilon)
    return parameters, v, s
Learning Rate Schedules
A fixed learning rate is often suboptimal. Start large (to make progress fast), then reduce (to fine-tune near the optimum).
| Schedule | How It Works | When to Use |
|---|---|---|
| Step decay | Reduce by factor every N epochs | Simple, predictable |
| Exponential decay | lr = lr₀ · e^(−kt) | Smooth decay |
| Cosine annealing | lr follows a cosine curve to near-zero | Popular for fine-tuning |
| Warmup + decay | Ramp up lr for first N steps, then decay | Transformers, large batches |
| One-cycle | Warmup to max, then cosine decay to near-zero | fast.ai's recommended default |
fastai's lr_find() runs a learning-rate range test to pick the maximum rate for one-cycle training — one of the most practical tricks in deep learning.
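A sketch of a few of these schedules as plain functions (parameter names and default values are illustrative, not from any particular library):

```python
import math

def step_decay(lr0, epoch, drop=0.5, every=10):
    # Multiply the rate by `drop` once every `every` epochs
    return lr0 * drop ** (epoch // every)

def exponential_decay(lr0, t, k=0.05):
    # Smooth exponential decay: lr0 * e^(-kt)
    return lr0 * math.exp(-k * t)

def cosine_annealing(lr0, t, T):
    # Follow a cosine curve from lr0 at t=0 down to 0 at t=T
    return 0.5 * lr0 * (1 + math.cos(math.pi * t / T))

print(round(step_decay(0.1, 25), 6))             # 0.025 (two drops applied)
print(round(cosine_annealing(0.1, 50, 100), 4))  # 0.05 — the halfway point
```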
Regularization for Deep Networks
Deep networks have millions of parameters — they can memorize anything. Regularization prevents this by constraining what the network can learn.
L2 Regularization (Weight Decay)
Same idea as in classical ML — add λ·||W||² to the loss. For neural networks, this is often called weight decay because the update rule effectively multiplies weights by (1 - α·λ/m) each step, "decaying" them toward zero.
# In the cost function, add L2 penalty
def compute_cost_with_l2(AL, Y, parameters, lambd):
m = Y.shape[1]
cross_entropy = -(1 / m) * np.sum(Y * np.log(AL) + (1 - Y) * np.log(1 - AL))
l2_penalty = 0
L = len(parameters) // 2
for l in range(1, L + 1):
l2_penalty += np.sum(parameters[f'W{l}'] ** 2)
l2_penalty *= lambd / (2 * m)
return cross_entropy + l2_penalty
# In backprop, add regularization gradient: dW += (lambd/m) * W
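The "decay" interpretation above can be checked numerically — a sketch with arbitrary made-up values:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 3))
dW = rng.standard_normal((4, 3))   # stand-in for the unregularized gradient
alpha, lambd, m = 0.1, 0.7, 50

# L2-regularized gradient step
W_reg = W - alpha * (dW + (lambd / m) * W)
# The same step written as "decay the weights, then take a plain gradient step"
W_decay = (1 - alpha * lambd / m) * W - alpha * dW
print(np.allclose(W_reg, W_decay))  # True
```

The two forms are algebraically identical, which is why "L2 regularization" and "weight decay" are used interchangeably for plain SGD.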
Dropout
During training, randomly "turn off" each neuron with probability p. The remaining neurons must learn to be useful on their own, without relying on specific co-activations with other neurons.
Analogy: a team where any member might be absent on any day. Each person must be capable independently — no one can rely on a specific colleague always being there.
Inverted Dropout (Ng's preferred form)
def forward_with_dropout(A_prev, keep_prob):
"""
keep_prob: probability of KEEPING a neuron (e.g., 0.8 means 20% dropout)
"""
# Generate random mask
D = np.random.rand(*A_prev.shape) < keep_prob # Boolean mask
# Apply mask: zero out dropped neurons
A = A_prev * D
# Scale up by 1/keep_prob so expected value stays the same
# This is the "inverted" part — no scaling needed at test time
A /= keep_prob
return A, D
# At test time: no dropout, no scaling — just use the network normally!
Batch Normalization
Normalize the activations within each mini-batch to have zero mean and unit variance, then let the network learn the optimal scale (γ) and shift (β).
Why it works (Ng's explanation): imagine the network is trying to learn a mapping from input to output. If the input distribution to a later layer keeps shifting as earlier layers update (called "internal covariate shift"), the later layer is chasing a moving target. Batch norm stabilizes each layer's input distribution, making training faster and more reliable.
def batch_norm(Z, gamma, beta, epsilon=1e-8):
"""
Z: (n, m) — pre-activation values for one layer, one mini-batch
gamma: (n, 1) — learned scale parameter
beta: (n, 1) — learned shift parameter (NOT the β from Adam!)
"""
# Step 1: compute mean and variance across the batch dimension
mu = np.mean(Z, axis=1, keepdims=True) # (n, 1)
var = np.var(Z, axis=1, keepdims=True) # (n, 1)
# Step 2: normalize
Z_norm = (Z - mu) / np.sqrt(var + epsilon)
# Step 3: scale and shift (learnable)
Z_tilde = gamma * Z_norm + beta
return Z_tilde, mu, var
Batch norm is inserted after the linear transform, before the activation: Z = WA + b → BatchNorm(Z) → ReLU. With batch norm, you can drop the bias term b since batch norm subtracts the mean anyway.
Early Stopping
Monitor validation loss during training. When it stops improving (starts going back up), stop training. Simple and effective, but Ng notes it "couples" the optimization objective (fitting training data) with the regularization objective (not overfitting), which can make debugging harder.
Data Augmentation
For images: random flips, rotations, crops, color jitter, scaling. You're saying to the model: "a cat flipped horizontally is still a cat." This effectively multiplies your training data without collecting new examples.
| Technique | When to Use | Note |
|---|---|---|
| L2 / Weight decay | Always (usually built into optimizer) | Typical: 1e-4 to 1e-2 |
| Dropout | Large fully-connected layers | Typical p: 0.2–0.5 |
| Batch norm | Almost always in CNNs | Also speeds up training |
| Early stopping | When you have a validation set | Easy to implement |
| Data augmentation | Images, audio, text | Free data! |
| Mixup | Image classification | Blend two images + labels |
Convolutional Neural Networks
A 256x256 color image has 65,536 pixels — 196,608 input values across 3 color channels. A fully connected first layer with 1000 neurons would need ~197 million weights — just for layer one. That's insane. And it ignores spatial structure: pixel (0,0) is treated the same as pixel (255,255).
CNNs solve this with two key ideas: local connectivity (each neuron only looks at a small region) and weight sharing (the same filter is applied across the entire image).
The Convolution Operation
A filter (or kernel) is a small matrix (e.g., 3x3) that slides across the image. At each position, it computes a dot product — "how much does this patch match my pattern?" The result is a feature map.
Think of it as a flashlight scanning a dark room. The flashlight (filter) illuminates a small area at a time. A "vertical edge detector" filter lights up when it finds a vertical edge. A "horizontal edge detector" finds horizontal edges. Stack many filters and you detect many features simultaneously.
import numpy as np
def conv2d(image, kernel, stride=1, padding=0):
"""Simple 2D convolution (single channel, single filter)"""
if padding > 0:
image = np.pad(image, padding, mode='constant')
h, w = image.shape
kh, kw = kernel.shape
out_h = (h - kh) // stride + 1
out_w = (w - kw) // stride + 1
output = np.zeros((out_h, out_w))
for i in range(out_h):
for j in range(out_w):
region = image[i*stride:i*stride+kh, j*stride:j*stride+kw]
output[i, j] = np.sum(region * kernel)
return output
# Vertical edge detector
image = np.array([
    [10, 10, 10, 0, 0, 0],
    [10, 10, 10, 0, 0, 0],
    [10, 10, 10, 0, 0, 0],
    [10, 10, 10, 0, 0, 0],
], dtype=float)
vertical_edge = np.array([
    [1, 0, -1],
    [1, 0, -1],
    [1, 0, -1],
], dtype=float)
edges = conv2d(image, vertical_edge)
print(edges)
# [[ 0. 30. 30.  0.]
#  [ 0. 30. 30.  0.]] — the filter responds only where the bright-to-dark edge sits
Key Concepts
- Stride: how many pixels the filter moves each step. Stride 2 halves the spatial dimensions.
- Padding: adding zeros around the border. "Same" padding preserves dimensions; "valid" (no padding) shrinks them.
- Receptive field: how much of the original input each neuron "sees." Deeper layers have larger receptive fields — they see more context.
- Number of parameters: a 3x3 filter with 64 input channels and 128 output filters has 3 × 3 × 64 × 128 + 128 = 73,856 weights. Compare to ~197M for a fully connected layer!
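Stride and padding combine into one output-size formula; a small helper (hypothetical, not from the text) makes it concrete:

```python
def conv_output_size(n, k, stride=1, padding=0):
    """Spatial output size of a conv layer: floor((n + 2p - k) / s) + 1"""
    return (n + 2 * padding - k) // stride + 1

# A 3x3 filter with padding 1 ("same") preserves the size; stride 2 halves it
print(conv_output_size(224, 3, stride=1, padding=1))  # 224
print(conv_output_size(224, 3, stride=2, padding=1))  # 112
```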
Pooling
Max pooling: take the maximum value in each region. Provides translation invariance ("the feature is somewhere in this area") and reduces spatial dimensions.
Average pooling: take the average. Often used in the final layer (global average pooling replaces the fully connected layer).
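A minimal max-pooling sketch in the same loop-based style as `conv2d` above (the example array is made up):

```python
import numpy as np

def max_pool2d(x, size=2, stride=2):
    """Max pooling over non-overlapping size x size windows (single channel)."""
    h, w = x.shape
    out_h = (h - size) // stride + 1
    out_w = (w - size) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Keep only the strongest response in each window
            out[i, j] = x[i*stride:i*stride+size, j*stride:j*stride+size].max()
    return out

x = np.array([[1, 3, 2, 1],
              [4, 6, 5, 0],
              [7, 2, 9, 8],
              [1, 0, 3, 4]], dtype=float)
print(max_pool2d(x))
# [[6. 5.]
#  [7. 9.]]
```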
CNN Architecture Pattern
The classic flow: Conv → ReLU → Pool → ... → Flatten → Dense → Output
As you go deeper: spatial dimensions shrink (via stride/pooling) while channel count grows (more filters). 224 × 224 × 3 → 112 × 112 × 64 → 56 × 56 × 128 → ... → 7 × 7 × 512 → flatten → dense → output.
Landmark Architectures
| Architecture | Year | Key Idea | Depth |
|---|---|---|---|
| LeNet-5 | 1998 | The pioneer — conv + pool + FC | 5 |
| AlexNet | 2012 | ReLU, dropout, GPU training | 8 |
| VGGNet | 2014 | "Just stack 3x3 convs" | 16-19 |
| GoogLeNet/Inception | 2014 | Parallel filters of different sizes | 22 |
| ResNet | 2015 | Skip connections | 50-152 |
| EfficientNet | 2019 | Compound scaling (width + depth + resolution) | varies |
ResNet: The Skip Connection Revolution
Before ResNet, making networks deeper actually hurt performance — not just from overfitting, but from optimization difficulty. A 56-layer network performed worse than a 20-layer one.
The fix is brilliantly simple: skip connections (residual connections). Instead of learning the full mapping H(x), learn the residual F(x) = H(x) - x. The output becomes:
a[l+2] = ReLU(z[l+2] + a[l])
Analogy: instead of building a bridge from scratch, start with the existing road (the identity, x) and learn what modifications to add. If the optimal modification is "nothing" (the identity mapping), the network just learns F(x) = 0, which is easy. Without skip connections, learning the identity through multiple non-linear layers is surprisingly hard.
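A sketch of that idea (the layer sizes and the two-linear-layer residual path are illustrative): if the residual path learns to output zeros, the whole block collapses to ReLU(a[l]) — the "do nothing" option is trivially reachable.

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def residual_block(a_l, W1, b1, W2, b2):
    """a[l] -> linear -> ReLU -> linear -> add a[l] (skip) -> ReLU"""
    z1 = W1 @ a_l + b1
    a1 = relu(z1)
    z2 = W2 @ a1 + b2
    return relu(z2 + a_l)   # the skip connection

n = 4
a_l = np.random.randn(n, 1)
# If the residual path is all zeros, the block is exactly ReLU(a[l])
W1 = np.zeros((n, n)); b1 = np.zeros((n, 1))
W2 = np.zeros((n, n)); b2 = np.zeros((n, 1))
print(np.allclose(residual_block(a_l, W1, b1, W2, b2), relu(a_l)))  # True
```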
1x1 Convolutions
A 1x1 convolution seems pointless — a filter that only looks at one pixel? But remember: it operates across channels. A 1x1 conv with 64 input channels and 32 output channels is a learned linear combination that reduces dimensionality. GoogLeNet uses this extensively to keep computation manageable.
Transfer Learning
The most important practical technique in modern deep learning. Instead of training from scratch:
- Take a model pre-trained on a large dataset (ImageNet, 14M images)
- Replace the final classification layer with one for your task
- Fine-tune on your smaller dataset
Why it works: early layers learn universal features (edges, textures, shapes) that transfer across tasks. A cat detector and a car detector both need edge detection. Only the final layers are task-specific.
Don't just fine-tune the last layer and call it done. Jeremy Howard's approach:
- Freeze all pre-trained layers. Train only the new head for 1-3 epochs.
- Unfreeze the last group of layers. Train with a lower learning rate for earlier layers than later layers (discriminative LRs — e.g., 1e-5 for early layers, 1e-3 for the head).
- Optionally unfreeze everything with even more aggressive discriminative rates.
The intuition: early layers have good universal features that need only gentle adjustment. Later layers need more aggressive learning to adapt to your task. This consistently outperforms the naive "freeze everything except the last layer" approach.
Sequence Models
Text, speech, time series, DNA sequences — data where order matters. "Dog bites man" and "Man bites dog" have the same words but very different meanings. Standard neural networks treat inputs as fixed-size, unordered vectors. Sequence models understand order.
Recurrent Neural Networks (RNNs)
An RNN processes one element at a time, maintaining a hidden state that serves as "memory" of what it's seen so far. At each time step:
h_t = tanh(W_hh · h_{t-1} + W_xh · x_t + b_h)
Think of it as reading a book word by word. After each word, you update your mental summary (hidden state) of the story so far. Your understanding of "bank" depends on whether the preceding words were about money or rivers.
import numpy as np
class SimpleRNN:
def __init__(self, input_size, hidden_size, output_size):
# Xavier initialization
scale_h = np.sqrt(1 / hidden_size)
scale_x = np.sqrt(1 / input_size)
self.Whh = np.random.randn(hidden_size, hidden_size) * scale_h
self.Wxh = np.random.randn(hidden_size, input_size) * scale_x
self.Why = np.random.randn(output_size, hidden_size) * scale_h
self.bh = np.zeros((hidden_size, 1))
self.by = np.zeros((output_size, 1))
def forward(self, inputs, h_prev):
"""
inputs: list of (input_size, 1) vectors, one per time step
h_prev: initial hidden state (hidden_size, 1)
"""
h = h_prev
hidden_states = []
for x_t in inputs:
h = np.tanh(self.Whh @ h + self.Wxh @ x_t + self.bh)
hidden_states.append(h)
# Output from final hidden state
y = self.Why @ h + self.by
return y, hidden_states
The Vanishing Gradient Problem
During backpropagation through time (BPTT), gradients are multiplied by Whh at each time step. Over a 100-step sequence, it's like multiplying a number by 0.9 a hundred times: 0.9¹⁰⁰ ≈ 0.0000265. The gradient effectively disappears, and the network can't learn long-range dependencies.
"The cat, which already ate a full meal and was lounging by the window watching the birds, was not hungry." An RNN struggles to connect "cat" to "was" across all those intervening words.
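The repeated-multiplication argument can be simulated directly. In this sketch the recurrent matrix is a made-up orthogonal matrix scaled by 0.9, so each backward step shrinks the gradient norm by exactly 0.9 (the tanh derivative, ignored here, would only shrink it further):

```python
import numpy as np

np.random.seed(1)
# Orthogonal matrix (from SVD) scaled by 0.9 -> largest singular value is 0.9
U = np.linalg.svd(np.random.randn(8, 8))[0]
Whh = 0.9 * U

grad = np.ones((8, 1))
for _ in range(100):
    grad = Whh.T @ grad          # one step of backprop through time
print(np.linalg.norm(grad))      # ≈ 0.9**100 * sqrt(8) ≈ 7.5e-5
```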
LSTM (Long Short-Term Memory)
LSTMs solve the vanishing gradient problem with a cell state — a "conveyor belt" that runs through the entire sequence. Information can flow along it unchanged, or be modified by three gates:
- Forget gate (f): "What should I forget from the cell state?" Sigmoid output — 0 means forget, 1 means keep.
- Input gate (i): "What new information should I add?" Controls which values to update.
- Output gate (o): "What should I output from the cell state?" Filters the cell state for the hidden state.
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def lstm_cell(x_t, h_prev, c_prev, params):
    """
    Single LSTM time step.
    x_t: (n_x, 1) input
    h_prev: (n_h, 1) previous hidden state
    c_prev: (n_h, 1) previous cell state
    """
    Wf, Wi, Wc, Wo = params['Wf'], params['Wi'], params['Wc'], params['Wo']
    bf, bi, bc, bo = params['bf'], params['bi'], params['bc'], params['bo']
    concat = np.vstack([h_prev, x_t])  # stack for single matrix multiply
    # Gates (all sigmoid → values between 0 and 1)
    f = sigmoid(Wf @ concat + bf)  # Forget gate: what to erase
    i = sigmoid(Wi @ concat + bi)  # Input gate: what to write
    o = sigmoid(Wo @ concat + bo)  # Output gate: what to expose
    # Candidate cell state
    c_candidate = np.tanh(Wc @ concat + bc)
    # Update cell state: forget some old + add some new
    c_next = f * c_prev + i * c_candidate
    # Hidden state: filtered cell state
    h_next = o * np.tanh(c_next)
    return h_next, c_next
The cell state acts as a highway: gradients can flow backward through it with minimal decay (the forget gate just multiplies by a value close to 1). This is why LSTMs can capture dependencies across hundreds of time steps.
GRU (Gated Recurrent Unit)
A simplified LSTM with only two gates (reset and update), combining the forget and input gates. Fewer parameters, often comparable performance:
- Update gate (z): how much of the previous state to keep (like forget + input combined)
- Reset gate (r): how much of the previous state to use when computing the new candidate
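A sketch of one GRU step, mirroring the LSTM cell above (the weight names, the stacked-[h; x] layout, and the random-weight usage example are assumptions for illustration; the update rule follows Ng's convention of blending the candidate with the previous state):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def gru_cell(x_t, h_prev, params):
    """Single GRU time step; weight matrices act on [h_prev; x_t] stacked."""
    concat = np.vstack([h_prev, x_t])
    z = sigmoid(params['Wz'] @ concat + params['bz'])   # update gate
    r = sigmoid(params['Wr'] @ concat + params['br'])   # reset gate
    # Candidate state is computed from the *reset* previous state
    h_candidate = np.tanh(params['Wh'] @ np.vstack([r * h_prev, x_t]) + params['bh'])
    # Blend old state and candidate according to the update gate
    return (1 - z) * h_prev + z * h_candidate

# Tiny usage example with random weights
n_x, n_h = 3, 4
rng = np.random.default_rng(0)
params = {k: rng.standard_normal((n_h, n_h + n_x)) * 0.1 for k in ('Wz', 'Wr', 'Wh')}
params.update({k: np.zeros((n_h, 1)) for k in ('bz', 'br', 'bh')})
h = gru_cell(rng.standard_normal((n_x, 1)), np.zeros((n_h, 1)), params)
print(h.shape)  # (4, 1)
```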
Bidirectional RNNs
A forward RNN only sees past context. But in "He said 'Teddy bears are great'", understanding "Teddy" requires seeing "bears" (future context). A bidirectional RNN runs two RNNs: one forward, one backward. At each time step, it concatenates both hidden states, giving access to both past and future context.
Practical Deep Learning
Andrew Ng's Course 3 (Structuring Machine Learning Projects) is entirely about practical advice. This section distills the most important lessons.
Train / Dev / Test Splits for Deep Learning
Classical ML used 60/20/20 splits. With modern datasets of millions of examples, you only need a small fraction for dev and test:
- Small data (<10K): 60% train / 20% dev / 20% test
- Medium data (10K-1M): 90% train / 5% dev / 5% test
- Large data (1M+): 98% train / 1% dev / 1% test — even 1% of 10M is 100K examples
The Deep Learning Recipe
Ng's systematic approach when your model isn't performing well:
- Does the model have high bias? (Training error is much worse than human-level)
- Yes → Bigger network, train longer, try different architecture
- Does the model have high variance? (Dev error is much worse than training error)
- Yes → More data, regularization, data augmentation, simpler architecture
- Both? Fix bias first (bigger model), then address variance
Error Analysis
When your model has 10% error, don't just throw more data or a bigger model at it. Manually examine 100 misclassified dev examples and categorize them:
| Error Category | Count | % of Errors |
|---|---|---|
| Blurry images | 32 | 32% |
| Mislabeled ground truth | 25 | 25% |
| Unusual angle/pose | 18 | 18% |
| Small object in frame | 15 | 15% |
| Other | 10 | 10% |
Now you know: fixing blurry images (denoising, better data) addresses 32% of errors. Fixing labels addresses 25%. This 30-minute exercise saves weeks of blind experimentation.
Transfer Learning: When It Works
Transfer from task A to task B works when:
- Task A has much more data than task B
- Low-level features from A are useful for B
- Both have the same input type (images, text, etc.)
Examples: ImageNet → medical imaging (YES), English NLP → French NLP (YES), image classification → speech recognition (NO — different input type).
Hyperparameter Tuning
Ng's priority ordering for hyperparameters:
- Learning rate (most important by far)
- Mini-batch size, number of hidden units
- Number of layers, learning rate decay
- Adam parameters β₁, β₂, ε (rarely need to change from defaults)
End-to-End Deep Learning
Should you replace a multi-stage pipeline (audio → phonemes → words → sentences) with one end-to-end model (audio → sentences)?
- Pros: lets the data speak, no hand-designed features, simpler system
- Cons: needs much more data, excludes potentially useful hand-designed knowledge
Ng's advice: use end-to-end when you have lots of data for the complete input-output mapping. When data is limited, a pipeline with hand-designed intermediate steps often works better.
Generative Models
Everything so far has been discriminative: given input, predict a label. Generative models learn to create new data that looks like the training data — new images, music, text.
Autoencoders
An encoder compresses the input into a small bottleneck vector (the latent representation), and a decoder reconstructs the original input from it. The bottleneck forces the network to learn a compact, meaningful representation.
Use cases: dimensionality reduction, denoising (train to reconstruct clean images from noisy ones), pre-training feature extractors.
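The whole architecture fits in a few lines of numpy. A forward-pass sketch with made-up layer sizes (8-dimensional input, 2-dimensional bottleneck), untrained random weights:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 1))            # 8-dim input

# Encoder: compress 8 -> 2 (the bottleneck)
W_enc = rng.standard_normal((2, 8)) * 0.5
z = np.tanh(W_enc @ x)                     # latent code, shape (2, 1)

# Decoder: reconstruct 2 -> 8
W_dec = rng.standard_normal((8, 2)) * 0.5
x_hat = W_dec @ z

# Training would minimize the reconstruction error:
loss = np.mean((x - x_hat) ** 2)
```

Because the bottleneck is far smaller than the input, the network can't simply copy; it must keep only what matters for reconstruction.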
Variational Autoencoders (VAEs)
Standard autoencoders map each input to a single arbitrary point in latent space, with no guarantee that the regions between those points decode to anything sensible. VAEs learn a smooth probability distribution instead: the encoder outputs a mean and variance for each input, and the decoder reconstructs from a sample. This means you can sample from the distribution and generate new, plausible examples. Nearby points in latent space produce similar outputs, giving smooth interpolation between a cat and a dog, or between a "5" and a "3".
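The sampling step uses the reparameterization trick: instead of sampling z directly (which isn't differentiable), sample noise and shift/scale it by the encoder's outputs, so gradients still flow to the mean and variance. A sketch with hypothetical encoder outputs for a 2-d latent space:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical encoder outputs for one input: a distribution, not a point
mu = np.array([[0.5], [-1.0]])        # mean of q(z|x)
log_var = np.array([[-1.0], [0.2]])   # log-variance of q(z|x)

# Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, I)
eps = rng.standard_normal(mu.shape)
z = mu + np.exp(0.5 * log_var) * eps  # z ~ N(mu, sigma^2), differentiable in mu, log_var

# z then feeds the decoder exactly as in a standard autoencoder.
```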
Generative Adversarial Networks (GANs)
The "counterfeiter vs detective" game:
- Generator (G): creates fake images from random noise. Its goal: fool the discriminator.
- Discriminator (D): receives real images and fakes, tries to tell them apart. Its goal: catch the fakes.
They train simultaneously. G gets better at creating fakes, D gets better at catching them, pushing G to produce increasingly realistic outputs. At equilibrium, D can't tell real from fake — G has learned to generate convincingly realistic data.
In practice, GAN training is notoriously unstable. Common failure modes:
- Mode collapse: G learns to produce only a few types of outputs, ignoring the diversity of real data
- Training oscillation: G and D "chase" each other without converging
- Vanishing gradients: if D becomes too strong, G gets no useful signal
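The game itself boils down to two binary cross-entropy losses. A sketch with made-up discriminator outputs (using the common non-saturating generator loss rather than the original minimax form, precisely to avoid the vanishing-gradient problem above):

```python
import numpy as np

def bce(p, label):
    """Binary cross-entropy for one predicted probability p and target label."""
    p = np.clip(p, 1e-7, 1 - 1e-7)   # avoid log(0)
    return -(label * np.log(p) + (1 - label) * np.log(1 - p))

# Hypothetical D outputs: probability the input is real
d_real = 0.9   # D's score on a real image
d_fake = 0.2   # D's score on G's fake image

# Discriminator wants d_real -> 1 and d_fake -> 0
d_loss = bce(d_real, 1.0) + bce(d_fake, 0.0)

# Generator (non-saturating form) wants d_fake -> 1
g_loss = bce(d_fake, 1.0)
```

Each training step alternates: update D on d_loss, then update G on g_loss, each holding the other's weights fixed.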
Diffusion Models
The latest breakthrough in image generation (DALL-E 2, Stable Diffusion, Midjourney). The idea:
- Forward process: gradually add Gaussian noise to a real image over many steps until it's pure noise.
- Reverse process: train a neural network to reverse each noise step — given a noisy image, predict the slightly-less-noisy version.
To generate a new image: start from pure noise and iteratively denoise. Each step is a small, tractable problem, and the results are stunning.
Diffusion models produce higher-quality, more diverse outputs than GANs, with more stable training. The trade-off: they're slower to generate (many denoising steps vs. one forward pass for GANs).
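A convenient property of the forward process: you can jump to any noise level t in closed form, without simulating all t steps. A sketch of this on a tiny "image", assuming a simple linear noise schedule (real systems tune the schedule carefully):

```python
import numpy as np

rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 4))        # a tiny stand-in "image"

# Linear noise schedule over T steps (hypothetical schedule)
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)     # cumulative fraction of signal kept

# Closed-form jump to noise level t: x_t = sqrt(a_bar)*x0 + sqrt(1-a_bar)*eps
t = 500
eps = rng.standard_normal(x0.shape)
x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1 - alpha_bar[t]) * eps

# The reverse network is trained to predict eps given (x_t, t).
```

As t grows, alpha_bar shrinks toward 0 and x_t approaches pure noise, which is exactly where generation starts.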
What's Next
Deep learning is the foundation. Here's where to go from here:
Transformers & LLMs
The attention mechanism allows models to look at all positions in a sequence simultaneously, solving the long-range dependency problem that plagued RNNs. Transformers are the backbone of GPT, BERT, and all modern language models. See the LLMs refresher.
Framework Mastery
Everything on this page was numpy from scratch — great for understanding, but you'll use frameworks for real work. PyTorch provides automatic differentiation, GPU acceleration, and pre-built components. See the PyTorch refresher.
Reinforcement Learning
Learning from rewards instead of labels. An agent takes actions in an environment, receives rewards, and learns a policy to maximize cumulative reward. Key concepts:
- Q-learning: learn the value of each (state, action) pair
- Policy gradients: directly optimize the policy (mapping from states to actions)
- Actor-critic: combine value estimation with policy optimization
Applications: game playing (AlphaGo), robotics, RLHF for language models.
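To make the first concept concrete: a single tabular Q-learning update fits in a few lines. A sketch on a made-up 3-state, 2-action problem, with one observed transition:

```python
import numpy as np

# Tabular Q-values: 3 states x 2 actions (hypothetical tiny MDP)
Q = np.zeros((3, 2))
alpha, gamma = 0.1, 0.99   # learning rate, discount factor

# One observed transition: in state 0, took action 1, got reward 1.0, landed in state 2
s, a, r, s_next = 0, 1, 1.0, 2

# Q-learning update: move Q[s, a] toward r + gamma * max_a' Q[s', a']
td_target = r + gamma * Q[s_next].max()
Q[s, a] += alpha * (td_target - Q[s, a])
```

Repeated over many transitions, Q converges toward the value of each (state, action) pair; the learned policy is then "pick the action with the highest Q in the current state".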
Current Frontiers
- Self-supervised learning: learn from unlabeled data by predicting parts of the input (masked language modeling, contrastive learning). This is how foundation models are trained.
- Multimodal models: jointly understanding images, text, audio, and video (CLIP, GPT-4V, Gemini).
- Efficient architectures: making models smaller and faster without losing quality (distillation, quantization, pruning, mixture of experts).
- AI agents: combining LLMs with tool use, memory, and planning for autonomous task completion.
Quick Reference: Deep Learning Decision Guide
| Task | Architecture | Start With |
|---|---|---|
| Image classification | CNN (ResNet, EfficientNet) | Pre-trained + fine-tune |
| Object detection | YOLO, Faster R-CNN | Pre-trained backbone |
| Text classification | Transformer (BERT) | Pre-trained + fine-tune |
| Text generation | Autoregressive LM (GPT) | API or fine-tuned model |
| Time series | LSTM / Temporal CNN | LSTM baseline |
| Tabular data | Gradient boosted trees first! | XGBoost / LightGBM |
| Image generation | Diffusion model | Stable Diffusion |
| Speech recognition | Transformer (Whisper) | Pre-trained Whisper |