Deep Learning Refresher
Neural networks from scratch to CNNs, RNNs, and beyond — the Andrew Ng way
From Logistic Regression to Neural Networks
If you understand logistic regression, you already understand a single neuron. Logistic regression takes inputs, multiplies by weights, adds a bias, and passes through a sigmoid to output a probability. A neuron does exactly that.
So why do we need more? Because a single neuron can only draw a straight line (or hyperplane) to separate classes. Consider the XOR problem: points at (0,0) and (1,1) are class 0, points at (0,1) and (1,0) are class 1. No single straight line can separate them. But stack two neurons together and it's trivial.
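The XOR claim is easy to check numerically. Here is a minimal sketch with hand-picked weights (the step activation and these specific weights are illustrative choices, not from the text): one hidden neuron computes OR, another computes AND, and the output neuron subtracts them.

```python
import numpy as np

def step(z):
    # Hard threshold, just for illustration (a trained network would use sigmoid/ReLU)
    return (z > 0).astype(float)

# The four XOR inputs as columns: (0,0), (0,1), (1,0), (1,1)
X = np.array([[0, 0, 1, 1],
              [0, 1, 0, 1]], dtype=float)

# Hidden layer: neuron 1 fires on OR(x1, x2), neuron 2 fires on AND(x1, x2)
W1 = np.array([[1.0, 1.0],
               [1.0, 1.0]])
b1 = np.array([[-0.5], [-1.5]])
# Output neuron: OR minus AND = XOR
W2 = np.array([[1.0, -1.0]])
b2 = np.array([[-0.5]])

H = step(W1 @ X + b1)
Y = step(W2 @ H + b2)
print(Y.flatten())  # [0. 1. 1. 0.]
```

No single line through the 2D input space produces that output pattern, but two stacked neurons do it with room to spare.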
Network Architecture
Think of a neural network as an assembly line. Raw materials (pixels, words, numbers) enter at one end. Each station (layer) transforms them, extracting increasingly abstract features. At the other end, out comes a prediction.
- Input layer: your raw features. For a 64x64 image, that's 12,288 numbers (64 × 64 × 3 color channels).
- Hidden layers: where the learning happens. Layer 1 might detect edges, layer 2 combines edges into shapes, layer 3 recognizes objects.
- Output layer: the final answer. One sigmoid neuron for binary classification, or a softmax layer for multi-class.
Notation
Following Ng's convention (superscript in brackets means layer):
| Symbol | Meaning | Shape |
|---|---|---|
| W[l] | Weight matrix for layer l | (n[l], n[l-1]) |
| b[l] | Bias vector for layer l | (n[l], 1) |
| z[l] | Linear output: W[l]a[l-1] + b[l] | (n[l], 1) |
| a[l] | Activation output: g(z[l]) | (n[l], 1) |
| a[0] | Input features (same as x) | (n[0], 1) |
Forward Propagation
Forward prop is "run the network": push inputs through each layer, left to right, until you get a prediction. Let's trace through a concrete example.
Step by Step
A 2-layer network: 3 inputs → 4 hidden neurons (ReLU) → 1 output (sigmoid).
import numpy as np
# --- Layer by layer ---
# Input: x = [1.0, 0.5, -0.3]
x = np.array([[1.0], [0.5], [-0.3]]) # shape (3, 1)
# Layer 1: Linear + ReLU
W1 = np.array([[ 0.2, -0.1, 0.4],
[ 0.5, 0.3, -0.2],
[-0.3, 0.1, 0.6],
[ 0.1, 0.4, 0.2]]) # shape (4, 3)
b1 = np.array([[0.1], [-0.1], [0.0], [0.2]]) # shape (4, 1)
z1 = W1 @ x + b1 # Linear: z = Wx + b
# z1 = [[0.13], [0.61], [-0.43], [0.44]]
a1 = np.maximum(0, z1) # ReLU: max(0, z)
# a1 = [[0.13], [0.61], [0.00], [0.44]] (negative value clipped to 0)
# Layer 2: Linear + Sigmoid
W2 = np.array([[0.3, -0.2, 0.5, 0.1]]) # shape (1, 4)
b2 = np.array([[-0.1]]) # shape (1, 1)
z2 = W2 @ a1 + b2
# z2 = 0.3*0.13 + (-0.2)*0.61 + 0.5*0.00 + 0.1*0.44 + (-0.1) = -0.139
a2 = 1 / (1 + np.exp(-z2)) # Sigmoid
# a2 ≈ 0.465 → prediction: probability of class 1
print(f"z1 = {z1.flatten()}")
print(f"a1 = {a1.flatten()}")
print(f"z2 = {z2.flatten()}")
print(f"a2 = {a2.item():.4f}") # ≈ 0.4653
Vectorization Across Examples
In practice you don't process one example at a time. Stack all m examples into a matrix X of shape (n, m), and the same code processes them all simultaneously:
def forward_propagation(X, parameters):
"""
X: shape (n_features, m_examples)
parameters: dict with W1, b1, W2, b2, ...
Returns: predictions and cache for backprop
"""
cache = {'A0': X}
A = X
L = len(parameters) // 2 # number of layers
for l in range(1, L + 1):
W = parameters[f'W{l}']
b = parameters[f'b{l}']
Z = W @ A + b # broadcasting handles (n, 1) + (n, m)
if l == L:
A = 1 / (1 + np.exp(-Z)) # sigmoid for output layer
else:
A = np.maximum(0, Z) # ReLU for hidden layers
cache[f'Z{l}'] = Z
cache[f'A{l}'] = A
return A, cache
@ uses optimized BLAS routines. Always think in matrices: stack examples as columns, and one matrix multiply processes the entire batch.
Backpropagation
Backprop answers: "if I wiggle this weight slightly, how much does the loss change?" That's the gradient — and once you have it, you can nudge weights to reduce the loss.
The Chain Rule Intuition
Think of the network as a chain of functions: input → layer 1 → layer 2 → ... → loss. The chain rule says: to find how a change in an early weight affects the final loss, multiply the "sensitivity" at each link in the chain.
Analogy: imagine a row of dominoes. Pushing the first one harder (changing a weight) affects how hard the last one falls (the loss). The chain rule computes exactly how much, by multiplying the transfer at each step.
The Key Equations
For a network with L layers, working backward from the output:
def backward_propagation(Y, cache, parameters):
"""
Y: true labels, shape (1, m)
cache: from forward_propagation
parameters: W, b for each layer
Returns: gradients dict
"""
m = Y.shape[1]
L = len(parameters) // 2
gradients = {}
# Output layer gradient (binary cross-entropy + sigmoid)
AL = cache[f'A{L}']
dZ = AL - Y # This elegant simplification comes from calculus
for l in range(L, 0, -1):
A_prev = cache[f'A{l-1}']
# Gradients for weights and biases
gradients[f'dW{l}'] = (1 / m) * (dZ @ A_prev.T)
gradients[f'db{l}'] = (1 / m) * np.sum(dZ, axis=1, keepdims=True)
if l > 1:
W = parameters[f'W{l}']
dA_prev = W.T @ dZ
# ReLU derivative: 1 if z > 0, else 0
Z_prev = cache[f'Z{l-1}']
dZ = dA_prev * (Z_prev > 0).astype(float)
return gradients
Full derivation of backprop for a 2-layer network
For binary cross-entropy loss: L = -(1/m) Σ [y log(a) + (1-y) log(1-a)]
Starting from the output (layer 2, sigmoid activation):
- dZ[2] = A[2] - Y — derivative of loss w.r.t. z[2]
- dW[2] = (1/m) dZ[2] · A[1]ᵀ — how loss changes w.r.t. W[2]
- db[2] = (1/m) Σ dZ[2] — average over examples
- dA[1] = W[2]ᵀ · dZ[2] — propagate gradient backward through the linear layer
- dZ[1] = dA[1] * g'(Z[1]) — chain rule through activation. For ReLU, g'(z) = 1 if z > 0, else 0
- dW[1] = (1/m) dZ[1] · Xᵀ
- db[1] = (1/m) Σ dZ[1]
Notice the pattern: at each layer, compute dZ, then dW, db, and dA for the previous layer. It's the same three equations repeated.
Activation Functions
Without activation functions, a deep network is just a stack of linear transformations — which collapses into one big linear transformation. Activations introduce non-linearity, giving the network the ability to learn curves, not just lines.
Sigmoid
σ(z) = 1 / (1 + e⁻ᶻ) — squashes to (0, 1).
Good for: output layer of binary classification (gives a probability). Bad for: hidden layers of deep networks. Why?
- Vanishing gradients: when z is very large or very small, σ'(z) ≈ 0. Gradients die as they propagate backward through many sigmoid layers.
- Not zero-centered: outputs are always positive, causing zig-zag updates in gradient descent.
Tanh
tanh(z) = (eᶻ - e⁻ᶻ) / (eᶻ + e⁻ᶻ) — squashes to (-1, 1).
Better than sigmoid for hidden layers because it's zero-centered. But still suffers from vanishing gradients at the extremes.
ReLU
ReLU(z) = max(0, z) — the "if positive, pass through; if negative, kill it" function.
This seemingly trivial function revolutionized deep learning:
- No vanishing gradient (for positive values): the derivative is just 1
- Computationally cheap: just a comparison, no exponentials
- Sparse activation: typically ~50% of neurons output zero, making the network naturally sparse and efficient
Leaky ReLU & Variants
Leaky ReLU(z) = max(αz, z) where α is small (e.g., 0.01). Negative inputs get a small gradient instead of zero — neurons can't die.
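As a one-line sketch (the `alpha` default follows the 0.01 mentioned above):

```python
import numpy as np

def leaky_relu(z, alpha=0.01):
    # Positives pass through unchanged; negatives are scaled by a small slope alpha
    return np.where(z > 0, z, alpha * z)

print(leaky_relu(np.array([-2.0, 3.0])))  # -2 -> -0.02, 3 -> 3.0
```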
Softmax
For multi-class classification: converts a vector of raw scores into probabilities that sum to 1.
softmax(zᵢ) = eᶻⁱ / Σⱼ eᶻʲ
def softmax(z):
# Subtract max for numerical stability (prevents overflow)
exp_z = np.exp(z - np.max(z, axis=0, keepdims=True))
return exp_z / np.sum(exp_z, axis=0, keepdims=True)
# Example: 3-class output
scores = np.array([[2.0], [1.0], [0.1]])
probs = softmax(scores)
print(probs.flatten()) # [0.659, 0.242, 0.099] — sums to 1.0
Comparison Table
| Activation | Range | Pros | Cons | Use For |
|---|---|---|---|---|
| Sigmoid | (0, 1) | Probability output | Vanishing gradient | Binary output layer |
| Tanh | (-1, 1) | Zero-centered | Vanishing gradient | RNN hidden states |
| ReLU | [0, ∞) | Fast, no vanishing grad | Dying neurons | Default for hidden layers |
| Leaky ReLU | (-∞, ∞) | No dying neurons | Extra hyperparameter α | When ReLU neurons die |
| GELU | (-0.17, ∞) | Smooth, used in transformers | Slower to compute | Transformer models |
| Softmax | (0, 1), sum=1 | Multi-class probabilities | Only for output layer | Multi-class output |
Weight Initialization
Initialization is like choosing your starting position on a mountain before hiking to the valley. A bad start means a longer hike or getting stuck on a plateau.
Why Not Zero?
If all weights are zero (or the same value), every neuron in a layer computes the exact same thing. Their gradients are identical, so they update identically. They stay identical forever. This is the symmetry problem — 100 neurons that all do the same thing is just 1 neuron.
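The symmetry problem can be demonstrated in a few lines — a sketch (the constant 0.5 and the input values are made up for illustration): initialize every weight to the same constant, run one forward/backward pass, and observe that every row of the gradient is identical, so the rows of W can never differentiate.

```python
import numpy as np

# Every weight set to the same constant -> all hidden neurons are clones
W1 = np.full((4, 3), 0.5); b1 = np.zeros((4, 1))
W2 = np.full((1, 4), 0.5); b2 = np.zeros((1, 1))
x = np.array([[1.0], [0.5], [-0.3]])
y = 1.0

z1 = W1 @ x + b1
a1 = np.maximum(0, z1)                      # all four activations are equal
a2 = 1 / (1 + np.exp(-(W2 @ a1 + b2)))

# One backprop step (binary cross-entropy + sigmoid)
dz2 = a2 - y
dz1 = (W2.T * dz2) * (z1 > 0)
dW1 = dz1 @ x.T

# Every row of dW1 is identical, so the rows of W1 update identically
print(np.allclose(dW1, dW1[0]))  # True
```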
Random Initialization
Random breaks symmetry, but the scale matters enormously:
- Too large: activations explode through layers. With sigmoid/tanh, outputs saturate at extremes where gradients are zero.
- Too small: activations shrink toward zero through layers. Gradients vanish.
Xavier / Glorot Initialization
For sigmoid and tanh activations. The key insight: to keep the variance of activations stable across layers, initialize with:
W ~ N(0, 1/n[l-1]) — or a uniform version with the same variance, W ~ Uniform(-√(3/n[l-1]), √(3/n[l-1]))
where n[l-1] is the number of inputs to the layer.
He Initialization
For ReLU activations. Since ReLU zeros out half the outputs, you need double the variance:
W ~ N(0, 2/n[l-1])
import numpy as np
def initialize_parameters(layer_dims):
"""
layer_dims: list like [784, 128, 64, 10]
Returns: dict with W1, b1, W2, b2, ...
"""
parameters = {}
for l in range(1, len(layer_dims)):
n_in = layer_dims[l - 1]
n_out = layer_dims[l]
# He initialization (for ReLU)
parameters[f'W{l}'] = np.random.randn(n_out, n_in) * np.sqrt(2 / n_in)
parameters[f'b{l}'] = np.zeros((n_out, 1)) # biases can be zero
return parameters
# Example: 784 → 128 → 64 → 10 network
params = initialize_parameters([784, 128, 64, 10])
print(f"W1 std: {params['W1'].std():.4f}") # ≈ sqrt(2/784) ≈ 0.0505
print(f"W2 std: {params['W2'].std():.4f}") # ≈ sqrt(2/128) ≈ 0.1250
Optimization
Gradient descent finds the bottom of the loss landscape. But how you descend matters. Plain gradient descent works but is slow — like walking straight downhill on every step without considering momentum or terrain.
Mini-Batch Gradient Descent
Three flavors:
| Method | Batch Size | Pros | Cons |
|---|---|---|---|
| Batch GD | Entire dataset (m) | Stable convergence, low noise | Slow per step, needs all data in memory |
| Stochastic GD | 1 example | Fast updates, can escape local minima | Very noisy, no vectorization benefit |
| Mini-batch GD | 32–512 examples | Best of both: fast + stable | Need to choose batch size |
In practice, everyone uses mini-batch. Common sizes: 32, 64, 128, 256. Powers of 2 are slightly faster on GPUs due to memory alignment.
Exponentially Weighted Moving Averages
Before understanding momentum and Adam, you need this building block. An exponentially weighted moving average (EWMA) smooths a noisy sequence:
v_t = β · v_{t-1} + (1 - β) · θ_t
β controls how much "memory" the average has. β = 0.9 roughly averages the last 10 values. β = 0.99 averages the last 100. Higher β = smoother but slower to react.
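A quick numerical sketch (the noisy signal here is synthetic) shows the smoothing effect of β = 0.9:

```python
import numpy as np

np.random.seed(0)
# A sine wave corrupted by noise, standing in for a noisy gradient sequence
noisy = np.sin(np.linspace(0, 3, 50)) + 0.3 * np.random.randn(50)

def ewma(values, beta=0.9):
    v, out = 0.0, []
    for theta in values:
        v = beta * v + (1 - beta) * theta   # the EWMA recurrence
        out.append(v)
    return np.array(out)

smooth = ewma(noisy, beta=0.9)
# Step-to-step jumps are much smaller after smoothing
print(np.std(np.diff(smooth)) < np.std(np.diff(noisy)))  # True
```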
Momentum
Standard gradient descent with no memory: each step depends only on the current gradient. This causes oscillations in dimensions with high curvature — the path zig-zags like a drunk person walking downhill.
Momentum adds "inertia" — a rolling ball analogy. The ball accumulates velocity in the consistent downhill direction and dampens the oscillating directions:
def gradient_descent_with_momentum(parameters, gradients, v, learning_rate, beta=0.9):
    """
    v: velocity (exponentially weighted average of gradients)
    """
    L = len(parameters) // 2  # number of layers
    for l in range(1, L + 1):
        # Update velocity: keep rolling + new gradient
        v[f'dW{l}'] = beta * v[f'dW{l}'] + (1 - beta) * gradients[f'dW{l}']
        v[f'db{l}'] = beta * v[f'db{l}'] + (1 - beta) * gradients[f'db{l}']
        # Update parameters using velocity (not raw gradient)
        parameters[f'W{l}'] -= learning_rate * v[f'dW{l}']
        parameters[f'b{l}'] -= learning_rate * v[f'db{l}']
    return parameters, v
RMSProp
Different idea: adapt the learning rate per parameter. Parameters with large gradients get smaller learning rates (to prevent overshooting), and parameters with small gradients get larger learning rates (to speed up).
def rmsprop(parameters, gradients, s, learning_rate, beta=0.9, epsilon=1e-8):
    """
    s: cache of squared gradients (exponentially weighted)
    """
    L = len(parameters) // 2  # number of layers
    for l in range(1, L + 1):
        # Accumulate squared gradients
        s[f'dW{l}'] = beta * s[f'dW{l}'] + (1 - beta) * gradients[f'dW{l}'] ** 2
        s[f'db{l}'] = beta * s[f'db{l}'] + (1 - beta) * gradients[f'db{l}'] ** 2
        # Update: divide by sqrt of accumulated squared gradient
        parameters[f'W{l}'] -= learning_rate * gradients[f'dW{l}'] / (np.sqrt(s[f'dW{l}']) + epsilon)
        parameters[f'b{l}'] -= learning_rate * gradients[f'db{l}'] / (np.sqrt(s[f'db{l}']) + epsilon)
    return parameters, s
Adam (Adaptive Moment Estimation)
Momentum + RMSProp combined. Maintains both a velocity (first moment, like momentum) and a cache of squared gradients (second moment, like RMSProp). It's the default optimizer for most deep learning.
def adam(parameters, gradients, v, s, t, learning_rate=0.001,
         beta1=0.9, beta2=0.999, epsilon=1e-8):
    """
    v: first moment (momentum)
    s: second moment (RMSProp)
    t: iteration count (for bias correction)
    """
    L = len(parameters) // 2  # number of layers
    for l in range(1, L + 1):
        # Momentum update
        v[f'dW{l}'] = beta1 * v[f'dW{l}'] + (1 - beta1) * gradients[f'dW{l}']
        v[f'db{l}'] = beta1 * v[f'db{l}'] + (1 - beta1) * gradients[f'db{l}']
        # RMSProp update
        s[f'dW{l}'] = beta2 * s[f'dW{l}'] + (1 - beta2) * gradients[f'dW{l}'] ** 2
        s[f'db{l}'] = beta2 * s[f'db{l}'] + (1 - beta2) * gradients[f'db{l}'] ** 2
        # Bias correction (important in early iterations)
        v_corrected_W = v[f'dW{l}'] / (1 - beta1 ** t)
        v_corrected_b = v[f'db{l}'] / (1 - beta1 ** t)
        s_corrected_W = s[f'dW{l}'] / (1 - beta2 ** t)
        s_corrected_b = s[f'db{l}'] / (1 - beta2 ** t)
        # Update parameters
        parameters[f'W{l}'] -= learning_rate * v_corrected_W / (np.sqrt(s_corrected_W) + epsilon)
        parameters[f'b{l}'] -= learning_rate * v_corrected_b / (np.sqrt(s_corrected_b) + epsilon)
    return parameters, v, s
Learning Rate Schedules
A fixed learning rate is often suboptimal. Start large (to make progress fast), then reduce (to fine-tune near the optimum).
| Schedule | How It Works | When to Use |
|---|---|---|
| Step decay | Reduce by factor every N epochs | Simple, predictable |
| Exponential decay | lr = lr₀ · e^(−kt) | Smooth decay |
| Cosine annealing | lr follows a cosine curve to near-zero | Popular for fine-tuning |
| Warmup + decay | Ramp up lr for first N steps, then decay | Transformers, large batches |
| One-cycle | Warmup to max, then cosine decay to near-zero | fast.ai's recommended default |
fastai's lr_find() runs a learning-rate range test to pick the maximum rate for one-cycle training — one of the most practical tricks in deep learning.
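A sketch of a few of these schedules as plain functions (parameter names and default values are illustrative, not from any particular library):

```python
import math

def step_decay(lr0, epoch, drop=0.5, every=10):
    # Multiply the rate by `drop` once every `every` epochs
    return lr0 * drop ** (epoch // every)

def exponential_decay(lr0, t, k=0.05):
    # Smooth exponential decay: lr0 * e^(-kt)
    return lr0 * math.exp(-k * t)

def cosine_annealing(lr0, t, T):
    # Follow a cosine curve from lr0 at t=0 down to 0 at t=T
    return 0.5 * lr0 * (1 + math.cos(math.pi * t / T))

print(round(step_decay(0.1, 25), 6))             # 0.025 (two drops applied)
print(round(cosine_annealing(0.1, 50, 100), 4))  # 0.05 — the halfway point
```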
Regularization for Deep Networks
Deep networks have millions of parameters — they can memorize anything. Regularization prevents this by constraining what the network can learn.
L2 Regularization (Weight Decay)
Same idea as in classical ML — add λ·||W||² to the loss. For neural networks, this is often called weight decay because the update rule effectively multiplies weights by (1 - α·λ/m) each step, "decaying" them toward zero.
# In the cost function, add L2 penalty
def compute_cost_with_l2(AL, Y, parameters, lambd):
m = Y.shape[1]
cross_entropy = -(1 / m) * np.sum(Y * np.log(AL) + (1 - Y) * np.log(1 - AL))
l2_penalty = 0
L = len(parameters) // 2
for l in range(1, L + 1):
l2_penalty += np.sum(parameters[f'W{l}'] ** 2)
l2_penalty *= lambd / (2 * m)
return cross_entropy + l2_penalty
# In backprop, add regularization gradient: dW += (lambd/m) * W
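The "decay" interpretation above can be checked numerically — a sketch with arbitrary made-up values:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 3))
dW = rng.standard_normal((4, 3))   # stand-in for the unregularized gradient
alpha, lambd, m = 0.1, 0.7, 50

# L2-regularized gradient step
W_reg = W - alpha * (dW + (lambd / m) * W)
# The same step written as "decay the weights, then take a plain gradient step"
W_decay = (1 - alpha * lambd / m) * W - alpha * dW
print(np.allclose(W_reg, W_decay))  # True
```

The two forms are algebraically identical, which is why "L2 regularization" and "weight decay" are used interchangeably for plain SGD.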
Dropout
During training, randomly "turn off" each neuron with probability p. The remaining neurons must learn to be useful on their own, without relying on specific co-activations with other neurons.
Analogy: a team where any member might be absent on any day. Each person must be capable independently — no one can rely on a specific colleague always being there.
Inverted Dropout (Ng's preferred form)
def forward_with_dropout(A_prev, keep_prob):
"""
keep_prob: probability of KEEPING a neuron (e.g., 0.8 means 20% dropout)
"""
# Generate random mask
D = np.random.rand(*A_prev.shape) < keep_prob # Boolean mask
# Apply mask: zero out dropped neurons
A = A_prev * D
# Scale up by 1/keep_prob so expected value stays the same
# This is the "inverted" part — no scaling needed at test time
A /= keep_prob
return A, D
# At test time: no dropout, no scaling — just use the network normally!
Batch Normalization
Normalize the activations within each mini-batch to have zero mean and unit variance, then let the network learn the optimal scale (γ) and shift (β).
Why it works (Ng's explanation): imagine the network is trying to learn a mapping from input to output. If the input distribution to a later layer keeps shifting as earlier layers update (called "internal covariate shift"), the later layer is chasing a moving target. Batch norm stabilizes each layer's input distribution, making training faster and more reliable.
def batch_norm(Z, gamma, beta, epsilon=1e-8):
"""
Z: (n, m) — pre-activation values for one layer, one mini-batch
gamma: (n, 1) — learned scale parameter
beta: (n, 1) — learned shift parameter (NOT the β from Adam!)
"""
# Step 1: compute mean and variance across the batch dimension
mu = np.mean(Z, axis=1, keepdims=True) # (n, 1)
var = np.var(Z, axis=1, keepdims=True) # (n, 1)
# Step 2: normalize
Z_norm = (Z - mu) / np.sqrt(var + epsilon)
# Step 3: scale and shift (learnable)
Z_tilde = gamma * Z_norm + beta
return Z_tilde, mu, var
Batch norm is inserted after the linear transform, before the activation: Z = WA + b → BatchNorm(Z) → ReLU. With batch norm, you can drop the bias term b since batch norm subtracts the mean anyway.
Early Stopping
Monitor validation loss during training. When it stops improving (starts going back up), stop training. Simple and effective, but Ng notes it "couples" the optimization objective (fitting training data) with the regularization objective (not overfitting), which can make debugging harder.
Data Augmentation
For images: random flips, rotations, crops, color jitter, scaling. You're saying to the model: "a cat flipped horizontally is still a cat." This effectively multiplies your training data without collecting new examples.
| Technique | When to Use | Note |
|---|---|---|
| L2 / Weight decay | Always (usually built into optimizer) | Typical: 1e-4 to 1e-2 |
| Dropout | Large fully-connected layers | Typical p: 0.2–0.5 |
| Batch norm | Almost always in CNNs | Also speeds up training |
| Early stopping | When you have a validation set | Easy to implement |
| Data augmentation | Images, audio, text | Free data! |
| Mixup | Image classification | Blend two images + labels |
Convolutional Neural Networks
A 256x256 color image has 65,536 pixels — 196,608 input values across 3 color channels. A fully connected first layer with 1000 neurons would need ~197 million weights — just for layer one. That's insane. And it ignores spatial structure: pixel (0,0) is treated the same as pixel (255,255).
CNNs solve this with two key ideas: local connectivity (each neuron only looks at a small region) and weight sharing (the same filter is applied across the entire image).
The Convolution Operation
A filter (or kernel) is a small matrix (e.g., 3x3) that slides across the image. At each position, it computes a dot product — "how much does this patch match my pattern?" The result is a feature map.
Think of it as a flashlight scanning a dark room. The flashlight (filter) illuminates a small area at a time. A "vertical edge detector" filter lights up when it finds a vertical edge. A "horizontal edge detector" finds horizontal edges. Stack many filters and you detect many features simultaneously.
import numpy as np
def conv2d(image, kernel, stride=1, padding=0):
"""Simple 2D convolution (single channel, single filter)"""
if padding > 0:
image = np.pad(image, padding, mode='constant')
h, w = image.shape
kh, kw = kernel.shape
out_h = (h - kh) // stride + 1
out_w = (w - kw) // stride + 1
output = np.zeros((out_h, out_w))
for i in range(out_h):
for j in range(out_w):
region = image[i*stride:i*stride+kh, j*stride:j*stride+kw]
output[i, j] = np.sum(region * kernel)
return output
# Vertical edge detector
image = np.array([
    [10, 10, 10, 0, 0, 0],
    [10, 10, 10, 0, 0, 0],
    [10, 10, 10, 0, 0, 0],
    [10, 10, 10, 0, 0, 0],
], dtype=float)
vertical_edge = np.array([
    [1, 0, -1],
    [1, 0, -1],
    [1, 0, -1],
], dtype=float)
edges = conv2d(image, vertical_edge)
print(edges)
# [[ 0. 30. 30.  0.]
#  [ 0. 30. 30.  0.]] — the filter responds only where the bright-to-dark edge sits
Key Concepts
- Stride: how many pixels the filter moves each step. Stride 2 halves the spatial dimensions.
- Padding: adding zeros around the border. "Same" padding preserves dimensions; "valid" (no padding) shrinks them.
- Receptive field: how much of the original input each neuron "sees." Deeper layers have larger receptive fields — they see more context.
- Number of parameters: a 3x3 filter with 64 input channels and 128 output filters has 3 × 3 × 64 × 128 + 128 = 73,856 weights. Compare to ~197M for a fully connected layer!
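Stride and padding combine into one output-size formula; a small helper (hypothetical, not from the text) makes it concrete:

```python
def conv_output_size(n, k, stride=1, padding=0):
    """Spatial output size of a conv layer: floor((n + 2p - k) / s) + 1"""
    return (n + 2 * padding - k) // stride + 1

# A 3x3 filter with padding 1 ("same") preserves the size; stride 2 halves it
print(conv_output_size(224, 3, stride=1, padding=1))  # 224
print(conv_output_size(224, 3, stride=2, padding=1))  # 112
```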
Pooling
Max pooling: take the maximum value in each region. Provides translation invariance ("the feature is somewhere in this area") and reduces spatial dimensions.
Average pooling: take the average. Often used in the final layer (global average pooling replaces the fully connected layer).
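A minimal max-pooling sketch in the same loop-based style as `conv2d` above (the example array is made up):

```python
import numpy as np

def max_pool2d(x, size=2, stride=2):
    """Max pooling over non-overlapping size x size windows (single channel)."""
    h, w = x.shape
    out_h = (h - size) // stride + 1
    out_w = (w - size) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Keep only the strongest response in each window
            out[i, j] = x[i*stride:i*stride+size, j*stride:j*stride+size].max()
    return out

x = np.array([[1, 3, 2, 1],
              [4, 6, 5, 0],
              [7, 2, 9, 8],
              [1, 0, 3, 4]], dtype=float)
print(max_pool2d(x))
# [[6. 5.]
#  [7. 9.]]
```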
CNN Architecture Pattern
The classic flow: Conv → ReLU → Pool → ... → Flatten → Dense → Output
As you go deeper: spatial dimensions shrink (via stride/pooling) while channel count grows (more filters). 224 × 224 × 3 → 112 × 112 × 64 → 56 × 56 × 128 → ... → 7 × 7 × 512 → flatten → dense → output.
Landmark Architectures
| Architecture | Year | Key Idea | Depth |
|---|---|---|---|
| LeNet-5 | 1998 | The pioneer — conv + pool + FC | 5 |
| AlexNet | 2012 | ReLU, dropout, GPU training | 8 |
| VGGNet | 2014 | "Just stack 3x3 convs" | 16-19 |
| GoogLeNet/Inception | 2014 | Parallel filters of different sizes | 22 |
| ResNet | 2015 | Skip connections | 50-152 |
| EfficientNet | 2019 | Compound scaling (width + depth + resolution) | varies |
ResNet: The Skip Connection Revolution
Before ResNet, making networks deeper actually hurt performance — not just from overfitting, but from optimization difficulty. A 56-layer network performed worse than a 20-layer one.
The fix is brilliantly simple: skip connections (residual connections). Instead of learning the full mapping H(x), learn the residual F(x) = H(x) - x. The output becomes:
a[l+2] = ReLU(z[l+2] + a[l])
Analogy: instead of building a bridge from scratch, start with the existing road (the identity, x) and learn what modifications to add. If the optimal modification is "nothing" (the identity mapping), the network just learns F(x) = 0, which is easy. Without skip connections, learning the identity through multiple non-linear layers is surprisingly hard.
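A sketch of that idea (the layer sizes and the two-linear-layer residual path are illustrative): if the residual path learns to output zeros, the whole block collapses to ReLU(a[l]) — the "do nothing" option is trivially reachable.

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def residual_block(a_l, W1, b1, W2, b2):
    """a[l] -> linear -> ReLU -> linear -> add a[l] (skip) -> ReLU"""
    z1 = W1 @ a_l + b1
    a1 = relu(z1)
    z2 = W2 @ a1 + b2
    return relu(z2 + a_l)   # the skip connection

n = 4
a_l = np.random.randn(n, 1)
# If the residual path is all zeros, the block is exactly ReLU(a[l])
W1 = np.zeros((n, n)); b1 = np.zeros((n, 1))
W2 = np.zeros((n, n)); b2 = np.zeros((n, 1))
print(np.allclose(residual_block(a_l, W1, b1, W2, b2), relu(a_l)))  # True
```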
1x1 Convolutions
A 1x1 convolution seems pointless — a filter that only looks at one pixel? But remember: it operates across channels. A 1x1 conv with 64 input channels and 32 output channels is a learned linear combination that reduces dimensionality. GoogLeNet uses this extensively to keep computation manageable.
Transfer Learning
The most important practical technique in modern deep learning. Instead of training from scratch:
- Take a model pre-trained on a large dataset (ImageNet, 14M images)
- Replace the final classification layer with one for your task
- Fine-tune on your smaller dataset
Why it works: early layers learn universal features (edges, textures, shapes) that transfer across tasks. A cat detector and a car detector both need edge detection. Only the final layers are task-specific.
Don't just fine-tune the last layer and call it done. Jeremy Howard's approach:
- Freeze all pre-trained layers. Train only the new head for 1-3 epochs.
- Unfreeze the last group of layers. Train with a lower learning rate for earlier layers than later layers (discriminative LRs — e.g., 1e-5 for early layers, 1e-3 for the head).
- Optionally unfreeze everything with even more aggressive discriminative rates.
The intuition: early layers have good universal features that need only gentle adjustment. Later layers need more aggressive learning to adapt to your task. This consistently outperforms the naive "freeze everything except the last layer" approach.
Sequence Models
Text, speech, time series, DNA sequences — data where order matters. "Dog bites man" and "Man bites dog" have the same words but very different meanings. Standard neural networks treat inputs as fixed-size, unordered vectors. Sequence models understand order.
Recurrent Neural Networks (RNNs)
An RNN processes one element at a time, maintaining a hidden state that serves as "memory" of what it's seen so far. At each time step:
h_t = tanh(W_hh · h_{t-1} + W_xh · x_t + b_h)
Think of it as reading a book word by word. After each word, you update your mental summary (hidden state) of the story so far. Your understanding of "bank" depends on whether the preceding words were about money or rivers.
import numpy as np
class SimpleRNN:
def __init__(self, input_size, hidden_size, output_size):
# Xavier initialization
scale_h = np.sqrt(1 / hidden_size)
scale_x = np.sqrt(1 / input_size)
self.Whh = np.random.randn(hidden_size, hidden_size) * scale_h
self.Wxh = np.random.randn(hidden_size, input_size) * scale_x
self.Why = np.random.randn(output_size, hidden_size) * scale_h
self.bh = np.zeros((hidden_size, 1))
self.by = np.zeros((output_size, 1))
def forward(self, inputs, h_prev):
"""
inputs: list of (input_size, 1) vectors, one per time step
h_prev: initial hidden state (hidden_size, 1)
"""
h = h_prev
hidden_states = []
for x_t in inputs:
h = np.tanh(self.Whh @ h + self.Wxh @ x_t + self.bh)
hidden_states.append(h)
# Output from final hidden state
y = self.Why @ h + self.by
return y, hidden_states
The Vanishing Gradient Problem
During backpropagation through time (BPTT), gradients are multiplied by Whh at each time step. Over a 100-step sequence, it's like multiplying a number by 0.9 a hundred times: 0.9¹⁰⁰ ≈ 0.0000265. The gradient effectively disappears, and the network can't learn long-range dependencies.
"The cat, which already ate a full meal and was lounging by the window watching the birds, was not hungry." An RNN struggles to connect "cat" to "was" across all those intervening words.
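The repeated-multiplication argument can be simulated directly. In this sketch the recurrent matrix is a made-up orthogonal matrix scaled by 0.9, so each backward step shrinks the gradient norm by exactly 0.9 (the tanh derivative, ignored here, would only shrink it further):

```python
import numpy as np

np.random.seed(1)
# Orthogonal matrix (from SVD) scaled by 0.9 -> largest singular value is 0.9
U = np.linalg.svd(np.random.randn(8, 8))[0]
Whh = 0.9 * U

grad = np.ones((8, 1))
for _ in range(100):
    grad = Whh.T @ grad          # one step of backprop through time
print(np.linalg.norm(grad))      # ≈ 0.9**100 * sqrt(8) ≈ 7.5e-5
```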
LSTM (Long Short-Term Memory)
LSTMs solve the vanishing gradient problem with a cell state — a "conveyor belt" that runs through the entire sequence. Information can flow along it unchanged, or be modified by three gates:
- Forget gate (f): "What should I forget from the cell state?" Sigmoid output — 0 means forget, 1 means keep.
- Input gate (i): "What new information should I add?" Controls which values to update.
- Output gate (o): "What should I output from the cell state?" Filters the cell state for the hidden state.
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def lstm_cell(x_t, h_prev, c_prev, params):
    """
    Single LSTM time step.
    x_t: (n_x, 1) input
    h_prev: (n_h, 1) previous hidden state
    c_prev: (n_h, 1) previous cell state
    """
    Wf, Wi, Wc, Wo = params['Wf'], params['Wi'], params['Wc'], params['Wo']
    bf, bi, bc, bo = params['bf'], params['bi'], params['bc'], params['bo']
    concat = np.vstack([h_prev, x_t])  # stack for single matrix multiply
    # Gates (all sigmoid → values between 0 and 1)
    f = sigmoid(Wf @ concat + bf)  # Forget gate: what to erase
    i = sigmoid(Wi @ concat + bi)  # Input gate: what to write
    o = sigmoid(Wo @ concat + bo)  # Output gate: what to expose
    # Candidate cell state
    c_candidate = np.tanh(Wc @ concat + bc)
    # Update cell state: forget some old + add some new
    c_next = f * c_prev + i * c_candidate
    # Hidden state: filtered cell state
    h_next = o * np.tanh(c_next)
    return h_next, c_next
The cell state acts as a highway: gradients can flow backward through it with minimal decay (the forget gate just multiplies by a value close to 1). This is why LSTMs can capture dependencies across hundreds of time steps.
GRU (Gated Recurrent Unit)
A simplified LSTM with only two gates (reset and update), combining the forget and input gates. Fewer parameters, often comparable performance:
- Update gate (z): how much of the previous state to keep (like forget + input combined)
- Reset gate (r): how much of the previous state to use when computing the new candidate
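A sketch of one GRU step, mirroring the LSTM cell above (the weight names, the stacked-[h; x] layout, and the random-weight usage example are assumptions for illustration; the update rule follows Ng's convention of blending the candidate with the previous state):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def gru_cell(x_t, h_prev, params):
    """Single GRU time step; weight matrices act on [h_prev; x_t] stacked."""
    concat = np.vstack([h_prev, x_t])
    z = sigmoid(params['Wz'] @ concat + params['bz'])   # update gate
    r = sigmoid(params['Wr'] @ concat + params['br'])   # reset gate
    # Candidate state is computed from the *reset* previous state
    h_candidate = np.tanh(params['Wh'] @ np.vstack([r * h_prev, x_t]) + params['bh'])
    # Blend old state and candidate according to the update gate
    return (1 - z) * h_prev + z * h_candidate

# Tiny usage example with random weights
n_x, n_h = 3, 4
rng = np.random.default_rng(0)
params = {k: rng.standard_normal((n_h, n_h + n_x)) * 0.1 for k in ('Wz', 'Wr', 'Wh')}
params.update({k: np.zeros((n_h, 1)) for k in ('bz', 'br', 'bh')})
h = gru_cell(rng.standard_normal((n_x, 1)), np.zeros((n_h, 1)), params)
print(h.shape)  # (4, 1)
```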
Bidirectional RNNs
A forward RNN only sees past context. But in "He said 'Teddy bears are great'", understanding "Teddy" requires seeing "bears" (future context). A bidirectional RNN runs two RNNs: one forward, one backward. At each time step, it concatenates both hidden states, giving access to both past and future context.
Practical Deep Learning
Andrew Ng's Course 3 (Structuring Machine Learning Projects) is entirely about practical advice. This section distills the most important lessons.
Train / Dev / Test Splits for Deep Learning
Classical ML used 60/20/20 splits. With modern datasets of millions of examples, you only need a small fraction for dev and test:
- Small data (<10K): 60% train / 20% dev / 20% test
- Medium data (10K-1M): 90% train / 5% dev / 5% test
- Large data (1M+): 98% train / 1% dev / 1% test — even 1% of 10M is 100K examples
The Deep Learning Recipe
Ng's systematic approach when your model isn't performing well:
- Does the model have high bias? (Training error is much worse than human-level)
- Yes → Bigger network, train longer, try different architecture
- Does the model have high variance? (Dev error is much worse than training error)
- Yes → More data, regularization, data augmentation, simpler architecture
- Both? Fix bias first (bigger model), then address variance
Error Analysis
When your model has 10% error, don't just throw more data or a bigger model at it. Manually examine 100 misclassified dev examples and categorize them:
| Error Category | Count | % of Errors |
|---|---|---|
| Blurry images | 32 | 32% |
| Mislabeled ground truth | 25 | 25% |
| Unusual angle/pose | 18 | 18% |
| Small object in frame | 15 | 15% |
| Other | 10 | 10% |
Now you know: fixing blurry images (denoising, better data) addresses 32% of errors. Fixing labels addresses 25%. This 30-minute exercise saves weeks of blind experimentation.
Transfer Learning: When It Works
Transfer from task A to task B works when:
- Task A has much more data than task B
- Low-level features from A are useful for B
- Both have the same input type (images, text, etc.)
Examples: ImageNet → medical imaging (YES), English NLP → French NLP (YES), image classification → speech recognition (NO — different input type).
Hyperparameter Tuning
Ng's priority ordering for hyperparameters:
- Learning rate (most important by far)
- Mini-batch size, number of hidden units
- Number of layers, learning rate decay
- Adam parameters β₁, β₂, ε (rarely need to change from defaults)
End-to-End Deep Learning
Should you replace a multi-stage pipeline (audio → phonemes → words → sentences) with one end-to-end model (audio → sentences)?
- Pros: lets the data speak, no hand-designed features, simpler system
- Cons: needs much more data, excludes potentially useful hand-designed knowledge
Ng's advice: use end-to-end when you have lots of data for the complete input-output mapping. When data is limited, a pipeline with hand-designed intermediate steps often works better.
Generative Models
Everything so far has been discriminative: given input, predict a label. Generative models learn to create new data that looks like the training data — new images, music, text.
Autoencoders
An encoder compresses the input into a small bottleneck vector (the latent representation), and a decoder reconstructs the original input from it. The bottleneck forces the network to learn a compact, meaningful representation.
Use cases: dimensionality reduction, denoising (train to reconstruct clean images from noisy ones), pre-training feature extractors.
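The whole architecture fits in a few lines of numpy. A forward-pass sketch with made-up layer sizes (8-dimensional input, 2-dimensional bottleneck), untrained random weights:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 1))            # 8-dim input

# Encoder: compress 8 -> 2 (the bottleneck)
W_enc = rng.standard_normal((2, 8)) * 0.5
z = np.tanh(W_enc @ x)                     # latent code, shape (2, 1)

# Decoder: reconstruct 2 -> 8
W_dec = rng.standard_normal((8, 2)) * 0.5
x_hat = W_dec @ z

# Training would minimize the reconstruction error:
loss = np.mean((x - x_hat) ** 2)
```

Because the bottleneck is far smaller than the input, the network can't simply copy; it must keep only what matters for reconstruction.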
Variational Autoencoders (VAEs)
Standard autoencoders map each input to a single arbitrary point in latent space, with no guarantee that the regions between those points decode to anything sensible. VAEs learn a smooth probability distribution instead: the encoder outputs a mean and variance for each input, and the decoder reconstructs from a sample. This means you can sample from the distribution and generate new, plausible examples. Nearby points in latent space produce similar outputs, giving smooth interpolation between a cat and a dog, or between a "5" and a "3".
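The sampling step uses the reparameterization trick: instead of sampling z directly (which isn't differentiable), sample noise and shift/scale it by the encoder's outputs, so gradients still flow to the mean and variance. A sketch with hypothetical encoder outputs for a 2-d latent space:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical encoder outputs for one input: a distribution, not a point
mu = np.array([[0.5], [-1.0]])        # mean of q(z|x)
log_var = np.array([[-1.0], [0.2]])   # log-variance of q(z|x)

# Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, I)
eps = rng.standard_normal(mu.shape)
z = mu + np.exp(0.5 * log_var) * eps  # z ~ N(mu, sigma^2), differentiable in mu, log_var

# z then feeds the decoder exactly as in a standard autoencoder.
```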
Generative Adversarial Networks (GANs)
The "counterfeiter vs detective" game:
- Generator (G): creates fake images from random noise. Its goal: fool the discriminator.
- Discriminator (D): receives real images and fakes, tries to tell them apart. Its goal: catch the fakes.
They train simultaneously. G gets better at creating fakes, D gets better at catching them, pushing G to produce increasingly realistic outputs. At equilibrium, D can't tell real from fake — G has learned to generate convincingly realistic data.
In practice, GAN training is notoriously unstable. Common failure modes:
- Mode collapse: G learns to produce only a few types of outputs, ignoring the diversity of real data
- Training oscillation: G and D "chase" each other without converging
- Vanishing gradients: if D becomes too strong, G gets no useful signal
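The game itself boils down to two binary cross-entropy losses. A sketch with made-up discriminator outputs (using the common non-saturating generator loss rather than the original minimax form, precisely to avoid the vanishing-gradient problem above):

```python
import numpy as np

def bce(p, label):
    """Binary cross-entropy for one predicted probability p and target label."""
    p = np.clip(p, 1e-7, 1 - 1e-7)   # avoid log(0)
    return -(label * np.log(p) + (1 - label) * np.log(1 - p))

# Hypothetical D outputs: probability the input is real
d_real = 0.9   # D's score on a real image
d_fake = 0.2   # D's score on G's fake image

# Discriminator wants d_real -> 1 and d_fake -> 0
d_loss = bce(d_real, 1.0) + bce(d_fake, 0.0)

# Generator (non-saturating form) wants d_fake -> 1
g_loss = bce(d_fake, 1.0)
```

Each training step alternates: update D on d_loss, then update G on g_loss, each holding the other's weights fixed.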
Diffusion Models
The latest breakthrough in image generation (DALL-E 2, Stable Diffusion, Midjourney). The idea:
- Forward process: gradually add Gaussian noise to a real image over many steps until it's pure noise.
- Reverse process: train a neural network to reverse each noise step — given a noisy image, predict the slightly-less-noisy version.
To generate a new image: start from pure noise and iteratively denoise. Each step is a small, tractable problem, and the results are stunning.
Diffusion models produce higher-quality, more diverse outputs than GANs, with more stable training. The trade-off: they're slower to generate (many denoising steps vs. one forward pass for GANs).
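A convenient property of the forward process: you can jump to any noise level t in closed form, without simulating all t steps. A sketch of this on a tiny "image", assuming a simple linear noise schedule (real systems tune the schedule carefully):

```python
import numpy as np

rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 4))        # a tiny stand-in "image"

# Linear noise schedule over T steps (hypothetical schedule)
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)     # cumulative fraction of signal kept

# Closed-form jump to noise level t: x_t = sqrt(a_bar)*x0 + sqrt(1-a_bar)*eps
t = 500
eps = rng.standard_normal(x0.shape)
x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1 - alpha_bar[t]) * eps

# The reverse network is trained to predict eps given (x_t, t).
```

As t grows, alpha_bar shrinks toward 0 and x_t approaches pure noise, which is exactly where generation starts.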
What's Next
Deep learning is the foundation. Here's where to go from here:
Transformers & LLMs
The attention mechanism allows models to look at all positions in a sequence simultaneously, solving the long-range dependency problem that plagued RNNs. Transformers are the backbone of GPT, BERT, and all modern language models. See the LLMs refresher.
Framework Mastery
Everything on this page was numpy from scratch — great for understanding, but you'll use frameworks for real work. PyTorch provides automatic differentiation, GPU acceleration, and pre-built components. See the PyTorch refresher.
Reinforcement Learning
Learning from rewards instead of labels. An agent takes actions in an environment, receives rewards, and learns a policy to maximize cumulative reward. Key concepts:
- Q-learning: learn the value of each (state, action) pair
- Policy gradients: directly optimize the policy (mapping from states to actions)
- Actor-critic: combine value estimation with policy optimization
Applications: game playing (AlphaGo), robotics, RLHF for language models.
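To make the first concept concrete: a single tabular Q-learning update fits in a few lines. A sketch on a made-up 3-state, 2-action problem, with one observed transition:

```python
import numpy as np

# Tabular Q-values: 3 states x 2 actions (hypothetical tiny MDP)
Q = np.zeros((3, 2))
alpha, gamma = 0.1, 0.99   # learning rate, discount factor

# One observed transition: in state 0, took action 1, got reward 1.0, landed in state 2
s, a, r, s_next = 0, 1, 1.0, 2

# Q-learning update: move Q[s, a] toward r + gamma * max_a' Q[s', a']
td_target = r + gamma * Q[s_next].max()
Q[s, a] += alpha * (td_target - Q[s, a])
```

Repeated over many transitions, Q converges toward the value of each (state, action) pair; the learned policy is then "pick the action with the highest Q in the current state".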
Current Frontiers
- Self-supervised learning: learn from unlabeled data by predicting parts of the input (masked language modeling, contrastive learning). This is how foundation models are trained.
- Multimodal models: jointly understanding images, text, audio, and video (CLIP, GPT-4V, Gemini).
- Efficient architectures: making models smaller and faster without losing quality (distillation, quantization, pruning, mixture of experts).
- AI agents: combining LLMs with tool use, memory, and planning for autonomous task completion.
Quick Reference: Deep Learning Decision Guide
| Task | Architecture | Start With |
|---|---|---|
| Image classification | CNN (ResNet, EfficientNet) | Pre-trained + fine-tune |
| Object detection | YOLO, Faster R-CNN | Pre-trained backbone |
| Text classification | Transformer (BERT) | Pre-trained + fine-tune |
| Text generation | Autoregressive LM (GPT) | API or fine-tuned model |
| Time series | LSTM / Temporal CNN | LSTM baseline |
| Tabular data | Gradient boosted trees first! | XGBoost / LightGBM |
| Image generation | Diffusion model | Stable Diffusion |
| Speech recognition | Transformer (Whisper) | Pre-trained Whisper |