Machine Learning Refresher
From linear regression to production ML — the Andrew Ng way
What is Machine Learning?
Arthur Samuel (1959) defined machine learning as the "field of study that gives computers the ability to learn without being explicitly programmed." Instead of writing rules by hand, you give the machine data and let it figure out the patterns.
Think of it this way: teaching a child to recognize cats. You don't hand them a 200-page manual of cat rules ("pointy ears, whiskers, fur..."). You show them hundreds of pictures of cats and not-cats, and they just learn. Machine learning works the same way.
Tom Mitchell gave a more precise definition: "A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E."
For a spam filter: T = classifying emails, E = watching you label emails as spam/not-spam, P = fraction of emails correctly classified.
Types of Machine Learning
| Type | You Give It | It Learns | Examples |
|---|---|---|---|
| Supervised | Inputs + correct outputs | Input → output mapping | Spam detection, house prices, image labels |
| Unsupervised | Inputs only (no labels) | Hidden structure in data | Customer segments, anomaly detection |
| Reinforcement | Environment + reward signal | Actions that maximize reward | Game playing, robotics, ad placement |
This refresher focuses on supervised and unsupervised learning — the bread and butter of applied ML.
The ML Workflow
Every ML project follows roughly the same loop:
- Define the problem — What are you predicting? What data do you have?
- Collect & explore data — Understand distributions, spot issues
- Prepare features — Clean, transform, engineer features
- Train a model — Pick an algorithm, fit it to training data
- Evaluate — How does it perform on unseen data?
- Iterate — Improve based on error analysis (the most important step!)
Linear Regression
Imagine you have data on house sizes and their selling prices. You want to predict the price of a new house given its size. This is the classic regression problem — predicting a continuous number.
The Hypothesis
We model the relationship as a straight line:
h(x) = θ₀ + θ₁x
Where θ₀ is the y-intercept (base price) and θ₁ is the slope (price per square foot). Our job: find the θ values that make the line fit the data best.
Cost Function
How do we measure "best fit"? We use the mean squared error (MSE):
J(θ) = (1/2m) Σ (h(x⁽ⁱ⁾) - y⁽ⁱ⁾)²
For each training example, compute the difference between our prediction and the actual value, square it (so positive and negative errors don't cancel out), and average over all examples. The 1/2 is just a convenience for calculus later.
Picture this cost function as a bowl — it's a smooth, convex surface. Every point on the bowl represents a pair of (θ₀, θ₁) values. The bottom of the bowl is where J is minimized — that's where we want to be.
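The bowl can be made concrete in a few lines of NumPy. A minimal sketch (the helper name `cost` is ours, not from the text) that evaluates J at the minimum and at one point uphill from it:

```python
import numpy as np

def cost(theta0, theta1, x, y):
    """Mean squared error cost J for the hypothesis h(x) = theta0 + theta1 * x."""
    m = len(x)
    predictions = theta0 + theta1 * x
    return (1 / (2 * m)) * np.sum((predictions - y) ** 2)

# Toy data that lies exactly on the line y = 1 + 2x
x = np.array([1.0, 2.0, 3.0])
y = 1 + 2 * x

print(cost(1.0, 2.0, x, y))  # 0.0: the bottom of the bowl
print(cost(0.0, 0.0, x, y))  # positive: uphill from the minimum
```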
Gradient Descent
Gradient descent is how we find the bottom of the bowl. The idea: stand somewhere on the bowl, look around, and take a step in the steepest downhill direction. Repeat.
The update rule:
# Repeat until convergence:
# θⱼ := θⱼ - α * ∂J/∂θⱼ
#
# Where α (alpha) is the learning rate — how big each step is.
# The partial derivative ∂J/∂θⱼ tells us the direction of steepest ascent,
# so we subtract it to go downhill.
import numpy as np
def gradient_descent(X, y, theta, alpha, iterations):
    """
    X: (m, n+1) matrix with 1s column prepended
    y: (m,) target vector
    theta: (n+1,) parameter vector
    alpha: learning rate
    iterations: number of steps
    """
    m = len(y)
    cost_history = []
    for _ in range(iterations):
        # Predictions for all examples at once (vectorized)
        predictions = X @ theta
        errors = predictions - y
        # Gradient: average of (error * feature) across all examples
        gradient = (1 / m) * (X.T @ errors)
        # Update parameters — step downhill
        theta = theta - alpha * gradient
        # Track cost
        cost = (1 / (2 * m)) * np.sum(errors ** 2)
        cost_history.append(cost)
    return theta, cost_history
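A self-contained run on noise-free synthetic data shows the update converging to the true parameters (it restates the same vectorized update so the snippet runs on its own; the learning rate and iteration count are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
m = 100
x = rng.uniform(0, 2, m)
y = 4 + 3 * x                          # true parameters: theta = [4, 3]
X = np.column_stack([np.ones(m), x])   # prepend the 1s column for theta_0

theta = np.zeros(2)
alpha = 0.1
for _ in range(5000):
    gradient = (1 / m) * (X.T @ (X @ theta - y))
    theta = theta - alpha * gradient

print(theta)  # close to [4, 3]
```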
Multiple Features
Houses have more than just size — bedrooms, age, location. With n features:
h(x) = θ₀ + θ₁x₁ + θ₂x₂ + ... + θₙxₙ = θᵀx
The vectorized form θᵀx makes it efficient. The same gradient descent algorithm works — just with more dimensions in the bowl.
Feature Scaling
Imagine x₁ = house size (0–5000 sq ft) and x₂ = number of bedrooms (1–5). The cost function contours become extremely elongated ellipses. Gradient descent bounces back and forth along the narrow axis, taking forever.
Fix: normalize features to similar ranges. Two common approaches:
# Mean normalization: (x - mean) / range
# Standardization: (x - mean) / std (most common)
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test) # use train stats!
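As a sanity check, the standardization formula matches what StandardScaler computes (the toy matrix below is made up):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[500.0, 1.0], [1500.0, 3.0], [3000.0, 5.0]])  # size, bedrooms
manual = (X - X.mean(axis=0)) / X.std(axis=0)
auto = StandardScaler().fit_transform(X)

print(np.allclose(manual, auto))  # True
```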
Normal Equation
There's actually a closed-form solution — no iteration needed:
θ = (XᵀX)⁻¹ Xᵀy
| Gradient Descent | Normal Equation |
|---|---|
| Need to choose α | No α needed |
| Many iterations | One computation |
| Works well even with many features | Slow if n > 10,000 (matrix inverse is O(n³)) |
| Scales to huge datasets | Needs all data in memory |
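In NumPy the normal equation is a couple of lines. A sketch on synthetic data (using `np.linalg.solve` rather than an explicit matrix inverse, which is numerically safer):

```python
import numpy as np

rng = np.random.default_rng(42)
m = 50
x = rng.uniform(0, 10, m)
y = 5 + 2 * x + rng.normal(0, 0.5, m)   # true parameters roughly [5, 2]
X = np.column_stack([np.ones(m), x])

# theta = (X^T X)^(-1) X^T y, solved as a linear system
theta = np.linalg.solve(X.T @ X, X.T @ y)
print(theta)  # close to [5, 2]
```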
scikit-learn
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import numpy as np
# Generate toy data: price ≈ 100 * size + 50000 * bedrooms + noise
np.random.seed(42)
m = 200
size = np.random.uniform(500, 3000, m)
beds = np.random.randint(1, 6, m)
price = 100 * size + 50000 * beds + np.random.normal(0, 20000, m)
X = np.column_stack([size, beds])
y = price
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
model = LinearRegression()
model.fit(X_train, y_train)
print(f"Coefficients: {model.coef_}") # ~[100, 50000]
print(f"Intercept: {model.intercept_:.0f}")
print(f"R² score: {model.score(X_test, y_test):.3f}")
print(f"RMSE: {mean_squared_error(y_test, model.predict(X_test)) ** 0.5:.0f}")  # RMSE = sqrt(MSE)
Logistic Regression
Now imagine classifying emails as spam or not-spam. The output isn't a continuous number — it's a yes/no decision. This is classification.
You might think: just use linear regression and threshold at 0.5. But that breaks badly — a single outlier far from the boundary can shift the line and ruin predictions for everything else. We need a function that naturally outputs values between 0 and 1.
The Sigmoid Function
Enter the sigmoid (logistic) function:
σ(z) = 1 / (1 + e⁻ᶻ)
It takes any real number and squashes it to (0, 1) — perfect for probabilities. Large positive z → ~1, large negative z → ~0, z = 0 → exactly 0.5.
import numpy as np
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Hypothesis: probability that y = 1
# h(x) = σ(θᵀx) = P(y=1 | x; θ)
def predict_proba(X, theta):
    return sigmoid(X @ theta)

# Decision: classify as 1 if P >= 0.5
def predict(X, theta):
    return (predict_proba(X, theta) >= 0.5).astype(int)
Cost Function: Log Loss
We can't use MSE here — plugging sigmoid into MSE creates a non-convex surface with many local minima. Gradient descent would get stuck.
Instead, we use log loss (binary cross-entropy):
J(θ) = -(1/m) Σ [y⁽ⁱ⁾ log(h(x⁽ⁱ⁾)) + (1-y⁽ⁱ⁾) log(1 - h(x⁽ⁱ⁾))]
The intuition: if the true label is 1 and your model predicts 0.99, the cost is tiny (-log(0.99) ≈ 0.01). But if it predicts 0.01, the cost is enormous (-log(0.01) ≈ 4.6). The penalty grows dramatically as your confidence points the wrong way.
def log_loss(X, y, theta):
    m = len(y)
    h = sigmoid(X @ theta)
    # Clip to avoid log(0)
    h = np.clip(h, 1e-10, 1 - 1e-10)
    return -(1 / m) * (y @ np.log(h) + (1 - y) @ np.log(1 - h))

def logistic_gradient_descent(X, y, theta, alpha, iterations):
    m = len(y)
    for _ in range(iterations):
        h = sigmoid(X @ theta)
        gradient = (1 / m) * (X.T @ (h - y))
        theta = theta - alpha * gradient
    return theta
Decision Boundaries
The decision boundary is where h(x) = 0.5, which means θᵀx = 0. With two features, this is a line. With polynomial features (x₁², x₁x₂, x₂²...), you get curves, circles, or any shape — at the cost of more parameters and potential overfitting.
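A quick illustration (the dataset and polynomial degree are our choices): on concentric-circle data a linear boundary is no better than chance, while degree-2 polynomial features let logistic regression carve out a circle:

```python
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Concentric circles: no straight line can separate the classes
X, y = make_circles(n_samples=300, noise=0.1, factor=0.4, random_state=0)

linear = LogisticRegression().fit(X, y)
poly = make_pipeline(PolynomialFeatures(degree=2), LogisticRegression()).fit(X, y)

print(f"Linear boundary:    {linear.score(X, y):.2f}")   # near chance
print(f"Quadratic boundary: {poly.score(X, y):.2f}")     # near perfect
```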
Multiclass Classification
What if you have more than two classes (e.g., cat, dog, bird)? Use one-vs-all (OvA): train K separate binary classifiers, each asking "is this class k or not?", then pick the class with the highest probability.
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# multi_class='ovr' for one-vs-rest (default for binary, but explicit here)
model = LogisticRegression(multi_class='ovr', max_iter=200)
model.fit(X_train, y_train)
print(f"Accuracy: {model.score(X_test, y_test):.3f}")
# Softmax alternative: multi_class='multinomial' generalizes log loss to K classes
model_softmax = LogisticRegression(multi_class='multinomial', max_iter=200)
model_softmax.fit(X_train, y_train)
print(f"Softmax accuracy: {model_softmax.score(X_test, y_test):.3f}")
Regularization
Imagine fitting a polynomial to data points. A degree-1 polynomial (line) might underfit — too simple to capture the pattern. A degree-15 polynomial passes through every point perfectly but wiggles wildly between them — it overfits. It memorized the training data, including the noise, and will generalize terribly.
The Bias-Variance Tradeoff
This is one of the most fundamental ideas in ML:
- High bias (underfitting): model is too simple. Both training and test errors are high. "I didn't study enough for the exam."
- High variance (overfitting): model is too complex. Training error is low, but test error is much higher. "I memorized the practice exam answers but can't solve new problems."
The sweet spot is in between. Regularization helps us get there by penalizing complexity.
L2 Regularization (Ridge)
Add a penalty for large weights to the cost function:
J(θ) = (1/2m) Σ(h(x⁽ⁱ⁾) - y⁽ⁱ⁾)² + (λ/2m) Σθⱼ²
The λ term controls the strength of regularization. Higher λ → simpler model (weights shrink toward zero). Note: we don't regularize θ₀ (the bias term).
Intuition: L2 says "keep the weights small." Small weights mean the model relies on gentle combinations of features rather than putting all its eggs in one basket.
L1 Regularization (Lasso)
L1 uses the absolute value of weights instead of squares:
J(θ) = (1/2m) Σ(h(x⁽ⁱ⁾) - y⁽ⁱ⁾)² + (λ/2m) Σ|θⱼ|
The key difference: L1 tends to drive some weights exactly to zero, effectively performing feature selection. L2 makes all weights small but rarely exactly zero.
| Property | L1 (Lasso) | L2 (Ridge) | Elastic Net |
|---|---|---|---|
| Penalty | Σ|θⱼ| | Σθⱼ² | α·Σ|θⱼ| + (1-α)·Σθⱼ² |
| Sparsity | Yes (zeros out features) | No (shrinks, never zero) | Some sparsity |
| Feature selection | Built-in | No | Partial |
| Correlated features | Picks one, ignores rest | Keeps all, shrinks equally | Groups them |
| When to use | Many irrelevant features | All features somewhat relevant | Correlated feature groups |
from sklearn.linear_model import Ridge, Lasso, ElasticNet
# Ridge (L2) — alpha is the λ parameter
ridge = Ridge(alpha=1.0)
ridge.fit(X_train, y_train)
# Lasso (L1) — notice some coefficients become exactly 0
lasso = Lasso(alpha=0.1)
lasso.fit(X_train, y_train)
print(f"Lasso zeros: {sum(lasso.coef_ == 0)} of {len(lasso.coef_)}")
# Elastic Net (L1 + L2)
enet = ElasticNet(alpha=0.1, l1_ratio=0.5) # l1_ratio controls L1 vs L2 mix
enet.fit(X_train, y_train)
In practice you tune the regularization strength with cross-validation; scikit-learn provides RidgeCV and LassoCV that do this automatically.
Neural Networks Intro
What if the decision boundary isn't a straight line or even a simple curve? What if the relationship between features and output is deeply non-linear? This is where neural networks shine.
The Neuron
A single neuron is just logistic regression: take a weighted sum of inputs, pass through an activation function, and output a number. Nothing mysterious.
a = σ(wᵀx + b)
The magic happens when you stack neurons into layers. Each layer transforms the data, creating increasingly abstract representations.
Architecture
Think of it as an assembly line:
- Input layer: raw features (pixel values, word counts, etc.)
- Hidden layers: each layer learns to detect patterns. Early layers find simple patterns (edges in images, common word pairs), later layers combine them into complex concepts (faces, sentiment).
- Output layer: the final prediction (class probabilities, regression value).
Analogy: recognizing a face. Layer 1 detects edges. Layer 2 combines edges into eyes, noses, mouths. Layer 3 combines features into "this looks like a face." Each layer builds on the previous one.
Forward Propagation
import numpy as np
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def relu(z):
    return np.maximum(0, z)

# Simple 2-layer network (1 hidden layer)
# Input: 3 features, Hidden: 4 neurons, Output: 1
np.random.seed(42)
W1 = np.random.randn(4, 3) * 0.01  # (hidden_size, input_size)
b1 = np.zeros((4, 1))
W2 = np.random.randn(1, 4) * 0.01  # (output_size, hidden_size)
b2 = np.zeros((1, 1))

def forward(X):
    """X is (3, m) — 3 features, m examples"""
    Z1 = W1 @ X + b1   # linear transform
    A1 = relu(Z1)      # activation
    Z2 = W2 @ A1 + b2  # linear transform
    A2 = sigmoid(Z2)   # output probability
    return A2
# Example: single input
x = np.array([[1.0], [0.5], [-0.3]]) # (3, 1)
print(f"Prediction: {forward(x)[0, 0]:.4f}")
Support Vector Machines
SVMs approach classification differently: instead of finding any boundary between classes, find the boundary with the maximum margin — the widest possible "street" separating the classes.
The Large Margin Intuition
Picture two groups of points on a 2D plane. Many lines could separate them. SVM picks the line that's as far as possible from both groups. The data points closest to the boundary are called support vectors — they "support" the margin. All other points are irrelevant.
This maximum margin tends to generalize better because it's less sensitive to small perturbations in the data.
The Kernel Trick
What if the data isn't linearly separable? Imagine red and blue points in a circle pattern — no line can separate them. But if you project them into a higher dimension (e.g., add a z = x² + y² feature), suddenly a flat plane can separate them.
The kernel trick lets SVMs compute in this higher-dimensional space without actually doing the expensive projection. Common kernels:
- Linear: K(x, z) = xᵀz — just the dot product, equivalent to no projection
- Polynomial: K(x, z) = (xᵀz + c)ᵈ — polynomial decision boundary
- RBF (Gaussian): K(x, z) = exp(-γ||x-z||²) — the most popular, creates smooth non-linear boundaries
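These kernels are just similarity functions you could write yourself. A sketch of the RBF kernel (the γ value is illustrative): 1 for identical points, decaying toward 0 with distance:

```python
import numpy as np

def rbf_kernel(x, z, gamma=1.0):
    """K(x, z) = exp(-gamma * ||x - z||^2)"""
    return np.exp(-gamma * np.sum((x - z) ** 2))

x = np.array([1.0, 2.0])
print(rbf_kernel(x, x))                        # 1.0: identical points
print(rbf_kernel(x, np.array([10.0, -4.0])))   # ~0: distant points
```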
The C Parameter
C controls the trade-off between a wide margin and correctly classifying every training point. Think of it as the inverse of regularization (λ = 1/C):
- Large C: narrow margin, fewer misclassifications (risk of overfitting)
- Small C: wide margin, allows some misclassifications (more regularization)
So when should you pick an SVM over logistic regression? Ng's rough guidelines, with n features and m training examples:
- n large relative to m (many features, few examples): logistic regression or SVM with linear kernel
- n small, m medium (10–10,000 examples): SVM with Gaussian kernel
- n small, m large (50,000+ examples): add features, then logistic regression or linear SVM (kernel SVM is too slow on large m)
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.datasets import make_moons
# Non-linearly separable data
X, y = make_moons(n_samples=300, noise=0.2, random_state=42)
# SVM with RBF kernel — always scale features first!
svm_rbf = make_pipeline(StandardScaler(), SVC(kernel='rbf', C=1.0, gamma='scale'))
svm_rbf.fit(X, y)
print(f"RBF accuracy: {svm_rbf.score(X, y):.3f}")
# Linear kernel — faster, works when data is linearly separable
svm_linear = make_pipeline(StandardScaler(), SVC(kernel='linear', C=1.0))
svm_linear.fit(X, y)
print(f"Linear accuracy: {svm_linear.score(X, y):.3f}")
Decision Trees & Ensembles
Decision trees are the "20 questions" of ML. Each node asks a yes/no question about a feature, splitting the data until you reach a prediction. They're intuitive, interpretable, and surprisingly powerful — especially when combined into ensembles.
How Trees Work
At each node, the algorithm picks the feature and threshold that best separates the classes. "Best" is measured by information gain (how much uncertainty is reduced):
- Gini impurity: probability of misclassifying a randomly chosen element (scikit-learn default)
- Entropy: information-theoretic measure of disorder (used in C4.5, ID3)
Both work well in practice. The tree keeps splitting until a stopping criterion is met (max depth, min samples per leaf, etc.).
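Both impurity measures are one-liners. A sketch (function names are ours) showing their values on a maximally mixed node and a pure one:

```python
import numpy as np

def gini(p):
    """Gini impurity for a vector of class probabilities."""
    p = np.asarray(p)
    return 1 - np.sum(p ** 2)

def entropy(p):
    """Entropy in bits, treating 0*log(0) as 0."""
    p = np.asarray([q for q in p if q > 0])
    return -np.sum(p * np.log2(p))

print(gini([0.5, 0.5]), entropy([0.5, 0.5]))  # 0.5 and 1.0: maximum disorder
print(gini([1.0, 0.0]), entropy([1.0, 0.0]))  # both 0: a pure node
```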
Pruning
Without limits, a tree will keep splitting until every leaf has one example — massively overfitting. Pruning prevents this:
- Pre-pruning: stop growing early (max_depth, min_samples_split)
- Post-pruning: grow full tree, then remove branches that don't improve validation score
Random Forests
"Wisdom of crowds" for ML. Random forests create many decision trees, each trained on a random subset of data (bagging) and a random subset of features. Their predictions are averaged (regression) or voted (classification).
Why it works: individual trees are noisy and overfit in different ways. Averaging them cancels out the noise while preserving the signal.
Gradient Boosting
Instead of building trees independently (like random forests), boosting builds them sequentially. Each new tree focuses on the mistakes of the previous ones. It's like a student reviewing the questions they got wrong.
- XGBoost: the "default winner" for structured/tabular data competitions
- LightGBM: faster, handles large datasets, leaf-wise growth
- CatBoost: handles categorical features natively
| Algorithm | Best For | Weakness |
|---|---|---|
| Linear / Logistic | Linear relationships, interpretability | Can't capture non-linearity |
| SVM (RBF) | Clean medium-size datasets | Slow on large datasets (large m), requires scaling |
| Decision Tree | Interpretability, no scaling needed | Overfits easily |
| Random Forest | General-purpose, robust | Slower, less interpretable |
| XGBoost/LightGBM | Tabular data, competitions | Tuning can be complex |
| Neural Networks | Images, text, audio, huge data | Needs lots of data, compute |
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
X, y = load_wine(return_X_y=True)
models = {
    "Decision Tree": DecisionTreeClassifier(max_depth=5),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=100, random_state=42),
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
    print(f"{name:20s}: {scores.mean():.3f} ± {scores.std():.3f}")
Unsupervised Learning
No labels. No right answers. Just data, and the goal is to find hidden structure. This is like looking at a pile of photos and organizing them into groups without anyone telling you the categories.
K-Means Clustering
The most intuitive clustering algorithm. Think of it as the "centroid dance":
- Randomly place K centroids in the feature space
- Assign each point to its nearest centroid
- Move each centroid to the mean of its assigned points
- Repeat steps 2-3 until centroids stop moving
import numpy as np
def k_means(X, K, max_iters=100):
    m, n = X.shape
    # Random initialization: pick K random data points as centroids
    indices = np.random.choice(m, K, replace=False)
    centroids = X[indices].copy()
    for _ in range(max_iters):
        # Assign: compute distance from each point to each centroid
        distances = np.linalg.norm(X[:, np.newaxis] - centroids, axis=2)  # (m, K)
        labels = np.argmin(distances, axis=1)  # (m,)
        # Update: move centroids to mean of assigned points
        new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        if np.allclose(centroids, new_centroids):
            break
        centroids = new_centroids
    return labels, centroids
Choosing K
- Elbow method: plot cost (sum of squared distances to assigned centroid) vs K. Look for the "elbow" where adding more clusters gives diminishing returns.
- Silhouette score: measures how similar each point is to its own cluster vs nearest neighbor cluster. Range [-1, 1], higher is better.
- Domain knowledge: often you know how many segments make sense (e.g., S/M/L t-shirt sizes).
K-means can converge to a poor local optimum depending on where the centroids start, so scikit-learn reruns it several times and keeps the best result (n_init=10). Use init='k-means++' for smarter initialization.
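Both heuristics in code, on blob data where the "right" answer is K=4 (the dataset parameters are made up; `inertia_` is scikit-learn's name for the K-means cost). Expect the cost to drop sharply until K=4 and the silhouette to peak there:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.8, random_state=42)

for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    sil = silhouette_score(X, km.labels_)
    print(f"K={k}: cost={km.inertia_:8.1f}  silhouette={sil:.3f}")
```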
Principal Component Analysis (PCA)
Imagine your data lives in 100-dimensional space, but most of the "action" happens along just a few directions. PCA finds those directions — the principal components — and lets you project the data onto them, reducing dimensionality while preserving as much variance as possible.
Intuition: you have a cloud of 3D points that roughly lies on a tilted plane. PCA finds the plane and projects the points onto it, giving you 2D data that captures nearly all the information.
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import numpy as np
# Always standardize before PCA!
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Find principal components
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)
# How much information did we keep?
print(f"Variance explained: {pca.explained_variance_ratio_}")
print(f"Total: {pca.explained_variance_ratio_.sum():.1%}")
# Choose n_components to retain 95% of variance
pca_95 = PCA(n_components=0.95)
X_95 = pca_95.fit_transform(X_scaled)
print(f"Components for 95% variance: {pca_95.n_components_}")
Anomaly Detection
Given a dataset of "normal" examples, flag anything that looks unusual. Use cases: fraud detection, manufacturing defects, server monitoring.
The simplest approach: fit a Gaussian distribution to each feature, then flag points where the probability p(x) falls below a threshold ε.
p(x) = Π p(xⱼ; μⱼ, σⱼ²) — product of individual feature probabilities
from sklearn.covariance import EllipticEnvelope
from sklearn.ensemble import IsolationForest
# Gaussian-based (assumes multivariate normal)
ee = EllipticEnvelope(contamination=0.05) # expect ~5% anomalies
ee.fit(X_train)
anomalies = ee.predict(X_test) # -1 for anomalies, 1 for normal
# Isolation Forest — works without distributional assumptions
# Idea: anomalies are "easy to isolate" with random splits
iso = IsolationForest(contamination=0.05, random_state=42)
iso.fit(X_train)
anomalies_iso = iso.predict(X_test)
Model Evaluation & Selection
Train / Dev / Test Splits
The exam analogy: training data is the textbook, dev (validation) data is the practice exam, test data is the final exam. You study from the textbook, check your understanding on practice exams, and prove your knowledge on the final.
- Small datasets (< 10,000): 60% train / 20% dev / 20% test
- Large datasets (1M+): 98% train / 1% dev / 1% test (1% of 1M is still 10,000 — plenty)
Cross-Validation
When data is limited, k-fold cross-validation gives a more reliable estimate. Split data into k folds, train on k-1 folds, evaluate on the held-out fold. Repeat k times and average.
from sklearn.model_selection import cross_val_score, StratifiedKFold
# 5-fold cross-validation — stratified preserves class proportions
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv, scoring='accuracy')
print(f"CV accuracy: {scores.mean():.3f} ± {scores.std():.3f}")
Diagnosing Bias vs Variance
This is one of the most important skills in applied ML. Look at training error and dev error:
| Training Error | Dev Error | Diagnosis | What to Try |
|---|---|---|---|
| High (15%) | High (16%) | High bias (underfit) | Bigger model, more features, less regularization |
| Low (1%) | High (11%) | High variance (overfit) | More data, more regularization, simpler model |
| High (15%) | High (30%) | Both | Bigger model AND more data/regularization |
| Low (1%) | Low (2%) | Good fit | Ship it! |
Classification Metrics
Accuracy alone is often misleading. If 99% of emails are not spam, a model that always predicts "not spam" has 99% accuracy — but it's useless.
| Metric | Formula | Measures | Use When |
|---|---|---|---|
| Precision | TP / (TP + FP) | "Of things I flagged, how many were right?" | False positives are costly (spam filter) |
| Recall | TP / (TP + FN) | "Of all positives, how many did I catch?" | False negatives are costly (cancer screening) |
| F1 Score | 2·P·R / (P+R) | Harmonic mean of P and R | Need to balance both |
| AUC-ROC | Area under ROC curve | Performance across all thresholds | Comparing models overall |
from sklearn.metrics import (
classification_report, confusion_matrix, roc_auc_score
)
y_pred = model.predict(X_test)
y_proba = model.predict_proba(X_test)[:, 1] # for AUC
print(classification_report(y_test, y_pred))
print(f"AUC-ROC: {roc_auc_score(y_test, y_proba):.3f}")
# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
# [[TN, FP],
# [FN, TP]]
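The table's formulas, computed by hand from a made-up confusion matrix in the same [[TN, FP], [FN, TP]] layout:

```python
import numpy as np

cm = np.array([[90, 10],   # TN, FP
               [ 5, 45]])  # FN, TP
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)  # 45 / 55: of flagged items, how many were right?
recall = tp / (tp + fn)     # 45 / 50: of all positives, how many were caught?
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```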
Recommender Systems
Netflix, Spotify, Amazon — they all need to answer: "given what we know about this user, what should we show them next?" Two main approaches:
Content-Based Filtering
Recommend items similar to what the user already liked. If you watched three sci-fi movies, recommend more sci-fi. Each item has a feature vector (genre, actors, director...), and you build a model per user that predicts their rating based on item features.
Pro: works for new items (no cold-start for items). Con: limited to known features, doesn't discover surprising recommendations.
Collaborative Filtering
"Users who liked X also liked Y." You don't need item features at all — just the user-item interaction matrix (ratings, clicks, purchases).
The key idea: matrix factorization. Decompose the giant (mostly empty) user-item matrix into two smaller matrices:
- User matrix U: each user is a vector of k latent factors
- Item matrix V: each item is a vector of k latent factors
- Predicted rating: r̂ᵤᵢ = uᵤᵀ · vᵢ
These latent factors might correspond to "action-ness", "romance-ness", etc., but they're learned automatically — you never tell the model what they mean.
import numpy as np
def collaborative_filter(R, K=10, alpha=0.01, lam=0.1, iters=100):
    """
    Simple matrix factorization for collaborative filtering.
    R: (n_users, n_items) rating matrix. 0 means unrated.
    K: number of latent factors
    """
    n_users, n_items = R.shape
    # Initialize latent factors randomly
    U = np.random.normal(0, 0.1, (n_users, K))
    V = np.random.normal(0, 0.1, (n_items, K))
    # Mask: which entries are observed?
    mask = R > 0
    for _ in range(iters):
        # Predicted ratings
        R_pred = U @ V.T
        # Error only on observed ratings
        error = (R - R_pred) * mask
        # Gradient descent with L2 regularization
        U += alpha * (error @ V - lam * U)
        V += alpha * (error.T @ U - lam * V)
    return U @ V.T  # full predicted rating matrix
ML Strategy
This is perhaps Andrew Ng's most valuable contribution: practical advice for what to do when your model isn't working. Most ML courses teach algorithms. This section teaches what to try next.
Error Analysis
When your model makes mistakes, don't just look at the numbers. Manually examine 100 misclassified examples. Categorize the errors:
- "30% of errors are blurry images" → invest in denoising or better data
- "50% of errors are mislabeled" → fix the labels first
- "15% are a rare category" → collect more examples of that category
This 30-minute exercise often saves weeks of blind model tuning.
Ceiling Analysis
For a multi-stage pipeline (e.g., face detection → feature extraction → recognition), manually give each stage perfect output and measure the overall improvement. This tells you which component to improve.
If perfect face detection only improves end-to-end accuracy by 1%, don't waste time on detection — work on recognition instead.
What to Try Next
Your model has high error. You could try many things — but which one? Diagnose first:
| Problem | Try These | Don't Bother With |
|---|---|---|
| High bias (underfit) | Bigger model, more/better features, train longer, less regularization | More data (won't help!), more regularization |
| High variance (overfit) | More data, regularization, simpler model, dropout, early stopping | Bigger model (makes it worse), less regularization |
Ng calls this principle orthogonalization: each knob should do one thing. To fix bias, make the model bigger. To fix variance, add regularization or data. Don't confuse the two. Early stopping, for example, violates this principle because it affects bias and variance simultaneously, making problems harder to diagnose.
Start with a Strong Baseline
Before spending weeks on a complex model:
- Human-level performance: how well can a human do this task? This is your ceiling.
- Simple baseline: logistic regression, random forest, or even rules. How far does simple get you?
- Error analysis: where does the baseline fail? That tells you what a fancier model needs to solve.
Data-Centric AI
Ng's latest emphasis: instead of endlessly tuning models, improve the data. In most practical applications:
- Cleaning labels gives bigger gains than a fancier model
- Adding high-quality data for the hard cases matters more than more data overall
- Consistent labeling guidelines help more than labeling more examples
From Prototype to Production
Feature Engineering
Raw data rarely goes straight into a model. You need to transform it:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
# Define transformations per column type
numeric_features = ['age', 'income', 'credit_score']
categorical_features = ['occupation', 'city', 'education']
numeric_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
])

categorical_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='ignore')),
])

preprocessor = ColumnTransformer([
    ('num', numeric_transformer, numeric_features),
    ('cat', categorical_transformer, categorical_features),
])

# Full pipeline: preprocess → model
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(n_estimators=100)),
])
# Now this single object handles everything
pipeline.fit(X_train, y_train)
score = pipeline.score(X_test, y_test)
Hyperparameter Tuning
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint, uniform
param_distributions = {
    'classifier__n_estimators': randint(50, 500),
    'classifier__max_depth': randint(3, 20),
    'classifier__min_samples_split': randint(2, 20),
    'classifier__min_samples_leaf': randint(1, 10),
}

search = RandomizedSearchCV(
    pipeline,
    param_distributions,
    n_iter=50,        # try 50 random combinations
    cv=5,             # 5-fold cross-validation
    scoring='f1',     # optimize for F1
    random_state=42,
    n_jobs=-1,        # use all CPU cores
)
search.fit(X_train, y_train)
print(f"Best F1: {search.best_score_:.3f}")
print(f"Best params: {search.best_params_}")
Both Ng and Bergstra & Bengio (2012) advocate random search over grid search. With grid search, if one parameter doesn't matter, you waste evaluations on different values of it. Random search explores more unique values of the parameters that do matter.
Saving & Loading Models
import joblib
# Save the entire pipeline (preprocessing + model)
joblib.dump(search.best_estimator_, 'model_pipeline.joblib')
# Load and predict
loaded_pipeline = joblib.load('model_pipeline.joblib')
predictions = loaded_pipeline.predict(new_data)
Monitoring in Production
Your model will degrade over time as the real world changes. Watch for:
- Data drift: input distribution shifts (new user demographics, seasonal patterns)
- Concept drift: the relationship between input and output changes (what users want evolves)
- Performance degradation: track prediction accuracy on a labeled sample over time
Set up alerts when metrics drop below a threshold, and retrain regularly.
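One lightweight way to flag data drift is a two-sample statistical test per feature. A sketch using SciPy's Kolmogorov–Smirnov test (the simulated shift and the p-value threshold are illustrative):

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
train_feature = rng.normal(0.0, 1.0, 5000)  # feature distribution at training time
live_feature = rng.normal(0.5, 1.0, 5000)   # production inputs have shifted

stat, p_value = ks_2samp(train_feature, live_feature)
if p_value < 0.01:
    print(f"Drift detected: KS={stat:.3f}, p={p_value:.2e}. Consider retraining.")
```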
Quick Reference: When to Use What
| Scenario | First Try | If That's Not Enough |
|---|---|---|
| Tabular, few features | Logistic/Linear Regression | Random Forest → XGBoost |
| Tabular, many features | XGBoost / LightGBM | Feature selection → Neural Net |
| Images | Pre-trained CNN (transfer learning) | Fine-tune deeper layers |
| Text | TF-IDF + Logistic Regression | Pre-trained LLM (see LLMs refresher) |
| Time series | ARIMA / Prophet | LSTM / Temporal CNN |
| Anomaly detection | Isolation Forest | Autoencoder |
| Clustering | K-Means | DBSCAN / Gaussian Mixture |