Machine Learning Refresher
From linear regression to production ML — the Andrew Ng way
What is Machine Learning?
Arthur Samuel (1959) defined machine learning as the "field of study that gives computers the ability to learn without being explicitly programmed." Instead of writing rules by hand, you give the machine data and let it figure out the patterns.
Think of it this way: teaching a child to recognize cats. You don't hand them a 200-page manual of cat rules ("pointy ears, whiskers, fur..."). You show them hundreds of pictures of cats and not-cats, and they just learn. Machine learning works the same way.
Tom Mitchell gave a more precise definition: "A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E."
For a spam filter: T = classifying emails, E = watching you label emails as spam/not-spam, P = fraction of emails correctly classified.
Types of Machine Learning
| Type | You Give It | It Learns | Examples |
|---|---|---|---|
| Supervised | Inputs + correct outputs | Input → output mapping | Spam detection, house prices, image labels |
| Unsupervised | Inputs only (no labels) | Hidden structure in data | Customer segments, anomaly detection |
| Reinforcement | Environment + reward signal | Actions that maximize reward | Game playing, robotics, ad placement |
This refresher focuses on supervised and unsupervised learning — the bread and butter of applied ML.
The ML Workflow
Every ML project follows roughly the same loop:
- Define the problem — What are you predicting? What data do you have?
- Collect & explore data — Understand distributions, spot issues
- Prepare features — Clean, transform, engineer features
- Train a model — Pick an algorithm, fit it to training data
- Evaluate — How does it perform on unseen data?
- Iterate — Improve based on error analysis (the most important step!)
Linear Regression
Imagine you have data on house sizes and their selling prices. You want to predict the price of a new house given its size. This is the classic regression problem — predicting a continuous number.
The Hypothesis
We model the relationship as a straight line:
h(x) = θ₀ + θ₁x
Where θ₀ is the y-intercept (base price) and θ₁ is the slope (price per square foot). Our job: find the θ values that make the line fit the data best.
Cost Function
How do we measure "best fit"? We use the mean squared error (MSE):
J(θ) = (1/2m) Σ (h(x⁽ⁱ⁾) - y⁽ⁱ⁾)²
For each training example, compute the difference between our prediction and the actual value, square it (so positive and negative errors don't cancel out), and average over all examples. The 1/2 is just a convenience for calculus later.
Picture this cost function as a bowl — it's a smooth, convex surface. Every point on the bowl represents a pair of (θ₀, θ₁) values. The bottom of the bowl is where J is minimized — that's where we want to be.
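The bowl can be made concrete in a few lines of NumPy. A minimal sketch (the helper name `cost` is ours, not from the text) that evaluates J at the minimum and at one point uphill from it:

```python
import numpy as np

def cost(theta0, theta1, x, y):
    """Mean squared error cost J for the hypothesis h(x) = theta0 + theta1 * x."""
    m = len(x)
    predictions = theta0 + theta1 * x
    return (1 / (2 * m)) * np.sum((predictions - y) ** 2)

# Toy data that lies exactly on the line y = 1 + 2x
x = np.array([1.0, 2.0, 3.0])
y = 1 + 2 * x

print(cost(1.0, 2.0, x, y))  # 0.0: the bottom of the bowl
print(cost(0.0, 0.0, x, y))  # positive: uphill from the minimum
```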
Gradient Descent
Gradient descent is how we find the bottom of the bowl. The idea: stand somewhere on the bowl, look around, and take a step in the steepest downhill direction. Repeat.
The update rule:
# Repeat until convergence:
# θⱼ := θⱼ - α * ∂J/∂θⱼ
#
# Where α (alpha) is the learning rate — how big each step is.
# The partial derivative ∂J/∂θⱼ tells us the direction of steepest ascent,
# so we subtract it to go downhill.
import numpy as np
def gradient_descent(X, y, theta, alpha, iterations):
    """
    X: (m, n+1) matrix with 1s column prepended
    y: (m,) target vector
    theta: (n+1,) parameter vector
    alpha: learning rate
    iterations: number of steps
    """
    m = len(y)
    cost_history = []
    for _ in range(iterations):
        # Predictions for all examples at once (vectorized)
        predictions = X @ theta
        errors = predictions - y
        # Gradient: average of (error * feature) across all examples
        gradient = (1 / m) * (X.T @ errors)
        # Update parameters — step downhill
        theta = theta - alpha * gradient
        # Track cost
        cost = (1 / (2 * m)) * np.sum(errors ** 2)
        cost_history.append(cost)
    return theta, cost_history
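A self-contained run on noise-free synthetic data shows the update converging to the true parameters (it restates the same vectorized update so the snippet runs on its own; the learning rate and iteration count are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
m = 100
x = rng.uniform(0, 2, m)
y = 4 + 3 * x                          # true parameters: theta = [4, 3]
X = np.column_stack([np.ones(m), x])   # prepend the 1s column for theta_0

theta = np.zeros(2)
alpha = 0.1
for _ in range(5000):
    gradient = (1 / m) * (X.T @ (X @ theta - y))
    theta = theta - alpha * gradient

print(theta)  # close to [4, 3]
```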
Multiple Features
Houses have more than just size — bedrooms, age, location. With n features:
h(x) = θ₀ + θ₁x₁ + θ₂x₂ + ... + θₙxₙ = θᵀx
The vectorized form θᵀx makes it efficient. The same gradient descent algorithm works — just with more dimensions in the bowl.
Feature Scaling
Imagine x₁ = house size (0–5000 sq ft) and x₂ = number of bedrooms (1–5). The cost function contours become extremely elongated ellipses. Gradient descent bounces back and forth along the narrow axis, taking forever.
Fix: normalize features to similar ranges. Two common approaches:
# Mean normalization: (x - mean) / range
# Standardization: (x - mean) / std (most common)
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test) # use train stats!
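As a sanity check, the standardization formula matches what StandardScaler computes (the toy matrix below is made up):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[500.0, 1.0], [1500.0, 3.0], [3000.0, 5.0]])  # size, bedrooms
manual = (X - X.mean(axis=0)) / X.std(axis=0)
auto = StandardScaler().fit_transform(X)

print(np.allclose(manual, auto))  # True
```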
Normal Equation
There's actually a closed-form solution — no iteration needed:
θ = (XᵀX)⁻¹ Xᵀy
| Gradient Descent | Normal Equation |
|---|---|
| Need to choose α | No α needed |
| Many iterations | One computation |
| Works well even with many features | Slow if n > 10,000 (matrix inverse is O(n³)) |
| Scales to huge datasets | Needs all data in memory |
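In NumPy the normal equation is a couple of lines. A sketch on synthetic data (using `np.linalg.solve` rather than an explicit matrix inverse, which is numerically safer):

```python
import numpy as np

rng = np.random.default_rng(42)
m = 50
x = rng.uniform(0, 10, m)
y = 5 + 2 * x + rng.normal(0, 0.5, m)   # true parameters roughly [5, 2]
X = np.column_stack([np.ones(m), x])

# theta = (X^T X)^(-1) X^T y, solved as a linear system
theta = np.linalg.solve(X.T @ X, X.T @ y)
print(theta)  # close to [5, 2]
```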
scikit-learn
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import numpy as np
# Generate toy data: price ≈ 100 * size + 50000 * bedrooms + noise
np.random.seed(42)
m = 200
size = np.random.uniform(500, 3000, m)
beds = np.random.randint(1, 6, m)
price = 100 * size + 50000 * beds + np.random.normal(0, 20000, m)
X = np.column_stack([size, beds])
y = price
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
model = LinearRegression()
model.fit(X_train, y_train)
print(f"Coefficients: {model.coef_}") # ~[100, 50000]
print(f"Intercept: {model.intercept_:.0f}")
print(f"R² score: {model.score(X_test, y_test):.3f}")
print(f"RMSE: {mean_squared_error(y_test, model.predict(X_test)) ** 0.5:.0f}")  # RMSE = sqrt(MSE)
Logistic Regression
Now imagine classifying emails as spam or not-spam. The output isn't a continuous number — it's a yes/no decision. This is classification.
You might think: just use linear regression and threshold at 0.5. But that breaks badly — a single outlier far from the boundary can shift the line and ruin predictions for everything else. We need a function that naturally outputs values between 0 and 1.
The Sigmoid Function
Enter the sigmoid (logistic) function:
σ(z) = 1 / (1 + e⁻ᶻ)
It takes any real number and squashes it to (0, 1) — perfect for probabilities. Large positive z → ~1, large negative z → ~0, z = 0 → exactly 0.5.
import numpy as np
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Hypothesis: probability that y = 1
# h(x) = σ(θᵀx) = P(y=1 | x; θ)
def predict_proba(X, theta):
    return sigmoid(X @ theta)

# Decision: classify as 1 if P >= 0.5
def predict(X, theta):
    return (predict_proba(X, theta) >= 0.5).astype(int)
Cost Function: Log Loss
We can't use MSE here — plugging sigmoid into MSE creates a non-convex surface with many local minima. Gradient descent would get stuck.
Instead, we use log loss (binary cross-entropy):
J(θ) = -(1/m) Σ [y⁽ⁱ⁾ log(h(x⁽ⁱ⁾)) + (1-y⁽ⁱ⁾) log(1 - h(x⁽ⁱ⁾))]
The intuition: if the true label is 1 and your model predicts 0.99, the cost is tiny (-log(0.99) ≈ 0.01). But if it predicts 0.01, the cost is enormous (-log(0.01) ≈ 4.6). The penalty grows dramatically as your confidence points the wrong way.
def log_loss(X, y, theta):
    m = len(y)
    h = sigmoid(X @ theta)
    # Clip to avoid log(0)
    h = np.clip(h, 1e-10, 1 - 1e-10)
    return -(1 / m) * (y @ np.log(h) + (1 - y) @ np.log(1 - h))

def logistic_gradient_descent(X, y, theta, alpha, iterations):
    m = len(y)
    for _ in range(iterations):
        h = sigmoid(X @ theta)
        gradient = (1 / m) * (X.T @ (h - y))
        theta = theta - alpha * gradient
    return theta
Decision Boundaries
The decision boundary is where h(x) = 0.5, which means θᵀx = 0. With two features, this is a line. With polynomial features (x₁², x₁x₂, x₂²...), you get curves, circles, or any shape — at the cost of more parameters and potential overfitting.
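A quick illustration (the dataset and polynomial degree are our choices): on concentric-circle data a linear boundary is no better than chance, while degree-2 polynomial features let logistic regression carve out a circle:

```python
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Concentric circles: no straight line can separate the classes
X, y = make_circles(n_samples=300, noise=0.1, factor=0.4, random_state=0)

linear = LogisticRegression().fit(X, y)
poly = make_pipeline(PolynomialFeatures(degree=2), LogisticRegression()).fit(X, y)

print(f"Linear boundary:    {linear.score(X, y):.2f}")   # near chance
print(f"Quadratic boundary: {poly.score(X, y):.2f}")     # near perfect
```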
Multiclass Classification
What if you have more than two classes (e.g., cat, dog, bird)? Use one-vs-all (OvA): train K separate binary classifiers, each asking "is this class k or not?", then pick the class with the highest probability.
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# multi_class='ovr' for one-vs-rest (default for binary, but explicit here)
model = LogisticRegression(multi_class='ovr', max_iter=200)
model.fit(X_train, y_train)
print(f"Accuracy: {model.score(X_test, y_test):.3f}")
# Softmax alternative: multi_class='multinomial' generalizes log loss to K classes
model_softmax = LogisticRegression(multi_class='multinomial', max_iter=200)
model_softmax.fit(X_train, y_train)
print(f"Softmax accuracy: {model_softmax.score(X_test, y_test):.3f}")
Regularization
Imagine fitting a polynomial to data points. A degree-1 polynomial (line) might underfit — too simple to capture the pattern. A degree-15 polynomial passes through every point perfectly but wiggles wildly between them — it overfits. It memorized the training data, including the noise, and will generalize terribly.
The Bias-Variance Tradeoff
This is one of the most fundamental ideas in ML:
- High bias (underfitting): model is too simple. Both training and test errors are high. "I didn't study enough for the exam."
- High variance (overfitting): model is too complex. Training error is low, but test error is much higher. "I memorized the practice exam answers but can't solve new problems."
The sweet spot is in between. Regularization helps us get there by penalizing complexity.
L2 Regularization (Ridge)
Add a penalty for large weights to the cost function:
J(θ) = (1/2m) Σ(h(x⁽ⁱ⁾) - y⁽ⁱ⁾)² + (λ/2m) Σθⱼ²
The λ term controls the strength of regularization. Higher λ → simpler model (weights shrink toward zero). Note: we don't regularize θ₀ (the bias term).
Intuition: L2 says "keep the weights small." Small weights mean the model relies on gentle combinations of features rather than putting all its eggs in one basket.
L1 Regularization (Lasso)
L1 uses the absolute value of weights instead of squares:
J(θ) = (1/2m) Σ(h(x⁽ⁱ⁾) - y⁽ⁱ⁾)² + (λ/2m) Σ|θⱼ|
The key difference: L1 tends to drive some weights exactly to zero, effectively performing feature selection. L2 makes all weights small but rarely exactly zero.
| Property | L1 (Lasso) | L2 (Ridge) | Elastic Net |
|---|---|---|---|
| Penalty | Σ|θⱼ| | Σθⱼ² | α·Σ|θⱼ| + (1-α)·Σθⱼ² |
| Sparsity | Yes (zeros out features) | No (shrinks, never zero) | Some sparsity |
| Feature selection | Built-in | No | Partial |
| Correlated features | Picks one, ignores rest | Keeps all, shrinks equally | Groups them |
| When to use | Many irrelevant features | All features somewhat relevant | Correlated feature groups |
from sklearn.linear_model import Ridge, Lasso, ElasticNet
# Ridge (L2) — alpha is the λ parameter
ridge = Ridge(alpha=1.0)
ridge.fit(X_train, y_train)
# Lasso (L1) — notice some coefficients become exactly 0
lasso = Lasso(alpha=0.1)
lasso.fit(X_train, y_train)
print(f"Lasso zeros: {sum(lasso.coef_ == 0)} of {len(lasso.coef_)}")
# Elastic Net (L1 + L2)
enet = ElasticNet(alpha=0.1, l1_ratio=0.5) # l1_ratio controls L1 vs L2 mix
enet.fit(X_train, y_train)
In practice you tune the regularization strength with cross-validation; scikit-learn provides RidgeCV and LassoCV that do this automatically.
Neural Networks Intro
What if the decision boundary isn't a straight line or even a simple curve? What if the relationship between features and output is deeply non-linear? This is where neural networks shine.
The Neuron
A single neuron is just logistic regression: take a weighted sum of inputs, pass through an activation function, and output a number. Nothing mysterious.
a = σ(wᵀx + b)
The magic happens when you stack neurons into layers. Each layer transforms the data, creating increasingly abstract representations.
Architecture
Think of it as an assembly line:
- Input layer: raw features (pixel values, word counts, etc.)
- Hidden layers: each layer learns to detect patterns. Early layers find simple patterns (edges in images, common word pairs), later layers combine them into complex concepts (faces, sentiment).
- Output layer: the final prediction (class probabilities, regression value).
Analogy: recognizing a face. Layer 1 detects edges. Layer 2 combines edges into eyes, noses, mouths. Layer 3 combines features into "this looks like a face." Each layer builds on the previous one.
Forward Propagation
import numpy as np
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def relu(z):
    return np.maximum(0, z)

# Simple 2-layer network (1 hidden layer)
# Input: 3 features, Hidden: 4 neurons, Output: 1
np.random.seed(42)
W1 = np.random.randn(4, 3) * 0.01  # (hidden_size, input_size)
b1 = np.zeros((4, 1))
W2 = np.random.randn(1, 4) * 0.01  # (output_size, hidden_size)
b2 = np.zeros((1, 1))

def forward(X):
    """X is (3, m) — 3 features, m examples"""
    Z1 = W1 @ X + b1   # linear transform
    A1 = relu(Z1)      # activation
    Z2 = W2 @ A1 + b2  # linear transform
    A2 = sigmoid(Z2)   # output probability
    return A2
# Example: single input
x = np.array([[1.0], [0.5], [-0.3]]) # (3, 1)
print(f"Prediction: {forward(x)[0, 0]:.4f}")
Support Vector Machines
SVMs approach classification differently: instead of finding any boundary between classes, find the boundary with the maximum margin — the widest possible "street" separating the classes.
The Large Margin Intuition
Picture two groups of points on a 2D plane. Many lines could separate them. SVM picks the line that's as far as possible from both groups. The data points closest to the boundary are called support vectors — they "support" the margin. All other points are irrelevant.
This maximum margin tends to generalize better because it's less sensitive to small perturbations in the data.
The Kernel Trick
What if the data isn't linearly separable? Imagine red and blue points in a circle pattern — no line can separate them. But if you project them into a higher dimension (e.g., add a z = x² + y² feature), suddenly a flat plane can separate them.
The kernel trick lets SVMs compute in this higher-dimensional space without actually doing the expensive projection. Common kernels:
- Linear: K(x, z) = xᵀz — just the dot product, equivalent to no projection
- Polynomial: K(x, z) = (xᵀz + c)ᵈ — polynomial decision boundary
- RBF (Gaussian): K(x, z) = exp(-γ||x-z||²) — the most popular, creates smooth non-linear boundaries
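These kernels are just similarity functions you could write yourself. A sketch of the RBF kernel (the γ value is illustrative): 1 for identical points, decaying toward 0 with distance:

```python
import numpy as np

def rbf_kernel(x, z, gamma=1.0):
    """K(x, z) = exp(-gamma * ||x - z||^2)"""
    return np.exp(-gamma * np.sum((x - z) ** 2))

x = np.array([1.0, 2.0])
print(rbf_kernel(x, x))                        # 1.0: identical points
print(rbf_kernel(x, np.array([10.0, -4.0])))   # ~0: distant points
```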
The C Parameter
C controls the trade-off between a wide margin and correctly classifying every training point. Think of it as the inverse of regularization (λ = 1/C):
- Large C: narrow margin, fewer misclassifications (risk of overfitting)
- Small C: wide margin, allows some misclassifications (more regularization)
So when should you pick an SVM over logistic regression? Ng's rough guidelines, with n features and m training examples:
- n large relative to m (many features, few examples): logistic regression or SVM with linear kernel
- n small, m medium (10–10,000 examples): SVM with Gaussian kernel
- n small, m large (50,000+ examples): add features, then logistic regression or linear SVM (kernel SVM is too slow on large m)
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.datasets import make_moons
# Non-linearly separable data
X, y = make_moons(n_samples=300, noise=0.2, random_state=42)
# SVM with RBF kernel — always scale features first!
svm_rbf = make_pipeline(StandardScaler(), SVC(kernel='rbf', C=1.0, gamma='scale'))
svm_rbf.fit(X, y)
print(f"RBF accuracy: {svm_rbf.score(X, y):.3f}")
# Linear kernel — faster, works when data is linearly separable
svm_linear = make_pipeline(StandardScaler(), SVC(kernel='linear', C=1.0))
svm_linear.fit(X, y)
print(f"Linear accuracy: {svm_linear.score(X, y):.3f}")
Decision Trees & Ensembles
Decision trees are the "20 questions" of ML. Each node asks a yes/no question about a feature, splitting the data until you reach a prediction. They're intuitive, interpretable, and surprisingly powerful — especially when combined into ensembles.
How Trees Work
At each node, the algorithm picks the feature and threshold that best separates the classes. "Best" is measured by information gain (how much uncertainty is reduced):
- Gini impurity: probability of misclassifying a randomly chosen element (scikit-learn default)
- Entropy: information-theoretic measure of disorder (used in C4.5, ID3)
Both work well in practice. The tree keeps splitting until a stopping criterion is met (max depth, min samples per leaf, etc.).
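Both impurity measures are one-liners. A sketch (function names are ours) showing their values on a maximally mixed node and a pure one:

```python
import numpy as np

def gini(p):
    """Gini impurity for a vector of class probabilities."""
    p = np.asarray(p)
    return 1 - np.sum(p ** 2)

def entropy(p):
    """Entropy in bits, treating 0*log(0) as 0."""
    p = np.asarray([q for q in p if q > 0])
    return -np.sum(p * np.log2(p))

print(gini([0.5, 0.5]), entropy([0.5, 0.5]))  # 0.5 and 1.0: maximum disorder
print(gini([1.0, 0.0]), entropy([1.0, 0.0]))  # both 0: a pure node
```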
Pruning
Without limits, a tree will keep splitting until every leaf has one example — massively overfitting. Pruning prevents this:
- Pre-pruning: stop growing early (max_depth, min_samples_split)
- Post-pruning: grow full tree, then remove branches that don't improve validation score
Random Forests
"Wisdom of crowds" for ML. Random forests create many decision trees, each trained on a random subset of data (bagging) and a random subset of features. Their predictions are averaged (regression) or voted (classification).
Why it works: individual trees are noisy and overfit in different ways. Averaging them cancels out the noise while preserving the signal.
Gradient Boosting
Instead of building trees independently (like random forests), boosting builds them sequentially. Each new tree focuses on the mistakes of the previous ones. It's like a student reviewing the questions they got wrong.
- XGBoost: the "default winner" for structured/tabular data competitions
- LightGBM: faster, handles large datasets, leaf-wise growth
- CatBoost: handles categorical features natively
| Algorithm | Best For | Weakness |
|---|---|---|
| Linear / Logistic | Linear relationships, interpretability | Can't capture non-linearity |
| SVM (RBF) | Clean medium-size datasets | Slow on large datasets (large m), requires scaling |
| Decision Tree | Interpretability, no scaling needed | Overfits easily |
| Random Forest | General-purpose, robust | Slower, less interpretable |
| XGBoost/LightGBM | Tabular data, competitions | Tuning can be complex |
| Neural Networks | Images, text, audio, huge data | Needs lots of data, compute |
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
X, y = load_wine(return_X_y=True)
models = {
    "Decision Tree": DecisionTreeClassifier(max_depth=5),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=100, random_state=42),
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
    print(f"{name:20s}: {scores.mean():.3f} ± {scores.std():.3f}")
Unsupervised Learning
No labels. No right answers. Just data, and the goal is to find hidden structure. This is like looking at a pile of photos and organizing them into groups without anyone telling you the categories.
K-Means Clustering
The most intuitive clustering algorithm. Think of it as the "centroid dance":
- Randomly place K centroids in the feature space
- Assign each point to its nearest centroid
- Move each centroid to the mean of its assigned points
- Repeat steps 2-3 until centroids stop moving
import numpy as np
def k_means(X, K, max_iters=100):
    m, n = X.shape
    # Random initialization: pick K random data points as centroids
    indices = np.random.choice(m, K, replace=False)
    centroids = X[indices].copy()
    for _ in range(max_iters):
        # Assign: compute distance from each point to each centroid
        distances = np.linalg.norm(X[:, np.newaxis] - centroids, axis=2)  # (m, K)
        labels = np.argmin(distances, axis=1)  # (m,)
        # Update: move centroids to mean of assigned points
        new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        if np.allclose(centroids, new_centroids):
            break
        centroids = new_centroids
    return labels, centroids
Choosing K
- Elbow method: plot cost (sum of squared distances to assigned centroid) vs K. Look for the "elbow" where adding more clusters gives diminishing returns.
- Silhouette score: measures how similar each point is to its own cluster vs nearest neighbor cluster. Range [-1, 1], higher is better.
- Domain knowledge: often you know how many segments make sense (e.g., S/M/L t-shirt sizes).
K-means can converge to a poor local optimum depending on where the centroids start, so scikit-learn reruns it several times and keeps the best result (n_init=10). Use init='k-means++' for smarter initialization.
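Both heuristics in code, on blob data where the "right" answer is K=4 (the dataset parameters are made up; `inertia_` is scikit-learn's name for the K-means cost). Expect the cost to drop sharply until K=4 and the silhouette to peak there:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.8, random_state=42)

for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    sil = silhouette_score(X, km.labels_)
    print(f"K={k}: cost={km.inertia_:8.1f}  silhouette={sil:.3f}")
```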
Principal Component Analysis (PCA)
Imagine your data lives in 100-dimensional space, but most of the "action" happens along just a few directions. PCA finds those directions — the principal components — and lets you project the data onto them, reducing dimensionality while preserving as much variance as possible.
Intuition: you have a cloud of 3D points that roughly lies on a tilted plane. PCA finds the plane and projects the points onto it, giving you 2D data that captures nearly all the information.
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import numpy as np
# Always standardize before PCA!
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Find principal components
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)
# How much information did we keep?
print(f"Variance explained: {pca.explained_variance_ratio_}")
print(f"Total: {pca.explained_variance_ratio_.sum():.1%}")
# Choose n_components to retain 95% of variance
pca_95 = PCA(n_components=0.95)
X_95 = pca_95.fit_transform(X_scaled)
print(f"Components for 95% variance: {pca_95.n_components_}")
Anomaly Detection
Given a dataset of "normal" examples, flag anything that looks unusual. Use cases: fraud detection, manufacturing defects, server monitoring.
The simplest approach: fit a Gaussian distribution to each feature, then flag points where the probability p(x) falls below a threshold ε.
p(x) = Π p(xⱼ; μⱼ, σⱼ²) — product of individual feature probabilities
from sklearn.covariance import EllipticEnvelope
from sklearn.ensemble import IsolationForest
# Gaussian-based (assumes multivariate normal)
ee = EllipticEnvelope(contamination=0.05) # expect ~5% anomalies
ee.fit(X_train)
anomalies = ee.predict(X_test) # -1 for anomalies, 1 for normal
# Isolation Forest — works without distributional assumptions
# Idea: anomalies are "easy to isolate" with random splits
iso = IsolationForest(contamination=0.05, random_state=42)
iso.fit(X_train)
anomalies_iso = iso.predict(X_test)
Model Evaluation & Selection
Train / Dev / Test Splits
The exam analogy: training data is the textbook, dev (validation) data is the practice exam, test data is the final exam. You study from the textbook, check your understanding on practice exams, and prove your knowledge on the final.
- Small datasets (< 10,000): 60% train / 20% dev / 20% test
- Large datasets (1M+): 98% train / 1% dev / 1% test (1% of 1M is still 10,000 — plenty)
Cross-Validation
When data is limited, k-fold cross-validation gives a more reliable estimate. Split data into k folds, train on k-1 folds, evaluate on the held-out fold. Repeat k times and average.
from sklearn.model_selection import cross_val_score, StratifiedKFold
# 5-fold cross-validation — stratified preserves class proportions
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv, scoring='accuracy')
print(f"CV accuracy: {scores.mean():.3f} ± {scores.std():.3f}")
Diagnosing Bias vs Variance
This is one of the most important skills in applied ML. Look at training error and dev error:
| Training Error | Dev Error | Diagnosis | What to Try |
|---|---|---|---|
| High (15%) | High (16%) | High bias (underfit) | Bigger model, more features, less regularization |
| Low (1%) | High (11%) | High variance (overfit) | More data, more regularization, simpler model |
| High (15%) | High (30%) | Both | Bigger model AND more data/regularization |
| Low (1%) | Low (2%) | Good fit | Ship it! |
Classification Metrics
Accuracy alone is often misleading. If 99% of emails are not spam, a model that always predicts "not spam" has 99% accuracy — but it's useless.
| Metric | Formula | Measures | Use When |
|---|---|---|---|
| Precision | TP / (TP + FP) | "Of things I flagged, how many were right?" | False positives are costly (spam filter) |
| Recall | TP / (TP + FN) | "Of all positives, how many did I catch?" | False negatives are costly (cancer screening) |
| F1 Score | 2·P·R / (P+R) | Harmonic mean of P and R | Need to balance both |
| AUC-ROC | Area under ROC curve | Performance across all thresholds | Comparing models overall |
from sklearn.metrics import (
classification_report, confusion_matrix, roc_auc_score
)
y_pred = model.predict(X_test)
y_proba = model.predict_proba(X_test)[:, 1] # for AUC
print(classification_report(y_test, y_pred))
print(f"AUC-ROC: {roc_auc_score(y_test, y_proba):.3f}")
# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
# [[TN, FP],
# [FN, TP]]
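The table's formulas, computed by hand from a made-up confusion matrix in the same [[TN, FP], [FN, TP]] layout:

```python
import numpy as np

cm = np.array([[90, 10],   # TN, FP
               [ 5, 45]])  # FN, TP
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)  # 45 / 55: of flagged items, how many were right?
recall = tp / (tp + fn)     # 45 / 50: of all positives, how many were caught?
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```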
Recommender Systems
Netflix, Spotify, Amazon — they all need to answer: "given what we know about this user, what should we show them next?" Two main approaches:
Content-Based Filtering
Recommend items similar to what the user already liked. If you watched three sci-fi movies, recommend more sci-fi. Each item has a feature vector (genre, actors, director...), and you build a model per user that predicts their rating based on item features.
Pro: works for new items (no cold-start for items). Con: limited to known features, doesn't discover surprising recommendations.
Collaborative Filtering
"Users who liked X also liked Y." You don't need item features at all — just the user-item interaction matrix (ratings, clicks, purchases).
The key idea: matrix factorization. Decompose the giant (mostly empty) user-item matrix into two smaller matrices:
- User matrix U: each user is a vector of k latent factors
- Item matrix V: each item is a vector of k latent factors
- Predicted rating: r̂ᵤᵢ = uᵤᵀ · vᵢ
These latent factors might correspond to "action-ness", "romance-ness", etc., but they're learned automatically — you never tell the model what they mean.
import numpy as np
def collaborative_filter(R, K=10, alpha=0.01, lam=0.1, iters=100):
    """
    Simple matrix factorization for collaborative filtering.
    R: (n_users, n_items) rating matrix. 0 means unrated.
    K: number of latent factors
    """
    n_users, n_items = R.shape
    # Initialize latent factors randomly
    U = np.random.normal(0, 0.1, (n_users, K))
    V = np.random.normal(0, 0.1, (n_items, K))
    # Mask: which entries are observed?
    mask = R > 0
    for _ in range(iters):
        # Predicted ratings
        R_pred = U @ V.T
        # Error only on observed ratings
        error = (R - R_pred) * mask
        # Gradient descent with L2 regularization
        U += alpha * (error @ V - lam * U)
        V += alpha * (error.T @ U - lam * V)
    return U @ V.T  # full predicted rating matrix
ML Strategy
This is perhaps Andrew Ng's most valuable contribution: practical advice for what to do when your model isn't working. Most ML courses teach algorithms. This section teaches what to try next.
Error Analysis
When your model makes mistakes, don't just look at the numbers. Manually examine 100 misclassified examples. Categorize the errors:
- "30% of errors are blurry images" → invest in denoising or better data
- "50% of errors are mislabeled" → fix the labels first
- "15% are a rare category" → collect more examples of that category
This 30-minute exercise often saves weeks of blind model tuning.
Ceiling Analysis
For a multi-stage pipeline (e.g., face detection → feature extraction → recognition), manually give each stage perfect output and measure the overall improvement. This tells you which component to improve.
If perfect face detection only improves end-to-end accuracy by 1%, don't waste time on detection — work on recognition instead.
What to Try Next
Your model has high error. You could try many things — but which one? Diagnose first:
| Problem | Try These | Don't Bother With |
|---|---|---|
| High bias (underfit) | Bigger model, more/better features, train longer, less regularization | More data (won't help!), more regularization |
| High variance (overfit) | More data, regularization, simpler model, dropout, early stopping | Bigger model (makes it worse), less regularization |
Ng calls this principle orthogonalization: each knob should do one thing. To fix bias, make the model bigger. To fix variance, add regularization or data. Don't confuse the two. Early stopping, for example, violates this principle because it affects bias and variance simultaneously, making problems harder to diagnose.
Start with a Strong Baseline
Before spending weeks on a complex model:
- Human-level performance: how well can a human do this task? This is your ceiling.
- Simple baseline: logistic regression, random forest, or even rules. How far does simple get you?
- Error analysis: where does the baseline fail? That tells you what a fancier model needs to solve.
Data-Centric AI
Ng's latest emphasis: instead of endlessly tuning models, improve the data. In most practical applications:
- Cleaning labels gives bigger gains than a fancier model
- Adding high-quality data for the hard cases matters more than more data overall
- Consistent labeling guidelines help more than labeling more examples
From Prototype to Production
Feature Engineering
Raw data rarely goes straight into a model. You need to transform it:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
# Define transformations per column type
numeric_features = ['age', 'income', 'credit_score']
categorical_features = ['occupation', 'city', 'education']
numeric_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
])

categorical_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='ignore')),
])

preprocessor = ColumnTransformer([
    ('num', numeric_transformer, numeric_features),
    ('cat', categorical_transformer, categorical_features),
])

# Full pipeline: preprocess → model
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(n_estimators=100)),
])
# Now this single object handles everything
pipeline.fit(X_train, y_train)
score = pipeline.score(X_test, y_test)
Hyperparameter Tuning
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint, uniform
param_distributions = {
    'classifier__n_estimators': randint(50, 500),
    'classifier__max_depth': randint(3, 20),
    'classifier__min_samples_split': randint(2, 20),
    'classifier__min_samples_leaf': randint(1, 10),
}

search = RandomizedSearchCV(
    pipeline,
    param_distributions,
    n_iter=50,        # try 50 random combinations
    cv=5,             # 5-fold cross-validation
    scoring='f1',     # optimize for F1
    random_state=42,
    n_jobs=-1,        # use all CPU cores
)
search.fit(X_train, y_train)
print(f"Best F1: {search.best_score_:.3f}")
print(f"Best params: {search.best_params_}")
Both Ng and Bergstra & Bengio (2012) advocate random search over grid search. With grid search, if one parameter doesn't matter, you waste evaluations on different values of it. Random search explores more unique values of the parameters that do matter.
Saving & Loading Models
import joblib
# Save the entire pipeline (preprocessing + model)
joblib.dump(search.best_estimator_, 'model_pipeline.joblib')
# Load and predict
loaded_pipeline = joblib.load('model_pipeline.joblib')
predictions = loaded_pipeline.predict(new_data)
Monitoring in Production
Your model will degrade over time as the real world changes. Watch for:
- Data drift: input distribution shifts (new user demographics, seasonal patterns)
- Concept drift: the relationship between input and output changes (what users want evolves)
- Performance degradation: track prediction accuracy on a labeled sample over time
Set up alerts when metrics drop below a threshold, and retrain regularly.
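One lightweight way to flag data drift is a two-sample statistical test per feature. A sketch using SciPy's Kolmogorov–Smirnov test (the simulated shift and the p-value threshold are illustrative):

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
train_feature = rng.normal(0.0, 1.0, 5000)  # feature distribution at training time
live_feature = rng.normal(0.5, 1.0, 5000)   # production inputs have shifted

stat, p_value = ks_2samp(train_feature, live_feature)
if p_value < 0.01:
    print(f"Drift detected: KS={stat:.3f}, p={p_value:.2e}. Consider retraining.")
```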
Quick Reference: When to Use What
| Scenario | First Try | If That's Not Enough |
|---|---|---|
| Tabular, few features | Logistic/Linear Regression | Random Forest → XGBoost |
| Tabular, many features | XGBoost / LightGBM | Feature selection → Neural Net |
| Images | Pre-trained CNN (transfer learning) | Fine-tune deeper layers |
| Text | TF-IDF + Logistic Regression | Pre-trained LLM (see LLMs refresher) |
| Time series | ARIMA / Prophet | LSTM / Temporal CNN |
| Anomaly detection | Isolation Forest | Autoencoder |
| Clustering | K-Means | DBSCAN / Gaussian Mixture |