System Design Refresher
Principal-engineer-level architecture decisions for modern software and data systems — tradeoffs, patterns, and real-world context
1. Thinking Like a Principal Engineer
The difference between a senior engineer and a principal engineer is not technical depth alone — it is the ability to reason about systems at the organizational and multi-year timescale. Every architectural decision carries a cost that compounds over time: in maintenance burden, in team cognitive overhead, in ability to hire, and in operational complexity.
Reversible vs. Irreversible Decisions
Jeff Bezos called these Type 1 and Type 2 decisions. Type 1 decisions are one-way doors — hard to reverse once made. Type 2 are two-way doors. Most architectural decisions are actually reversible with enough effort, but the cost of reversal varies enormously.
| Decision | Reversibility | Cost to reverse | Implication |
|---|---|---|---|
| Choice of primary database | Low | Very high (migration, downtime risk) | Spend more time upfront; involve team |
| API contract with external clients | Low | High (breaking changes, versioning) | Design carefully; version from day 1 |
| Internal service decomposition | Medium | Medium (refactor, redeploy) | Start monolith, split when needed |
| Choice of queue/message broker | Medium | Medium (consumer rewrite) | Abstract behind interface early |
| Caching strategy | High | Low (change cache logic) | Iterate quickly; optimize last |
| Logging format | High | Low (script migration) | Move fast; standardize later |
The "Boring Technology" Principle
Dan McKinley's essay on choosing boring technology remains one of the most practically useful things written about engineering. The core insight: every technology you adopt has an "innovation token" cost. You only have a few tokens to spend. Use them on things that actually differentiate your business, not on being an early adopter of every new database or orchestration framework.
- Postgres is boring. It handles 99% of relational workloads, has been production-hardened for 30 years, has excellent tooling, and every engineer knows it. Start here.
- Redis is boring. Use it for caching, queues, rate limiting, pub/sub. Don't invent your own.
- Kafka is boring for high-throughput event streaming. But it is also operationally heavy — use a managed version or consider a simpler alternative first.
- New technology earns a token only if it solves a problem that your boring stack demonstrably cannot.
Build vs. Buy Framework
| Factor | Build | Buy/Use OSS |
|---|---|---|
| Differentiating capability? | Yes — core to your product | No — commodity infrastructure |
| Team expertise available? | Yes, deeply | No, and hiring is slow |
| OSS/vendor solution exists? | Doesn't exist or is poor fit | Mature, widely adopted |
| Scale requirement unusual? | Yes, at extreme scale | No, typical scale |
| Compliance/data sovereignty? | Required on-prem/custom | Vendor meets requirements |
| Time to market? | Not a constraint | Urgent — ship fast |
Scope vs. Depth
Principal engineers think in systems, not components. When you propose or review a design, ask:
- What does this fail like? Not "will it fail" — everything fails. How does the failure manifest? Is it silent or noisy? Does it cascade?
- What does this look like at 10x traffic? Not "will it scale" — will it scale cheaply, or will the cost curve become a crisis?
- What does the on-call rotation look like? Every system you build, someone maintains at 3am. Is the runbook writable?
- What is the blast radius? If this service is down, what else breaks? Is the dependency graph a tree or a tangle?
- How will this be tested? If you can't describe how to test it, you don't understand it well enough to build it.
2. Monolith vs. Microservices
The monolith vs. microservices debate is almost always answered wrong in interviews and architecture reviews because engineers focus on technical properties instead of organizational context. The right answer depends on team size, deployment maturity, and domain clarity — not on what Netflix does.
When the Monolith Is Correct
The monolith is the correct default for most systems. It is not a legacy artifact — it is the appropriate architecture when:
- Team size is under ~50 engineers (a single team can own the codebase)
- The domain model is still being discovered (premature service boundaries calcify bad decisions)
- Operational maturity is low (you don't have container orchestration, service mesh, distributed tracing in place)
- The performance budget is tight (in-process calls are 100-1000x cheaper than network calls)
- Deployment frequency is low (a single deployable unit is far simpler to manage)
When to Decompose — Migration Triggers
Decompose when you have evidence, not anticipation, of these problems:
- Independent scalability — one part of the system needs 10x more resources than another, and they're coupled in deployment
- Team autonomy — two teams are blocking each other on every release because they share deployment artifacts
- Technology divergence — one subsystem genuinely needs a different runtime (e.g., ML inference in Python, real-time in Rust)
- Isolation of failure — one component failing should not bring down unrelated features
- Regulatory/compliance boundaries — PCI, HIPAA, or data residency requirements that demand hard isolation
Domain-Driven Service Boundaries
When you do decompose, use Domain-Driven Design bounded contexts as the primary partitioning criterion, not technical layers (never "put all the database access in one service"). A bounded context is a semantic boundary where a term has a consistent meaning. The canonical mistake is splitting horizontally (API service, business logic service, data service) instead of vertically by domain capability.
# Bad decomposition: technical layers
# Every feature change touches all three services
api-gateway/
business-logic-service/
data-access-service/
# Good decomposition: domain capability (vertical slice)
# Each service is independently deployable and owns its data
order-service/ # owns orders table, order events
inventory-service/ # owns inventory, warehouse events
payment-service/ # owns transactions, PCI-scoped
notification-service/ # consumes events, sends emails/SMS
user-service/ # identity, auth, profiles
The "Distributed Monolith" Anti-Pattern
A distributed monolith has the worst of both worlds: the operational complexity of microservices with the coupling of a monolith. You have it when:
- Services cannot be deployed independently (shared database, tight synchronous coupling)
- A change in Service A requires coordinated changes in Services B, C, and D
- Services share a database and write to each other's tables directly
- Integration tests require all services to be running simultaneously
Service Size Heuristics
There is no canonical answer to "how big should a microservice be," but useful heuristics:
- Two-pizza team rule (Amazon) — one team should be able to own, build, and operate the service
- Single deployment unit — if you always deploy two services together, merge them
- Independent database — the service owns all its data; no shared tables with other services
- Fits in a developer's head — a new team member should understand the service's full scope in a day
3. Synchronous vs. Asynchronous Communication
REST vs. gRPC vs. GraphQL
| Protocol | Best For | Avoid When | Key Tradeoff |
|---|---|---|---|
| REST | Public APIs, CRUD operations, browser clients, simple integrations | High-throughput internal RPC, strongly typed contracts | Human-readable, ubiquitous tooling; but verbose, no schema enforcement |
| gRPC | Internal service-to-service, polyglot environments, streaming, performance-critical paths | Browser clients (requires grpc-web), simple public APIs | Typed contracts, binary efficiency, streaming; but harder to debug, requires Protobuf |
| GraphQL | Aggregation layer for heterogeneous backends, mobile clients with bandwidth constraints, rapid UI iteration | Simple CRUD APIs, ML pipelines, non-client-facing services | Client-driven queries reduce over/under-fetching; but N+1 query problem, caching is harder, complex auth |
| WebSocket | Real-time bidirectional (chat, live dashboards, games) | Request-response patterns, sporadic updates | True push; but stateful connections complicate load balancing and scaling |
Event-Driven Architecture
Event-driven systems decouple producers from consumers temporally. The producer fires an event and does not wait for or know about downstream consumers. This enables loose coupling, independent scaling, and auditable state — but introduces eventual consistency and makes distributed tracing essential.
# Event-driven flow (order placement example)
#
# Synchronous (tight coupling):
# Client → OrderService → InventoryService → PaymentService → NotificationService
# (blocks 500ms waiting for chain)
#
# Event-driven (loose coupling):
# Client → OrderService → publishes "order.created" event
# │
# ┌──────────────────┼──────────────────┐
# ▼ ▼ ▼
# InventoryService PaymentService NotificationService
# (reserves stock) (charges card) (sends confirmation)
# (independent, (independent, (independent,
# own retry logic) own retry logic) own retry logic)
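The fan-out above can be sketched as a minimal in-process event bus. This is illustrative only: the EventBus class, topic name, and handlers are invented for the sketch, and a real system would publish to a broker such as Kafka or SQS.

```python
from collections import defaultdict
from typing import Callable

class EventBus:
    """Minimal in-process pub/sub: the producer never knows its consumers."""
    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic: str, handler: Callable[[dict], None]) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, event: dict) -> None:
        # Fire-and-forget: each consumer processes (and would retry) independently
        for handler in self._subscribers[topic]:
            handler(event)

bus = EventBus()
handled = []
bus.subscribe("order.created", lambda e: handled.append(("inventory", e["order_id"])))
bus.subscribe("order.created", lambda e: handled.append(("payment", e["order_id"])))
bus.subscribe("order.created", lambda e: handled.append(("notification", e["order_id"])))

bus.publish("order.created", {"order_id": 42})
```

Adding a fourth consumer requires no change to the producer, which is the decoupling the diagram illustrates.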
Message Queues vs. Event Streams
| Property | Message Queue (RabbitMQ, SQS) | Event Stream (Kafka, Kinesis) |
|---|---|---|
| Consumer model | Competing consumers (one gets the message) | Consumer groups (each group gets all messages) |
| Message retention | Deleted after acknowledgment | Retained for configurable period (days/forever) |
| Replay | Not possible after consumption | Any consumer can replay from offset 0 |
| Ordering guarantee | FIFO within queue (not across) | Ordered within partition |
| Consumer count scaling | Add consumers freely | Bounded by partition count |
| Use case | Task dispatch, work queue, RPC | Event log, audit trail, stream processing |
| Throughput | Moderate (tens of thousands/sec) | High (millions/sec with partitioning) |
Request-Reply vs. Fire-and-Forget
Not all async communication is fire-and-forget. The Saga pattern and choreography-based workflows often need to correlate events across multiple services. Key patterns:
- Fire-and-forget — producer publishes event, does not wait. Works for notifications, analytics, non-critical side effects
- Async request-reply — producer sends message with correlation ID, polls or subscribes to response. Works for long-running jobs
- Saga (choreography) — each service listens for events, does its work, emits next event. No central coordinator. Easier to scale, harder to debug
- Saga (orchestration) — a central orchestrator tells each service what to do. Easier to understand and trace, introduces coordination bottleneck
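The async request-reply pattern above can be sketched with a correlation ID. The in-memory Broker class, queue names, and message fields are invented stand-ins for a real queue such as RabbitMQ or SQS.

```python
import uuid

class Broker:
    """Toy message broker: named FIFO queues (stand-in for RabbitMQ/SQS)."""
    def __init__(self):
        self.queues = {}

    def send(self, queue: str, msg: dict) -> None:
        self.queues.setdefault(queue, []).append(msg)

    def receive(self, queue: str):
        q = self.queues.get(queue)
        return q.pop(0) if q else None

broker = Broker()

# Producer: tag the request with a correlation ID, then watch a reply queue
correlation_id = str(uuid.uuid4())
broker.send("jobs", {"correlation_id": correlation_id, "task": "render_report"})

# Consumer: do the work, echo the correlation ID back on the reply queue
job = broker.receive("jobs")
broker.send("jobs.replies", {"correlation_id": job["correlation_id"], "status": "done"})

# Producer: match the reply to its original request by correlation ID
reply = broker.receive("jobs.replies")
```

The correlation ID is what turns two independent fire-and-forget messages into a logical request-reply pair.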
4. Database Selection
Database Taxonomy and Selection
| Type | Examples | Data Model | Use When | Avoid When |
|---|---|---|---|---|
| Relational | Postgres, MySQL, CockroachDB | Tables, rows, ACID | Structured data, complex queries, strong consistency needed, joins are natural | Schema changes happen constantly, write throughput > 100k/sec/node |
| Document | MongoDB, Firestore, CouchDB | JSON documents, hierarchical | Deeply nested documents, schema varies per record, developer velocity matters more than query flexibility | You need joins across document types frequently, strong consistency required |
| Wide-Column | Cassandra, DynamoDB, Bigtable | Row key + column families | Write-heavy, high throughput, known access patterns, global distribution needed | Ad-hoc queries, unknown access patterns, complex relationships |
| Graph | Neo4j, Amazon Neptune, JanusGraph | Nodes, edges, properties | Relationship-heavy: social graphs, recommendation, fraud detection, knowledge graphs | Data is primarily tabular, graph traversals are rare |
| Time-Series | InfluxDB, TimescaleDB, Prometheus | Metric + timestamp + tags | Monitoring, IoT, financial tick data, any append-only time-stamped data | Data is not primarily time-ordered, you need complex joins |
| OLAP Columnar | ClickHouse, BigQuery, Snowflake, Redshift | Columnar, MPP | Analytical queries over billions of rows, aggregations, BI dashboards | Transactional workloads, point lookups, frequent updates |
| Search | Elasticsearch, OpenSearch, Typesense | Inverted index, denormalized | Full-text search, faceted search, log analytics | Primary source of truth, strong consistency required |
Postgres vs. DynamoDB — The Real Comparison
This is the most common "relational vs. NoSQL" choice in cloud-native systems. It is not a question of scale — it is a question of access patterns.
| Dimension | Postgres | DynamoDB |
|---|---|---|
| Query flexibility | Arbitrary SQL — any predicate, any join | Access by partition key or GSI only — define access patterns upfront |
| Consistency | Serializable ACID by default | Eventually consistent by default; strong consistency optional at higher cost |
| Scalability | Vertical + read replicas; horizontal via Citus or partitioning | Horizontal by design; automatic partitioning, effectively unbounded throughput |
| Operational overhead | Requires sizing, vacuuming, index maintenance, failover management | Fully managed; no servers, automatic scaling, multi-region built-in |
| Cost model | Fixed (instance-based) — predictable at stable load | Pay per request — can be expensive at high uniform throughput, cheap at spiky load |
| Schema flexibility | Strict schema with migrations | Schemaless — different items in same table can have different attributes |
| Transactions | Full multi-table ACID transactions | Limited transactions (up to 100 items, same region) |
Cassandra vs. Postgres for Write-Heavy Workloads
Cassandra is optimized for write throughput at the expense of read flexibility. It uses an LSM tree internally (appends go to a memtable, flushed to SSTables, compacted periodically). Reads may need to merge multiple SSTables. This makes writes extremely fast but reads more expensive than B-tree databases unless they align with the partition key.
- Use Cassandra when: millions of writes/second, known read access patterns (always by user_id + time range), multi-datacenter replication needed, time-series or event log data
- Use Postgres when: complex queries, OLTP, relational integrity needed, team is SQL-fluent, scale is not yet proven to exceed Postgres's ceiling
5. Caching Strategies
Caching is one of the highest-leverage performance interventions available, but it introduces a consistency problem that is frequently underestimated. Every caching layer is a replica that may lag behind the source of truth. The complexity is not in reading from the cache — it is in deciding when the cache is stale.
Core Cache Patterns
| Pattern | How it works | Best for | Cache consistency |
|---|---|---|---|
| Cache-aside (lazy) | App checks cache; on miss, reads DB, populates cache, returns | Read-heavy, tolerates brief staleness | Cache may be stale until TTL expires |
| Read-through | Cache library handles miss; fetches from DB transparently | Same as cache-aside but with cleaner app code | Same as cache-aside |
| Write-through | Every write goes to cache AND DB synchronously | Write + read on same data, consistency critical | Cache always current; write latency doubled |
| Write-behind (write-back) | Write to cache, async flush to DB | Write-heavy, eventual persistence acceptable | Risk of data loss if cache crashes before flush |
| Refresh-ahead | Cache proactively refreshes popular entries before TTL expires | Known hot keys (top products, trending content) | Near-current; complex to implement |
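Cache-aside, the most common of these patterns, fits in a few lines. This is a sketch: the CacheAside class and the dict-backed "database" are invented for illustration; in production the store would be Redis and the db a real query.

```python
import time

class CacheAside:
    """Cache-aside: the app owns the read path; the DB remains the source of truth."""
    def __init__(self, db: dict, ttl_seconds: int = 60):
        self.db = db            # stand-in for the real database
        self.ttl = ttl_seconds
        self.store = {}         # key -> (value, expires_at)
        self.db_reads = 0       # instrumentation for the example

    def get(self, key):
        entry = self.store.get(key)
        if entry and entry[1] > time.time():
            return entry[0]                                   # cache hit
        value = self.db[key]                                  # miss: read the DB...
        self.db_reads += 1
        self.store[key] = (value, time.time() + self.ttl)     # ...and populate the cache
        return value

cache = CacheAside(db={"user:1": {"name": "Ada"}})
cache.get("user:1")   # miss: goes to the DB
cache.get("user:1")   # hit: served from cache within the TTL
```

The staleness window is explicit here: until the TTL expires, a DB update is invisible to readers.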
Cache Invalidation — The Hard Problem
Phil Karlton's observation that "there are only two hard things in computer science: cache invalidation and naming things" understates the problem in distributed systems. Options for invalidation:
- TTL-based expiry — simplest approach; tolerate staleness up to TTL. Works for most cases. Set TTL based on acceptable lag, not system capability.
- Event-driven invalidation — publish "entity updated" events, consumers invalidate their caches. Requires event infrastructure; risk of invalidation storms
- Write-through with versioning — include version/etag in cache key; clients use latest version. Solves staleness but increases key complexity
- Database-triggered invalidation — use CDC (Change Data Capture) to watch DB writes and invalidate downstream caches automatically
Redis vs. Memcached
| Dimension | Redis | Memcached |
|---|---|---|
| Data structures | Strings, hashes, lists, sets, sorted sets, streams, geospatial, HyperLogLog | Strings only |
| Persistence | RDB snapshots + AOF write-ahead log | None — purely ephemeral |
| Replication | Built-in primary/replica + Redis Cluster for sharding | Manual client-side sharding only |
| Lua scripting | Yes — atomic multi-step operations | No |
| Memory efficiency | Slightly higher overhead per key | More memory-efficient for pure string caching |
| Use for | Almost everything — rate limiting, session store, distributed lock, leaderboard, pub/sub, job queue | Pure string cache where maximum memory efficiency matters and you need multi-threaded scaling across many CPUs |
Thundering Herd and Cache Stampede
When a popular cache entry expires simultaneously, thousands of requests may all miss the cache and hit the database at once. Mitigations:
- Jitter on TTL — randomize TTL slightly (TTL = base + random(0, base * 0.1)) to prevent synchronized expiry
- Mutex/lock on cache miss — only one request rebuilds; others wait. Reduces DB load but adds latency for waiting requests
- Probabilistic early expiration — proactively refresh before TTL expires with probability proportional to proximity to expiry (XFetch algorithm)
- Stale-while-revalidate — serve stale data immediately, trigger async refresh in background
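Two of these mitigations, TTL jitter and a rebuild mutex ("single flight"), can be sketched together. The helper names are invented; a production version would use a distributed lock (e.g. Redis SET NX) instead of threading.Lock, since the herd spans many processes.

```python
import random
import threading

BASE_TTL = 300

def jittered_ttl(base: int = BASE_TTL) -> float:
    """Randomize TTL by up to 10% so hot keys don't all expire in lockstep."""
    return base + random.uniform(0, base * 0.1)

cache = {}
db_hits = 0
_locks = {}                       # key -> Lock; distributed lock in production
_locks_guard = threading.Lock()

def get_single_flight(key, loader):
    """On a miss, exactly one caller rebuilds the entry; the rest reuse its result."""
    global db_hits
    if key in cache:
        return cache[key]
    with _locks_guard:
        lock = _locks.setdefault(key, threading.Lock())
    with lock:
        if key in cache:          # double-check: another thread may have rebuilt it
            return cache[key]
        db_hits += 1              # simulated expensive DB read
        cache[key] = loader()
        return cache[key]

# Simulate a herd: 10 concurrent requests for the same expired key
threads = [threading.Thread(target=get_single_flight,
                            args=("top-product", lambda: "product-data"))
           for _ in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Without the lock, all ten threads would have hit the database; with it, nine wait briefly and reuse one rebuild.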
6. Data Pipeline Architecture
Batch vs. Streaming vs. Micro-batch
| Approach | Latency | Throughput | Complexity | Use When |
|---|---|---|---|---|
| Batch | Minutes to hours | Highest (amortize overhead) | Low | Daily reporting, ML training, data warehouse loads, non-time-sensitive aggregations |
| Micro-batch (Spark Streaming) | Seconds to minutes | High | Medium | Near-real-time dashboards, fraud detection where seconds are acceptable |
| Streaming (Flink, Kafka Streams) | Milliseconds | Lower per event | High | Real-time fraud, live recommendations, alerting on metrics, order processing pipelines |
Lambda vs. Kappa Architecture
# Lambda Architecture (two processing paths)
#
# Raw Events ──┬──▶ Batch Layer (Spark/Hadoop) ──▶ Batch Views
# │ │
# └──▶ Speed Layer (Flink/Storm) ──▶ RT Views ──▶ Serving Layer
# │ (merges views)
#
# Problem: two codebases doing same logic; hard to keep in sync
# Kappa Architecture (stream-only)
#
# Raw Events ──▶ Event Log (Kafka) ──▶ Stream Processor ──▶ Serving Store
# │ (Flink / Spark)
# └── replay for backfill / recompute
#
# Benefit: one codebase, replay for correctness
# Cost: streaming semantics are harder; exactly-once is complex
ELT vs. ETL in the Modern Stack
The shift from ETL (Extract, Transform, Load) to ELT (Extract, Load, Transform) was enabled by cheap columnar storage and SQL-centric warehouses like Snowflake and BigQuery. The transformation now happens inside the warehouse using SQL (dbt) rather than in an external compute layer.
- ETL — transform before loading. Required when data is too large to load raw, or when PII must not enter the warehouse. Tool: Spark, Python scripts, custom pipelines.
- ELT — load raw, transform in-warehouse with SQL. Faster to operationalize, cheaper at modern cloud scale, transformations version-controlled with dbt. The modern default.
Medallion Architecture (Bronze / Silver / Gold)
# Medallion / Lakehouse layering
#
# Bronze (Raw) — exactly as received from source; append-only
# format: JSON/Avro, partitioned by ingestion date
# retention: long (years) — source of truth for replay
#
# Silver (Curated) — cleaned, deduplicated, validated, typed
# format: Parquet with schema
# joins: reference data merged in; nulls handled
#
# Gold (Aggregated) — business-level aggregates, ready for BI/ML
# format: wide denormalized tables
#                        audience: analysts, dashboards, feature stores
When to Use Spark vs. Flink vs. Simple SQL
| Tool | Use For | Avoid When |
|---|---|---|
| Simple SQL / dbt | Batch transformations in a warehouse, business logic expressed cleanly in SQL, ELT pipelines | Complex stateful logic, streaming, data outside the warehouse |
| Spark | Large-scale batch processing, ML training data prep, joining massive datasets, Python/Scala ecosystem | Low-latency streaming (<1s), simple transformations that SQL handles |
| Flink | True streaming with event time semantics, stateful stream processing, exactly-once guarantees, low-latency | Batch-only workloads, teams without streaming expertise |
| Kafka Streams / ksqlDB | Simple stream transformations, filtering, enrichment within Kafka ecosystem | Complex stateful joins across multiple streams, ML inference |
7. Storage & Object Store Patterns
S3 as the Universal Storage Layer
S3 (and compatible APIs: GCS, Azure Blob, MinIO) has become the de facto storage layer for modern data systems. Its durability (11 nines), low cost (a few cents per GB-month in the standard tier, far less in colder tiers), and ubiquitous support make it the right default for any data that is accessed less than constantly and needs to outlive the compute that created it.
- Data lakes — raw data in S3; query engines (Athena, Trino, Spark) read directly. No need to load into a database first.
- ML artifacts — model weights, training checkpoints, feature datasets. Never store in a relational database.
- Log archives — ship logs to S3 for long-term retention; query with Athena when needed rather than keeping in hot Elasticsearch.
- Static assets — images, videos, PDFs. S3 + CloudFront is the standard CDN pattern.
Data Lake vs. Data Warehouse vs. Lakehouse
| Concept | Storage | Query | Consistency | Cost |
|---|---|---|---|---|
| Data Lake | S3/GCS, raw files (Parquet, Avro, JSON) | Athena, Trino, Spark — on-demand | No ACID; manual schema management | Storage cheap; query compute charged per scan |
| Data Warehouse | Proprietary columnar (Snowflake, BigQuery managed storage) | SQL, fast indexing, metadata caching | ACID within warehouse | Storage + compute bundled or separate |
| Lakehouse | S3/GCS + Delta Lake / Apache Iceberg / Apache Hudi table format | Spark, Trino, Flink — open format | ACID via transaction log on S3 | Storage at S3 rates + separate compute |
File Format Selection
| Format | Orientation | Schema | Use For | Avoid For |
|---|---|---|---|---|
| Parquet | Columnar | Embedded (self-describing) | Analytics, ML training, data warehouse ingestion, default choice for structured data | Row-at-a-time access (point lookups), streaming writes |
| ORC | Columnar | Embedded | Hive/Spark workloads where ORC's bloom filters provide better predicate pushdown | Non-Hadoop ecosystems; Parquet is more widely supported |
| Avro | Row | External (Schema Registry) | Kafka serialization, row-at-a-time streaming writes, schema evolution with registry | Analytical queries (columnar is far faster) |
| JSON / JSON Lines | Row | None (schemaless) | Bronze/raw layer in lake, API payloads, debugging | Any analytical workload (50-100x slower than Parquet at scale) |
| CSV | Row | None | Human inspection, one-time exports, small files | Production data pipelines (no types, fragile to escaping, slow) |
Hot / Warm / Cold Tiering
Storage tiering dramatically reduces cost by moving data to cheaper storage as it ages. S3 supports tiering natively with lifecycle policies:
# S3 Lifecycle policy example
Rules:
- ID: "tier-old-data"
Status: Enabled
Transitions:
# Hot: S3 Standard (immediate access, ~$0.023/GB/month)
- Days: 30
StorageClass: STANDARD_IA # Warm: infrequent access (~$0.0125/GB/month)
- Days: 90
StorageClass: GLACIER_IR # Cold: instant retrieval (~$0.004/GB/month)
- Days: 365
StorageClass: DEEP_ARCHIVE # Archive: ~$0.00099/GB/month, 12hr retrieval
8. Search & Indexing
Search Engine Comparison
| Engine | Strengths | Weaknesses | Best For |
|---|---|---|---|
| Elasticsearch / OpenSearch | Mature, full-featured, distributed, excellent aggregations, observability ecosystem (ELK/EFK) | Operationally complex, memory-hungry, eventual consistency, cluster management is non-trivial | Log analytics, application search with complex facets, enterprise search at scale |
| Typesense | Typo-tolerant, fast, easy to operate (single binary), instant search UX | Smaller feature set, no native distributed mode (Typesense Cloud for HA) | E-commerce product search, SaaS app search, developer-friendly deployments |
| Meilisearch | Typo-tolerant, zero-config, instant results, great developer UX | Not designed for >100M documents, limited aggregations | Small to medium datasets, developer tools, documentation search |
| Postgres Full-Text | Already in your stack, GIN indexes, tsvector/tsquery, no extra service | No typo tolerance, weaker relevance ranking, no semantic search | Internal search where basic FTS is sufficient, avoiding extra infrastructure |
| Algolia | Fastest time-to-market, excellent hosted experience, SDKs for every platform | Expensive at scale, vendor lock-in, limited customization | Early-stage products where search UX matters but engineering capacity is limited |
Full-Text vs. Vector Search
Traditional full-text search relies on an inverted index: a mapping from terms to document IDs. It is fast, exact, and deterministic. Vector (semantic) search embeds documents and queries into a high-dimensional space and retrieves nearest neighbors. It understands synonyms and paraphrases but is slower and less interpretable.
| Dimension | Full-Text (BM25) | Vector (Semantic) |
|---|---|---|
| Precision on exact terms | Excellent — matches exact keywords | Weaker — may return thematically related but term-distant results |
| Synonym handling | Requires manual synonym dictionaries | Handles synonyms automatically via embedding |
| Query speed | Milliseconds — inverted index lookup | Tens to hundreds of ms — ANN scan over vectors |
| Infrastructure | Elasticsearch, Postgres FTS | pgvector, Pinecone, Weaviate, Qdrant, Milvus |
| Best for | Code search, legal documents, exact product SKUs | Q&A, RAG (retrieval-augmented generation), semantic similarity |
Inverted Index Internals (Why It Matters)
Understanding the inverted index explains Elasticsearch's behavior in production. When you index a document, the analyzer chain runs: tokenization → lowercasing → stop word removal → stemming → tokens stored in the inverted index with position and frequency metadata.
- Term frequency (TF) — how often a term appears in a document. Higher = more relevant for that term.
- Inverse document frequency (IDF) — how rare a term is across all documents. Rare terms are more discriminative.
- BM25 — the modern scoring formula used by Elasticsearch. Improves on TF-IDF with saturation (term frequency stops growing in relevance beyond a point) and document length normalization.
- Mapping explosion — the #1 operational issue in Elasticsearch. If you index arbitrary JSON keys as fields, cardinality explodes. Always use explicit mappings in production; never let ES auto-create mappings on dynamic nested objects.
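BM25's saturation and length normalization are easier to see in code. This is a simplified scoring function; the exact constants and IDF variant differ slightly across engines, and the corpus here is invented for illustration.

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_tokens, corpus, k1=1.2, b=0.75):
    """Score one document for a query under a simplified BM25."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N         # average document length
    tf = Counter(doc_tokens)
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)    # documents containing the term
        idf = math.log(1 + (N - df + 0.5) / (df + 0.5))   # rare terms weigh more
        f = tf[term]
        # Saturation: the f / (f + k1 * ...) shape flattens as term frequency grows;
        # b controls how strongly long documents are penalized
        score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc_tokens) / avgdl))
    return score

corpus = [
    ["cache", "invalidation"],
    ["naming", "things"],
    ["cache", "cache", "stampede"],
]
scores = [bm25_score(["cache"], doc, corpus) for doc in corpus]
```

Note how the second occurrence of "cache" in the third document raises its score, but by less than double — that is the saturation term at work.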
9. API Design Decisions
REST Maturity Model (Richardson)
Most "REST" APIs are Level 2. True Level 3 (HATEOAS) is rarely implemented in practice. Know the model to know where to invest.
| Level | Characteristics | Example |
|---|---|---|
| Level 0 — POX | HTTP as transport; single endpoint, all via POST | POST /api {"action": "getUser", "id": 1} |
| Level 1 — Resources | Multiple URIs, one per resource concept | POST /users/1 for everything about user 1 |
| Level 2 — HTTP Verbs | Uses GET/POST/PUT/DELETE correctly, HTTP status codes | GET /users/1, DELETE /users/1 — industry default |
| Level 3 — HATEOAS | Responses include links to related actions (self-describing API) | Response body contains {"links": {"delete": "/users/1"}} |
API Versioning Strategies
Versioning is about managing change for clients you do not control. For internal APIs, you can break the contract and coordinate across teams. For external/public APIs, you must version.
- URL path versioning (/v1/users, /v2/users) — most common, explicit, easy to route. Downside: duplication of routes.
- Header versioning (API-Version: 2024-01-01) — cleaner URLs; Stripe uses date-based versioning. Harder to test with a browser.
- Query parameter (?version=2) — simple but pollutes query params; acceptable for internal tools.
- No versioning (additive only) — only make backward-compatible changes: add fields, don't remove or rename. Feasible for JSON APIs where clients ignore unknown fields.
Stripe's model is worth studying: every API version is a date (e.g. 2024-06-20). When you create an API key, it is pinned to the current API version. Your integration never breaks unless you explicitly upgrade. Stripe maintains backward compatibility for every historical version in production — an enormous investment, but it is why developers trust Stripe's API. For most teams: URL path versioning is sufficient. Start with /v1/ even if you're not sure you'll need v2.
Pagination Patterns
| Pattern | How It Works | Pros | Cons |
|---|---|---|---|
| Offset / Limit | ?offset=20&limit=10 | Simple, SQL-native, supports jumping to page N | Skips/duplicates on concurrent inserts; slow at large offsets (DB must scan and discard) |
| Cursor-based | ?after=eyJpZCI6MTAwfQ== (opaque encoded cursor) | Stable pagination under concurrent writes; O(1) page fetch | Cannot jump to arbitrary page; cursor encodes position (opaque to client) |
| Keyset pagination | ?after_id=100&after_created=2024-01-01 | Performant with index on sort key; stable | Multi-column sort keys add complexity |
| Page token (Google style) | ?pageToken=<server-generated> | Server-side encoding hides internals; supports cursor | Cannot bookmark specific page; tokens may expire |
A note on offset performance: OFFSET 1000000 LIMIT 20 requires the database to scan and discard 1 million rows before returning 20. Keyset and cursor pagination avoid that scan entirely.
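Keyset pagination's O(1) page fetch can be demonstrated against SQLite. The table, column names, and page size here are invented for the sketch; the pattern is the WHERE id > ? seek in place of OFFSET.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO users (id, name) VALUES (?, ?)",
    [(i, f"user{i}") for i in range(1, 101)],
)

def page_after(after_id: int, limit: int = 10):
    """Keyset pagination: the index seeks straight past after_id; nothing is discarded."""
    rows = conn.execute(
        "SELECT id, name FROM users WHERE id > ? ORDER BY id LIMIT ?",
        (after_id, limit),
    ).fetchall()
    next_cursor = rows[-1][0] if rows else None   # last id on the page becomes the cursor
    return rows, next_cursor

page1, cursor = page_after(0)     # first page: ids 1..10
page2, _ = page_after(cursor)     # next page starts at id 11
```

The returned cursor is exactly what an API would Base64-encode into an opaque ?after= token.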
Idempotency and Retry Safety
Any API that triggers a state change should support idempotency keys. The client generates a unique key per logical operation; the server deduplicates using that key. This enables safe retries in the presence of network failures — critical for payment APIs, order creation, and any operation with real-world side effects.
# Idempotent POST: client generates idempotency key
POST /v1/payments
Idempotency-Key: a4e9-4b27-8f31-...
# Server behavior:
# 1. Check if key seen before → return cached response
# 2. If not → process, store result against key
# 3. Return result
# Idempotency window: typically 24 hours
# Storage: Redis with TTL or a database table indexed by key
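The server behavior described above can be sketched in memory. The function name, response shape, and counter are invented for the example; production storage would be Redis with a TTL or a keyed database table.

```python
processed = {}     # idempotency key -> cached response (Redis with TTL in production)
charge_count = 0   # stands in for the real side effect (charging a card)

def create_payment(idempotency_key: str, amount: int) -> dict:
    """Replays return the stored response; the side effect runs at most once."""
    global charge_count
    if idempotency_key in processed:
        return processed[idempotency_key]         # key seen before: replay cached result
    charge_count += 1                             # side effect happens exactly once
    response = {"status": "charged", "amount": amount}
    processed[idempotency_key] = response
    return response

first = create_payment("a4e9-4b27", 100)
retry = create_payment("a4e9-4b27", 100)   # client retry after a network timeout
```

The retry is indistinguishable from the first call to the client, yet the card was charged only once.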
Rate Limiting Strategies
- Token bucket — tokens refill at a constant rate; allows bursting up to bucket capacity. Redis + Lua script for atomic check-and-decrement. Most common for APIs.
- Fixed window counter — count requests per minute per client. Simple; susceptible to burst at window boundary (100 requests at :59 + 100 at :01 = 200 in 2 seconds).
- Sliding window log — exact but memory-expensive (store every request timestamp).
- Sliding window counter — approximate sliding window using two fixed windows. Accurate enough for most use cases.
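A token bucket is only a few lines of state. This sketch uses invented names and an in-process counter; the Redis + Lua variant mentioned above applies the same arithmetic atomically across servers.

```python
import time

class TokenBucket:
    """Token bucket: tokens refill at a steady rate; bursts allowed up to capacity."""
    def __init__(self, rate_per_sec: float, capacity: int):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill in proportion to elapsed time, never beyond capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(rate_per_sec=1, capacity=5)
burst = [bucket.allow() for _ in range(6)]   # a burst of 6: capacity admits only 5
```

The capacity parameter is the burst allowance; the rate is the sustained limit — the two knobs the fixed-window approaches conflate.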
10. Authentication & Authorization at Scale
OAuth2 / OIDC Flow Selection
| Flow | Use For | Key Characteristic |
|---|---|---|
| Authorization Code + PKCE | Web apps, mobile apps, any user-facing client | Gold standard; short-lived auth code exchanged for tokens; PKCE prevents code interception |
| Client Credentials | Service-to-service (machine-to-machine) | No user involved; service authenticates with client_id + secret directly |
| Device Code | Smart TVs, CLI tools, headless devices | User completes auth on separate device; polling for token |
| Implicit (deprecated) | Avoid — replaced by Auth Code + PKCE | Tokens in URL fragment; vulnerable to history/referrer leakage |
| Resource Owner Password | Avoid — legacy; breaks trust model | Client receives user's credentials; violates zero-trust |
JWT vs. Session Tokens
| Dimension | JWT (stateless) | Session Token (stateful) |
|---|---|---|
| Server state | None — token is self-contained | Session store (Redis/DB) required |
| Revocation | Hard — token is valid until expiry; needs blocklist | Easy — delete session from store |
| Payload size | Large — claims embedded (200-2000 bytes) | Small — opaque random token (~32 bytes) |
| Horizontal scaling | Stateless — any replica validates signature | Requires shared session store (Redis) |
| Clock skew sensitivity | High — exp check requires synchronized clocks | None — server controls expiry |
| Recommendation | Use JWTs for service-to-service, short-lived tokens (<15 min). Use session tokens for user sessions where revocation matters. Never store sensitive data in JWT payload (it's only Base64 encoded, not encrypted unless you use JWE). | |
RBAC vs. ABAC vs. ReBAC
| Model | Policy basis | Example | Use When |
|---|---|---|---|
| RBAC (Role-Based) | User's roles | User has "admin" role → can do everything | Simple, coarse-grained permissions; most B2B SaaS |
| ABAC (Attribute-Based) | User + resource + environment attributes | User in "engineering" department can access resources tagged "internal" before 5pm | Complex policies with context (time, location, resource attributes) |
| ReBAC (Relationship-Based) | Graph of relationships between entities | User can edit document only if user is member of the document's owning team | Google Drive, GitHub permissions — hierarchical resource ownership |
ReBAC implementations follow Google's Zanzibar model: permissions are relationship tuples of (object, relation, user). "user:alice is a viewer of document:doc123" is one tuple; an authorization check is a graph traversal over tuples. Open-source implementations include OpenFGA (by Auth0/Okta) and SpiceDB (by Authzed). This model scales to billions of objects and handles complex inheritance rules, but it is a significant operational investment. Use RBAC until it hurts.
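A toy illustration of tuple-based checks as a graph traversal, using a hypothetical in-memory tuple set rather than the OpenFGA or SpiceDB APIs:

```python
# Relationship tuples: (object, relation, subject). A subject may itself be an
# (object, relation) "userset" — that indirection is what makes authorization
# a graph traversal rather than a single lookup.
TUPLES = {
    ("document:doc123", "viewer", ("team:eng", "member")),  # any eng member can view
    ("team:eng", "member", "user:alice"),
}

def check(obj, relation, user, tuples=TUPLES):
    for (o, r, subject) in tuples:
        if (o, r) != (obj, relation):
            continue
        if subject == user:
            return True                       # direct relationship
        if isinstance(subject, tuple):        # userset: recurse into referenced relation
            if check(subject[0], subject[1], user, tuples):
                return True
    return False

assert check("document:doc123", "viewer", "user:alice")       # via team membership
assert not check("document:doc123", "viewer", "user:mallory")
```

Real systems add caching, cycle detection, and consistency tokens on top of this traversal; the sketch shows only the core idea.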
Service-to-Service Authentication
- mTLS (mutual TLS) — both client and server present certificates. Cryptographic identity, no shared secrets. Service meshes (Istio, Linkerd) handle certificate rotation automatically. The gold standard for zero-trust networks.
- JWT with shared signing key or public key verification — service A signs a JWT; service B verifies with known public key. Simpler than mTLS, but key rotation requires coordination.
- API keys — symmetric secret in the Authorization header. Simple; poor rotation story; no standard revocation mechanism. Acceptable for internal services if secrets management is solved.
- SPIFFE/SPIRE — standards-based identity framework for workloads. Issues SVIDs (X.509 or JWT) to workloads; integrates with Kubernetes service accounts.
11. Scaling Patterns
Vertical vs. Horizontal Scaling
| Dimension | Vertical (Scale Up) | Horizontal (Scale Out) |
|---|---|---|
| Mechanism | Bigger machine (more CPU/RAM) | More machines of same size |
| Simplicity | Simple — no code changes needed | Complex — requires stateless design or distributed state management |
| Cost curve | Linear to exponential (largest instances have poor price/performance) | Linear — commodity machines |
| Failure domain | Single point of failure | Partial failure possible; one node failing doesn't kill all capacity |
| Best for | Databases (Postgres scales surprisingly far vertically), early-stage systems | Stateless app servers, caches, workers that can run on any node |
Database Sharding Strategies
Sharding is the last resort, not the first response, to database scaling. Before sharding, exhaust: read replicas, caching, query optimization, vertical scaling, table partitioning.
- Range sharding — shard by range of values (e.g., user_id 1-1M on shard 1). Simple; hotspots likely if access is skewed (e.g., new users concentrate on last shard).
- Hash sharding — hash the shard key; distribute evenly. No hotspots; but range queries require hitting all shards.
- Directory-based sharding — a lookup table maps entity ID to shard. Flexible; but the lookup table is a bottleneck and SPOF.
- Consistent hashing — distribute data across a ring; adding/removing nodes redistributes only a fraction of keys. Used in Cassandra, DynamoDB.
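A sketch of consistent hashing with virtual nodes (the class name and vnode count are illustrative, not any particular library's API); the point is that adding a node remaps only a fraction of keys:

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Hash ring with virtual nodes; vnodes smooth out per-node load imbalance."""

    def __init__(self, nodes, vnodes=100):
        self.ring = []  # sorted list of (position, node)
        for node in nodes:
            for i in range(vnodes):
                self.ring.append((self._hash(f"{node}#{i}"), node))
        self.ring.sort()

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def lookup(self, key):
        # route to the first vnode clockwise from the key's position (wrapping)
        idx = bisect.bisect(self.ring, (self._hash(key), "")) % len(self.ring)
        return self.ring[idx][1]

ring = ConsistentHashRing(["shard-a", "shard-b", "shard-c"])
before = {k: ring.lookup(k) for k in (f"user:{i}" for i in range(1000))}
bigger = ConsistentHashRing(["shard-a", "shard-b", "shard-c", "shard-d"])
moved = sum(1 for k, v in before.items() if bigger.lookup(k) != v)
assert 0 < moved < 500   # roughly 1/4 of keys move, not all of them
```

With naive hash sharding (`hash(key) % n_shards`), going from 3 to 4 shards would remap roughly 3 out of every 4 keys; here only about a quarter move.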
Connection Pooling — The Real Bottleneck
Database connection limits are frequently the actual scaling bottleneck, hit well before raw query throughput. Postgres handles roughly 100-300 concurrent connections before performance degrades (each connection is a separate forked OS process). At 1,000 application servers with 10 connections each, you exceed Postgres's limit immediately.
# Connection math — common failure mode
#
# 200 app pods × 10 connections each = 2,000 connections to Postgres
# Postgres max_connections default = 100 → immediate failure
# Even with max_connections=1000 → severe performance degradation
#
# Solution: PgBouncer (connection pooler) in transaction mode
# 200 app pods → PgBouncer (20 server connections) → Postgres
#
# PgBouncer transaction mode:
# - Client connection is a logical connection only
# - Physical DB connection held only during active transaction
# - 2,000 client connections → 20 actual Postgres connections
Load Balancing Algorithms
| Algorithm | How It Works | Best For |
|---|---|---|
| Round Robin | Requests distributed sequentially to each server | Homogeneous servers, similar request cost |
| Least Connections | Route to server with fewest active connections | Long-lived connections (WebSockets, gRPC streams) |
| Least Response Time | Route to server with lowest latency + fewest connections | Heterogeneous servers, latency-sensitive workloads |
| IP Hash / Sticky Sessions | Hash client IP to consistently route to same server | Stateful servers (avoid with stateless design) |
| Random with two choices | Pick two servers randomly; send to the less loaded | Proven near-optimal distribution with minimal coordination cost |
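The "random with two choices" row is easy to demonstrate in simulation. This sketch is illustrative and assumes uniform request cost; it compares the load spread against purely random assignment:

```python
import random

def pick_server(servers, loads, rng=random):
    """Power of two choices: sample two servers, route to the less loaded one."""
    a, b = rng.sample(range(len(servers)), 2)
    return servers[a] if loads[a] <= loads[b] else servers[b]

rng = random.Random(42)
n = 10
servers = list(range(n))
loads_two, loads_rand = [0] * n, [0] * n
for _ in range(10_000):
    loads_two[pick_server(servers, loads_two, rng)] += 1   # two-choice policy
    loads_rand[rng.randrange(n)] += 1                      # purely random policy
spread_two = max(loads_two) - min(loads_two)
spread_rand = max(loads_rand) - min(loads_rand)
assert spread_two < spread_rand   # two choices flattens the load distribution
```

The one extra comparison per request buys an exponentially better maximum-load bound than random assignment, which is why the technique shows up in load balancers and cache-shard routing.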
12. Reliability & Resilience
Circuit Breakers, Retries, and Timeouts
Reliability patterns are not optional in distributed systems — they are the difference between isolated failures and cascading outages. These patterns must be implemented together; any one alone is insufficient.
| Pattern | Purpose | Without It |
|---|---|---|
| Timeout | Bound the maximum time waiting for a response | Threads block indefinitely; thread pool exhaustion cascades into full service outage |
| Retry with backoff | Recover from transient failures automatically | Transient network errors cause permanent failures; users see errors that would have resolved in 100ms |
| Circuit breaker | Stop calling a failing service to allow it to recover | Avalanche of retries overwhelms a struggling service, preventing recovery (retry storm) |
| Bulkhead | Isolate failure domains with separate thread pools / resource pools | One slow dependency exhausts all threads, taking down unrelated features |
# Circuit breaker state machine
#
# CLOSED (normal) ──[failure threshold exceeded]──► OPEN (rejecting all calls)
# │
# [after timeout]
# ▼
# HALF-OPEN (probe calls)
# / \
# [success] [failure]
# ▼ ▼
# CLOSED OPEN
#
# Configuration example:
# failure_threshold = 50% # open after 50% of calls fail in window
# window_size = 10 calls # evaluate over last N calls
# recovery_timeout = 30s # wait before trying HALF-OPEN
# probe_success_count = 3 # N successes to close circuit
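The state machine above as a minimal in-process sketch. This uses a simple count-based window; production libraries (Resilience4j, Polly, and similar) add time-based windows, half-open call limits, and metrics. Names and the injectable clock are illustrative:

```python
import time

class CircuitBreaker:
    """CLOSED -> OPEN on failure-rate threshold; OPEN -> HALF-OPEN after timeout;
    HALF-OPEN -> CLOSED after N successful probes (or back to OPEN on failure)."""

    def __init__(self, failure_threshold=0.5, window_size=10,
                 recovery_timeout=30.0, probe_success_count=3, now=time.monotonic):
        self.failure_threshold = failure_threshold
        self.window_size = window_size
        self.recovery_timeout = recovery_timeout
        self.probe_success_count = probe_success_count
        self.now = now
        self.state = "CLOSED"
        self.results = []          # rolling window of outcomes (True = success)
        self.opened_at = 0.0
        self.probe_successes = 0

    def call(self, fn):
        if self.state == "OPEN":
            if self.now() - self.opened_at < self.recovery_timeout:
                raise RuntimeError("circuit open")    # fail fast, no call made
            self.state, self.probe_successes = "HALF-OPEN", 0
        try:
            result = fn()
        except Exception:
            self._record(False)
            raise
        self._record(True)
        return result

    def _record(self, ok):
        if self.state == "HALF-OPEN":
            if not ok:
                self.state, self.opened_at = "OPEN", self.now()
            else:
                self.probe_successes += 1
                if self.probe_successes >= self.probe_success_count:
                    self.state, self.results = "CLOSED", []
            return
        self.results = (self.results + [ok])[-self.window_size:]
        if (len(self.results) >= self.window_size and
                self.results.count(False) / len(self.results) >= self.failure_threshold):
            self.state, self.opened_at = "OPEN", self.now()

clock = [0.0]
cb = CircuitBreaker(now=lambda: clock[0])
def failing():
    raise ValueError("downstream 500")
for _ in range(10):                 # 10 straight failures trip the breaker
    try:
        cb.call(failing)
    except ValueError:
        pass
assert cb.state == "OPEN"
clock[0] = 31.0                     # recovery timeout elapses -> probes allowed
for _ in range(3):                  # 3 successful probes close the circuit
    cb.call(lambda: "ok")
assert cb.state == "CLOSED"
```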
Retry Strategy Details
- Exponential backoff — wait time doubles on each retry: 1s, 2s, 4s, 8s... Reduces retry storms.
- Jitter — add randomness to backoff to prevent synchronized retry spikes across many clients: wait = base * 2^attempt + random(0, base)
- Retry budget — limit the percentage of total requests that can be retries to prevent amplifying load during an outage. Google SRE recommends no more than 10% of requests should be retries.
- Idempotency required — only retry idempotent operations automatically. Never auto-retry a non-idempotent POST without an idempotency key.
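Putting the rules above together in a sketch: exponential backoff per the formula above, plus jitter, a cap, and bounded attempts. Function names are illustrative; a real implementation would also honor a retry budget and check idempotency.

```python
import random
import time

def retry(fn, max_attempts=5, base=1.0, cap=30.0, sleep=time.sleep, rng=random):
    """Retry with wait = base * 2^attempt + random(0, base), capped at `cap`.
    Only safe for idempotent operations."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise                          # attempts exhausted: surface the error
            sleep(min(cap, base * (2 ** attempt) + rng.uniform(0, base)))

# Fails twice, then succeeds; two backoff sleeps are recorded.
sleeps = []
state = {"calls": 0}
def flaky():
    state["calls"] += 1
    if state["calls"] < 3:
        raise ConnectionError("transient")
    return "ok"

assert retry(flaky, sleep=sleeps.append) == "ok"
assert state["calls"] == 3 and len(sleeps) == 2
```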
SLOs as Architecture Contracts
An SLO (Service Level Objective) is not a monitoring concern — it is an architectural contract that shapes design decisions. When you define "99.9% of requests complete in <200ms," you have made concrete decisions about caching, database indexing, async processing, and circuit breaker thresholds.
- Error budget — at 99.9% availability, you have 8.7 hours/year of allowed downtime. That budget determines how aggressively you can deploy, how much technical debt you can carry, and what on-call response time is acceptable.
- Alerting on SLO burn rate — alert when your error budget is burning faster than expected, not on raw error rate. A 5% burn rate matters differently on Monday vs. Friday before a holiday.
- Multi-window alerting (Google's recommendation) — alert on 1-hour window (fast burn detection) AND 6-hour window (slow burn detection) simultaneously to reduce false positives.
Multi-Region: Active-Active vs. Active-Passive
| Topology | How It Works | RPO/RTO | Cost | Complexity |
|---|---|---|---|---|
| Active-Passive | One region serves traffic; passive region is hot standby. Failover promotes passive to active. | RPO: seconds to minutes (replication lag). RTO: minutes (DNS propagation, warm-up) | Medium — passive region is mostly idle | Lower — no conflict resolution needed |
| Active-Active | Both regions serve traffic; writes may occur in both simultaneously. | RPO: near-zero (synchronous replication) or seconds (async). RTO: near-zero (traffic already flowing) | High — full capacity in both regions | Very high — requires conflict resolution, distributed transactions, or eventual consistency design |
13. Data Engineering at Scale
Schema Registry and Schema Evolution
In event-driven systems, schema management is a first-class concern. A message schema changes; consumers break. A schema registry centralizes schema versions and enforces compatibility rules before a producer can publish a new schema.
- Backward compatibility — new schema can read old data (new consumer, old messages). Add fields with defaults; never delete required fields.
- Forward compatibility — old schema can read new data (old consumer, new messages). Add only optional fields; consumers ignore unknown fields.
- Full compatibility — both directions. Most restrictive; safest for long-lived topics.
// Avro schema evolution example — backward compatible change
// Old schema
{ "type": "record", "name": "User",
"fields": [
{"name": "id", "type": "string"},
{"name": "email", "type": "string"}
]
}
// New schema — adds optional field with default (backward compatible)
{ "type": "record", "name": "User",
"fields": [
{"name": "id", "type": "string"},
{"name": "email", "type": "string"},
{"name": "phone", "type": ["null", "string"], "default": null}
]
}
Change Data Capture (CDC)
CDC is the pattern of capturing every row-level change in a database (insert, update, delete) and streaming those changes to downstream systems. This enables real-time data pipelines without expensive polling queries, and is the foundation for the "database as event log" architecture.
- Log-based CDC (Debezium, Maxwell) — reads database write-ahead log (WAL in Postgres, binlog in MySQL). Zero impact on source database performance. Captures all changes including deletes.
- Query-based CDC (Airbyte, Fivetran) — periodically polls for rows where updated_at > last_checkpoint. Simple; misses hard deletes; adds query load.
- Trigger-based CDC — database triggers write to a changelog table. High overhead; generally avoid in production.
Exactly-Once Semantics
At-least-once delivery (retries on failure) is easy. Exactly-once processing is hard because it requires coordination between the consumer, the processing logic, and the output store. The usual pragmatic approach is "at-least-once delivery + idempotent processing = effectively exactly-once."
| Delivery Guarantee | Risk | How to Achieve |
|---|---|---|
| At-most-once | Data loss possible — message consumed, crash before ack | Ack before processing (simplest, worst for data pipelines) |
| At-least-once | Duplicate processing — message processed, crash before ack → retry | Ack only after successful processing; idempotent consumers handle duplicates |
| Exactly-once | Neither loss nor duplicates | Kafka transactions + transactional producers/consumers; or idempotent upserts in sink |
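The "at-least-once delivery + idempotent processing" equation can be shown with a toy sink. This assumes messages carry a stable id and that the operation is an upsert rather than an increment (the message shape here is hypothetical):

```python
# Applying the same message twice yields the same state, so redelivery after
# a crash-before-ack is harmless — "effectively exactly-once".
def apply(sink, message):
    sink[message["id"]] = message["balance"]   # idempotent upsert, not an increment

sink = {}
batch = [{"id": "acct-1", "balance": 100}, {"id": "acct-2", "balance": 50}]
for m in batch:
    apply(sink, m)
for m in batch:            # broker redelivers the whole batch after a crash
    apply(sink, m)
assert sink == {"acct-1": 100, "acct-2": 50}   # no double-counting
```

Had `apply` done `sink[id] += amount` instead, the redelivery would have double-counted; that is the difference between an idempotent sink and one that needs transactional exactly-once machinery.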
Orchestration: Airflow vs. Dagster vs. Prefect
| Tool | Paradigm | Strengths | Weaknesses | Choose When |
|---|---|---|---|---|
| Airflow | DAG-based, Python operators, scheduler-centric | Mature, large ecosystem, battle-tested at scale, rich UI | Dynamic DAGs are painful, no native data-awareness, scaling scheduler is hard, slow iteration | Team has Airflow experience, large existing DAG library, GCP (Cloud Composer) |
| Dagster | Asset-centric, data-aware, software-defined assets | Strong type system, lineage tracking, testing-friendly, asset graph UI, built-in sensors | Steeper learning curve, smaller community than Airflow | New data platform build, want strong testing + lineage from day 1, dbt + Spark integration |
| Prefect | Flow-centric, Python-native, easy to start | Minimal boilerplate, hybrid execution model, fast onboarding | Asset-awareness less mature than Dagster | Team wants fastest time-to-value, mixed Python/cloud workloads |
Data Quality Frameworks
Data quality failures are production incidents. The pipeline runs successfully but produces wrong numbers. Prevention requires validation at every layer:
- Great Expectations / Soda Core — declarative expectations on data (row count, null rate, value distribution). Run at each pipeline stage; fail fast on violations.
- dbt tests — built-in: uniqueness, not-null, accepted values, referential integrity. Custom: arbitrary SQL assertions. Run in CI on every PR.
- Schema enforcement — use Schema Registry for streams; use Pydantic or Pandera for DataFrame validation in Python pipelines.
- Statistical monitoring — Monte Carlo, Bigeye — detect anomalies in row counts, null rates, and distribution shifts without writing explicit expectations. Useful when you can't enumerate all expected values.
14. ML System Design
Training vs. Serving Architecture
# ML system high-level architecture
#
# ┌────────────────────────────────────────────────────────────┐
# │ Offline (Training) │
# │ Feature Store ──▶ Training Pipeline ──▶ Model Registry │
# │ (batch features) (Spark/Python) (versioned) │
# └────────────────────────────────────────────────────────────┘
# │ │
# │ Feature pipeline populates │ Deploy
# ▼ ▼
# ┌────────────────────────────────────────────────────────────┐
# │ Online (Serving) │
# │ Request ──▶ Feature Store ──▶ Model Server ──▶ Response │
# │ (online features, (TF Serving, │
# │ low-latency) TorchServe) │
# └────────────────────────────────────────────────────────────┘
Feature Stores
A feature store solves the "train-serve skew" problem: training features are computed differently from serving features, leading to degraded model performance in production. The store provides a single definition for each feature that is used in both training (batch reads) and serving (low-latency online reads).
| Dimension | Offline Store | Online Store |
|---|---|---|
| Purpose | Historical feature retrieval for training | Low-latency feature retrieval for serving |
| Storage | Data warehouse (BigQuery, Snowflake, Parquet on S3) | Redis, DynamoDB, Bigtable, Cassandra |
| Latency target | Batch (minutes to hours) | < 10ms P99 |
| Operations | Point-in-time correct joins, training dataset generation | Lookup by entity key (user_id, item_id) |
| Examples | Feast, Tecton, Hopsworks, Vertex AI Feature Store, SageMaker Feature Store | |
Batch vs. Real-Time Inference
| Type | Latency | Use For | Example |
|---|---|---|---|
| Batch inference | Minutes to hours | Pre-compute predictions for all entities; results stored in a lookup table | Daily churn prediction for all users; nightly recommendation pre-computation |
| Near-real-time | Seconds | Scoring on recent events via streaming pipeline | Fraud scoring on completed transaction via Kafka + Flink |
| Online inference | Milliseconds | Score per request in the user's critical path | Search ranking, content recommendation, ad CTR prediction |
Model Monitoring and Drift Detection
A model that passes all offline tests can degrade silently in production. Production ML monitoring is a different discipline from application monitoring:
- Data drift — the distribution of input features shifts from the training distribution. Detect with KL divergence, Population Stability Index (PSI), or statistical tests on feature distributions.
- Concept drift — the relationship between features and labels changes (e.g., a fraud model trained pre-COVID on different behavior patterns). Detect by tracking label distribution and model performance metrics over time.
- Prediction drift — the model's output distribution changes even if inputs haven't. Leading indicator of problems before ground truth is available.
- Shadow deployment — new model serves shadow traffic (requests routed to it but responses discarded); compare predictions to production model offline before promoting.
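A minimal PSI sketch for data-drift detection, assuming quantile bins derived from the training sample; the 0.1 / 0.25 thresholds are common rules of thumb, not universal constants:

```python
import math
import random

def psi(expected, actual, bins=10):
    """Population Stability Index between a training sample and a serving sample.
    Rule of thumb: < 0.1 stable, 0.1-0.25 moderate shift, > 0.25 significant drift."""
    cuts = sorted(expected)
    edges = [cuts[int(len(cuts) * i / bins)] for i in range(1, bins)]  # quantile bin edges

    def bucket_fracs(sample):
        counts = [0] * bins
        for x in sample:
            counts[sum(1 for e in edges if x > e)] += 1
        return [max(c / len(sample), 1e-6) for c in counts]  # floor to avoid log(0)

    e, a = bucket_fracs(expected), bucket_fracs(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

rng = random.Random(0)
train = [rng.gauss(0, 1) for _ in range(5000)]
same = [rng.gauss(0, 1) for _ in range(5000)]
shifted = [rng.gauss(1.0, 1) for _ in range(5000)]   # mean shifted by one sigma
assert psi(train, same) < 0.1       # same distribution: stable
assert psi(train, shifted) > 0.25   # shifted distribution: flagged as drift
```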
15. Observability-Driven Design
The Three Pillars — and What They Actually Mean
| Pillar | What It Gives You | Key Tooling | Design Implication |
|---|---|---|---|
| Metrics | Numeric time-series aggregates — request rate, latency percentiles, error rate, saturation | Prometheus, Datadog, CloudWatch, Grafana | Instrument every service boundary with rate/error/duration (RED) or utilization/saturation/errors (USE) |
| Logs | Discrete events with context — what happened at a specific moment | ELK/EFK, Datadog Logs, CloudWatch Logs, Loki | Structure logs as JSON from day 1; include trace_id, user_id, request_id in every log line |
| Traces | End-to-end request path across services — latency breakdown by hop | Jaeger, Zipkin, Datadog APM, AWS X-Ray, OpenTelemetry | Instrument at service entry/exit with span propagation; traces become your architecture documentation |
Structured Logging from Day 1
Unstructured log lines (ERROR: failed to process order 12345 for user 67890) require regex parsing to query. Structured JSON logs are queryable immediately with any log platform. The ROI compounds: every grep replaced by a Kibana filter saves minutes per incident.
// Bad: unstructured string
"ERROR: failed to process order 12345 for user 67890 in 2.3s"
// Good: structured JSON — every field queryable without parsing
{
"level": "error",
"timestamp": "2024-01-15T10:23:45.123Z",
"service": "order-service",
"trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
"span_id": "00f067aa0ba902b7",
"user_id": "67890",
"order_id": "12345",
"duration_ms": 2300,
"error": "payment_service_timeout",
"message": "order processing failed"
}
SLO-Based Alerting
Alerting on raw metrics (CPU > 80%, error rate > 1%) produces alert fatigue without actionability. Alerting on SLO burn rate gives you signal proportional to user impact.
- Fast burn — page on-call immediately if the error budget burns at 14.4x in any 1-hour window (a 14.4x burn rate sustained for 1 hour consumes 2% of a 30-day budget).
- Slow burn — alert if the error budget burns at 6x in any 6-hour window (5% of a 30-day budget). Create a ticket and investigate; no immediate page.
- Exhaustion forecast — alert if current burn rate, if sustained, will exhaust monthly budget before month end. Planning signal, not an incident.
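The burn-rate thresholds reduce to simple arithmetic, following the multi-window numbers popularized by the Google SRE workbook (burn rate 1.0 means spending the budget exactly over the SLO period):

```python
def budget_consumed(burn_rate, window_hours, slo_period_hours=30 * 24):
    """Fraction of the SLO-period error budget consumed if `burn_rate`
    is sustained for `window_hours` (30-day period by default)."""
    return burn_rate * window_hours / slo_period_hours

assert abs(budget_consumed(14.4, 1) - 0.02) < 1e-9   # fast-burn page: 2% in 1 hour
assert abs(budget_consumed(6, 6) - 0.05) < 1e-9      # slow-burn ticket: 5% in 6 hours
```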
Cardinality Management in Metrics
High-cardinality labels in metrics systems cause memory explosions. Prometheus stores a separate time series for every unique combination of label values, so adding user_id as a label to a request metric creates a new series per user — potentially millions. Keep label values bounded: route templates, status-code classes, and regions are fine as labels; unbounded identifiers like user IDs and request IDs belong in logs and traces, not in metric labels.
16. Migration & Evolution Strategies
Strangler Fig Pattern
The strangler fig is the canonical pattern for migrating off a legacy system without a big-bang rewrite. The name comes from the strangler fig vine, which grows around a host tree and eventually replaces it. The key insight: you never turn off the old system until the new one fully handles its workload.
# Strangler Fig implementation via reverse proxy
#
# Phase 1: New system handles ONE capability; proxy routes that path to new
# Client ──▶ Reverse Proxy ──┬──▶ Legacy System (all paths)
# └──▶ New System (/api/orders only)
#
# Phase 2: Migrate more capabilities; adjust proxy routing
# Phase 3: Legacy system handles only remaining capabilities
# Phase 4: All capabilities migrated; decommission legacy
#
# Key principle: at every phase, both systems are running in production.
# You can pause migration, roll back routing, and observe real traffic.
Dual-Write Pitfalls
During a database migration, teams often "dual-write" to both the old and new system to keep them in sync. This pattern is fragile because writes to two systems are not atomic — one may succeed and the other fail, leaving them inconsistent.
- Problem 1: Partial write failure — write succeeds in old DB, fails in new DB. Now systems diverge.
- Problem 2: Ordering — concurrent writes may arrive in different order at each system, causing divergence.
- Better approach: CDC-based migration — use Debezium to capture writes from the source DB and replicate to the target. The WAL is an ordered log; replication is reliable. Enable writes to the new system only after replication is caught up and verified.
- Better approach: Expand-Contract — add new column (expand), backfill, dual-read (read both, prefer new), switch writes to new column only (contract), remove old column.
Blue-Green vs. Canary Deployments
| Strategy | How It Works | Pros | Cons |
|---|---|---|---|
| Blue-Green | Maintain two identical environments; cut over traffic all-at-once from blue to green | Instant rollback (flip traffic back), clean environment for each deploy | Doubles infrastructure cost; all-or-nothing means bugs affect all users immediately after cutover |
| Canary | Route a small percentage (1-5%) of traffic to new version; expand gradually | Limits blast radius; catches real-world issues before full rollout; no idle infrastructure | Longer rollout window; requires good metrics to decide when to expand |
| Feature flags | Code for both old and new behavior ships; flag controls which path is active per user | Decouple deploy from release; target specific users/cohorts; instant kill switch | Technical debt if flags aren't cleaned up; code complexity with many flags active simultaneously |
Database Schema Migration Without Downtime
Schema migrations on live databases require careful sequencing to avoid locking tables or breaking running application instances.
Adding a Column (Safe)
-- Step 1: Add column as nullable (instant, no lock)
ALTER TABLE orders ADD COLUMN shipped_at TIMESTAMPTZ;
-- Step 2: Backfill in batches to avoid long locks
UPDATE orders SET shipped_at = created_at
WHERE id BETWEEN 1 AND 10000 AND shipped_at IS NULL;
-- Repeat in batches...
-- Step 3: Set a DEFAULT for future inserts (metadata-only change, no table rewrite)
ALTER TABLE orders ALTER COLUMN shipped_at SET DEFAULT now();
-- Adding NOT NULL on fully populated column — use NOT VALID trick:
ALTER TABLE orders ADD CONSTRAINT orders_shipped_at_not_null
CHECK (shipped_at IS NOT NULL) NOT VALID;
ALTER TABLE orders VALIDATE CONSTRAINT orders_shipped_at_not_null;
-- VALIDATE acquires ShareUpdateExclusive lock (non-blocking for reads/writes)
Renaming a Column (Expand-Contract)
-- NEVER: ALTER TABLE orders RENAME COLUMN user_id TO customer_id;
-- This instantly breaks any running app that uses the old name.
--
-- INSTEAD (expand-contract over 3 deploys):
-- Deploy 1: Add new column, dual-write in app code
ALTER TABLE orders ADD COLUMN customer_id BIGINT;
-- Deploy 2: Backfill, read from new column, still write to both
UPDATE orders SET customer_id = user_id WHERE customer_id IS NULL;
-- Deploy 3: App reads and writes only new column; drop old
ALTER TABLE orders DROP COLUMN user_id;
Managing Technical Debt at Scale
Technical debt is not a failure — it is a natural consequence of building under uncertainty. The failure is not tracking it, not paying it down strategically, and letting it compound to the point where velocity collapses.
- Make it visible — keep a tech debt backlog with severity and estimated cost to fix. Invisible debt is never prioritized.
- 20% time rule — reserve 20% of every sprint for debt paydown. Negotiate this explicitly with product; show the velocity impact of not doing it.
- Boy Scout Rule — leave the code slightly better than you found it. Rename a confusing variable, add a missing test, extract a helper function. No separate "cleanup PRs" — do it inline.
- Strangler Fig for legacy code — don't rewrite; incrementally replace. Each new feature added to the new module, not the old one. Old code shrinks naturally.
- Dependency risk — unmaintained dependencies are the highest-leverage debt to address. A dependency that hasn't had a security patch in 3 years will cost you during an incident, not during normal development.