1. Thinking Like a Principal Engineer

The difference between a senior engineer and a principal engineer is not technical depth alone — it is the ability to reason about systems at the organizational and multi-year timescale. Every architectural decision carries a cost that compounds over time: in maintenance burden, in team cognitive overhead, in ability to hire, and in operational complexity.

Reversible vs. Irreversible Decisions

Jeff Bezos called these Type 1 and Type 2 decisions. Type 1 decisions are one-way doors — hard to reverse once made. Type 2 are two-way doors. Most architectural decisions are actually reversible with enough effort, but the cost of reversal varies enormously.

| Decision | Reversibility | Cost to reverse | Implication |
| --- | --- | --- | --- |
| Choice of primary database | Low | Very high (migration, downtime risk) | Spend more time upfront; involve team |
| API contract with external clients | Low | High (breaking changes, versioning) | Design carefully; version from day 1 |
| Internal service decomposition | Medium | Medium (refactor, redeploy) | Start monolith, split when needed |
| Choice of queue/message broker | Medium | Medium (consumer rewrite) | Abstract behind interface early |
| Caching strategy | High | Low (change cache logic) | Iterate quickly; optimize last |
| Logging format | High | Low (script migration) | Move fast; standardize later |

Decision Framework: Reversibility Test
Before over-analyzing a decision, ask: "How long and how painful would it be to reverse this in 18 months?" If the answer is "a few days of work," move quickly. If the answer is "a 6-month migration," treat it as a one-way door and slow down.

The "Boring Technology" Principle

Dan McKinley's essay on choosing boring technology remains one of the most practically useful things written about engineering. The core insight: every technology you adopt has an "innovation token" cost, and you only have a few tokens to spend. Use them on things that actually differentiate your business, not on being an early adopter of every new database or orchestration framework.

Build vs. Buy Framework

| Factor | Build | Buy/Use OSS |
| --- | --- | --- |
| Differentiating capability? | Yes — core to your product | No — commodity infrastructure |
| Team expertise available? | Yes, deeply | No, and hiring is slow |
| OSS/vendor solution exists? | Doesn't exist or is poor fit | Mature, widely adopted |
| Scale requirement unusual? | Yes, at extreme scale | No, typical scale |
| Compliance/data sovereignty? | Required on-prem/custom | Vendor meets requirements |
| Time to market? | Not a constraint | Urgent — ship fast |
Vendor Lock-in: Cost, Not Fear
Fear of vendor lock-in is often over-weighted. The real question is: what is the switching cost if this vendor raises prices 10x, gets acquired, or sunsets the product? Quantify it concretely. For most teams, S3-compatible storage, managed Postgres, and a message queue involve acceptable lock-in because alternatives exist and migration is bounded. Building your own Kafka or your own object store to "avoid lock-in" trades vendor lock-in for self-inflicted operational lock-in.

Scope vs. Depth

Principal engineers think in systems, not components. When you propose or review a design, ask:

2. Monolith vs. Microservices

The monolith vs. microservices debate is almost always answered wrong in interviews and architecture reviews because engineers focus on technical properties instead of organizational context. The right answer depends on team size, deployment maturity, and domain clarity — not on what Netflix does.

When the Monolith Is Correct

The monolith is the correct default for most systems. It is not a legacy artifact — it is the appropriate architecture when:

Real-World Context: Shopify
Shopify runs a Rails monolith called "The Monolith" at massive scale — hundreds of engineers, millions of merchants, tens of thousands of requests per second. They invested in a "modular monolith" approach, with strict module boundaries enforced through static analysis tooling. Shopify's reasoning: the cost of distributed systems (network failures, eventual consistency, distributed transactions) is not justified unless team autonomy genuinely requires independent deployability.

When to Decompose — Migration Triggers

Decompose when you have evidence, not anticipation, of these problems:

Domain-Driven Service Boundaries

When you do decompose, use Domain-Driven Design bounded contexts as the primary partitioning criterion, not technical layers — never "put all the database access in one service." A bounded context is a semantic boundary within which a term has one consistent meaning. The canonical mistake is splitting horizontally (API service, business logic service, data service) instead of vertically by domain capability.

# Bad decomposition: technical layers
# Every feature change touches all three services
api-gateway/
business-logic-service/
data-access-service/

# Good decomposition: domain capability (vertical slice)
# Each service is independently deployable and owns its data
order-service/        # owns orders table, order events
inventory-service/    # owns inventory, warehouse events
payment-service/      # owns transactions, PCI-scoped
notification-service/ # consumes events, sends emails/SMS
user-service/         # identity, auth, profiles

The "Distributed Monolith" Anti-Pattern

A distributed monolith has the worst of both worlds: the operational complexity of microservices with the coupling of a monolith. You have it when:

Shared Database = Distributed Monolith
The single most common way teams create a distributed monolith is by extracting services while keeping a shared database. If two services both write to the same table, they are not independent services — they are a distributed monolith with a network hop. Each service must own its data. Coordination happens through events or APIs, never through direct table access.

Service Size Heuristics

There is no canonical answer to "how big should a microservice be," but useful heuristics:

3. Synchronous vs. Asynchronous Communication

REST vs. gRPC vs. GraphQL

| Protocol | Best For | Avoid When | Key Tradeoff |
| --- | --- | --- | --- |
| REST | Public APIs, CRUD operations, browser clients, simple integrations | High-throughput internal RPC, strongly typed contracts | Human-readable, ubiquitous tooling; but verbose, no schema enforcement |
| gRPC | Internal service-to-service, polyglot environments, streaming, performance-critical paths | Browser clients (requires grpc-web), simple public APIs | Typed contracts, binary efficiency, streaming; but harder to debug, requires Protobuf |
| GraphQL | Aggregation layer for heterogeneous backends, mobile clients with bandwidth constraints, rapid UI iteration | Simple CRUD APIs, ML pipelines, non-client-facing services | Client-driven queries reduce over/under-fetching; but N+1 query problem, caching is harder, complex auth |
| WebSocket | Real-time bidirectional (chat, live dashboards, games) | Request-response patterns, sporadic updates | True push; but stateful connections complicate load balancing and scaling |

Decision Framework: Protocol Choice
Default to REST for external-facing APIs. Use gRPC for internal service calls where performance or type safety matters. Use GraphQL only when you have multiple clients (web, iOS, Android) with genuinely different data requirements and a team with GraphQL experience. GraphQL's operational complexity is underestimated — DataLoader, persisted queries, depth limiting, and schema stitching all require dedicated investment.

Event-Driven Architecture

Event-driven systems decouple producers from consumers temporally. The producer fires an event and does not wait for or know about downstream consumers. This enables loose coupling, independent scaling, and auditable state — but introduces eventual consistency and makes distributed tracing essential.

# Event-driven flow (order placement example)
#
# Synchronous (tight coupling):
# Client → OrderService → InventoryService → PaymentService → NotificationService
#          (blocks 500ms waiting for chain)
#
# Event-driven (loose coupling):
# Client → OrderService → publishes "order.created" event
#                              │
#           ┌──────────────────┼──────────────────┐
#           ▼                  ▼                  ▼
#  InventoryService    PaymentService    NotificationService
#  (reserves stock)   (charges card)    (sends confirmation)
#  (independent,       (independent,     (independent,
#   own retry logic)    own retry logic)   own retry logic)
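
The fan-out above can be sketched with a toy in-process bus. The `EventBus` class and the handler lists are illustrative stand-ins for a real broker such as Kafka or SNS/SQS; the service names and the "order.created" event mirror the diagram.

```python
from collections import defaultdict

class EventBus:
    """Toy in-process stand-in for a real message broker."""
    def __init__(self):
        self._handlers = defaultdict(list)

    def subscribe(self, event_type, handler):
        self._handlers[event_type].append(handler)

    def publish(self, event_type, payload):
        # The producer returns immediately; it neither waits for
        # nor knows about the subscribed consumers.
        for handler in list(self._handlers[event_type]):
            handler(payload)

bus = EventBus()
reserved, charged, notified = [], [], []

# Three independent consumers of the same event:
bus.subscribe("order.created", lambda e: reserved.append(e["order_id"]))  # InventoryService
bus.subscribe("order.created", lambda e: charged.append(e["order_id"]))   # PaymentService
bus.subscribe("order.created", lambda e: notified.append(e["order_id"]))  # NotificationService

bus.publish("order.created", {"order_id": "o-42", "total": 99.0})
```

Note that adding a fourth consumer requires no change to the publisher, which is the decoupling the diagram illustrates.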

Message Queues vs. Event Streams

| Property | Message Queue (RabbitMQ, SQS) | Event Stream (Kafka, Kinesis) |
| --- | --- | --- |
| Consumer model | Competing consumers (one gets the message) | Consumer groups (each group gets all messages) |
| Message retention | Deleted after acknowledgment | Retained for configurable period (days/forever) |
| Replay | Not possible after consumption | Any consumer can replay from offset 0 |
| Ordering guarantee | FIFO within queue (not across) | Ordered within partition |
| Consumer count scaling | Add consumers freely | Bounded by partition count |
| Use case | Task dispatch, work queue, RPC | Event log, audit trail, stream processing |
| Throughput | Moderate (tens of thousands/sec) | High (millions/sec with partitioning) |

Real-World Context: LinkedIn and Kafka
Kafka was built at LinkedIn to solve a problem that message queues could not: multiple independent consumers (analytics, search indexing, recommendations) all needed to read the same activity events, but a traditional queue deletes messages once consumed. Kafka's log abstraction — where messages are retained and consumers track their own offset — enables this fan-out pattern at scale.
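
The log abstraction described here fits in a few lines. `EventLog`, its method names, and the event strings below are invented for this sketch; the point is that each group tracks its own offset, so many independent consumers can read (and replay) the same events.

```python
class EventLog:
    """Toy model of Kafka's core abstraction: an append-only log."""
    def __init__(self):
        self._log = []
        self._offsets = {}          # consumer group -> next offset to read

    def append(self, event):
        self._log.append(event)     # events are never deleted on read

    def poll(self, group):
        offset = self._offsets.get(group, 0)
        events = self._log[offset:]
        self._offsets[group] = len(self._log)
        return events

    def seek(self, group, offset):
        self._offsets[group] = offset   # replay from any offset

log = EventLog()
log.append("page_view:1")
log.append("page_view:2")

analytics = log.poll("analytics")        # each group sees all events
search = log.poll("search-indexing")
log.seek("analytics", 0)
replayed = log.poll("analytics")         # replay from offset 0
```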

Request-Reply vs. Fire-and-Forget

Not all async communication is fire-and-forget. The Saga pattern and choreography-based workflows often need to correlate events across multiple services. Key patterns:
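
One such pattern is correlating a reply to its request over an asynchronous channel with a correlation ID. The queue lists and function names below are illustrative stand-ins for real broker queues:

```python
import uuid

requests_q, replies_q = [], []

def send_request(payload):
    # Requester attaches a unique correlation ID to each message.
    correlation_id = str(uuid.uuid4())
    requests_q.append({"correlation_id": correlation_id, "payload": payload})
    return correlation_id

def worker():
    # The consumer copies the correlation ID onto its reply so the
    # requester can match replies to outstanding requests.
    while requests_q:
        msg = requests_q.pop(0)
        replies_q.append({"correlation_id": msg["correlation_id"],
                          "result": msg["payload"].upper()})

def await_reply(correlation_id):
    for reply in replies_q:
        if reply["correlation_id"] == correlation_id:
            return reply["result"]
    return None

cid = send_request("order-123")
worker()
result = await_reply(cid)
```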

4. Database Selection

Decision Framework: Default to Postgres
Start with Postgres unless you have a specific, demonstrated reason not to. It handles relational data (obviously), JSON/JSONB documents, full-text search, time-series with partitioning, geospatial with PostGIS, and vector similarity with pgvector. The "but Postgres won't scale" objection is almost always premature — Instagram, Reddit, and Skype ran Postgres at enormous scale. Add a specialized database only when Postgres demonstrably cannot meet the requirement.

Database Taxonomy and Selection

| Type | Examples | Data Model | Use When | Avoid When |
| --- | --- | --- | --- | --- |
| Relational | Postgres, MySQL, CockroachDB | Tables, rows, ACID | Structured data, complex queries, strong consistency needed, joins are natural | Schema changes happen constantly, write throughput > 100k/sec/node |
| Document | MongoDB, Firestore, CouchDB | JSON documents, hierarchical | Deeply nested documents, schema varies per record, developer velocity matters more than query flexibility | You need joins across document types frequently, strong consistency required |
| Wide-Column | Cassandra, DynamoDB, Bigtable | Row key + column families | Write-heavy, high throughput, known access patterns, global distribution needed | Ad-hoc queries, unknown access patterns, complex relationships |
| Graph | Neo4j, Amazon Neptune, JanusGraph | Nodes, edges, properties | Relationship-heavy: social graphs, recommendation, fraud detection, knowledge graphs | Data is primarily tabular, graph traversals are rare |
| Time-Series | InfluxDB, TimescaleDB, Prometheus | Metric + timestamp + tags | Monitoring, IoT, financial tick data, any append-only time-stamped data | Data is not primarily time-ordered, you need complex joins |
| OLAP Columnar | ClickHouse, BigQuery, Snowflake, Redshift | Columnar, MPP | Analytical queries over billions of rows, aggregations, BI dashboards | Transactional workloads, point lookups, frequent updates |
| Search | Elasticsearch, OpenSearch, Typesense | Inverted index, denormalized | Full-text search, faceted search, log analytics | Primary source of truth, strong consistency required |

Postgres vs. DynamoDB — The Real Comparison

This is the most common "relational vs. NoSQL" choice in cloud-native systems. It is not a question of scale — it is a question of access patterns.

| Dimension | Postgres | DynamoDB |
| --- | --- | --- |
| Query flexibility | Arbitrary SQL — any predicate, any join | Access by partition key or GSI only — define access patterns upfront |
| Consistency | Serializable ACID by default | Eventually consistent by default; strong consistency optional at higher cost |
| Scalability | Vertical + read replicas; horizontal via Citus or partitioning | Horizontal by design; automatic partitioning, effectively unbounded throughput |
| Operational overhead | Requires sizing, vacuuming, index maintenance, failover management | Fully managed; no servers, automatic scaling, multi-region built-in |
| Cost model | Fixed (instance-based) — predictable at stable load | Pay per request — can be expensive at high uniform throughput, cheap at spiky load |
| Schema flexibility | Strict schema with migrations | Schemaless — different items in same table can have different attributes |
| Transactions | Full multi-table ACID transactions | Limited transactions (up to 100 items, same region) |

Real-World Context: DynamoDB at Amazon
Amazon's shopping cart uses DynamoDB because it has a known access pattern (get/put by customer ID), requires single-digit millisecond latency globally, and must never lose data. These requirements — known access patterns, extreme availability, global distribution — justify DynamoDB's operational constraints. The rule of thumb: if you can describe all your access patterns before writing code, DynamoDB is viable. If you'll discover new queries as the product evolves, use Postgres.

Cassandra vs. Postgres for Write-Heavy Workloads

Cassandra is optimized for write throughput at the expense of read flexibility. It uses a log-structured merge (LSM) tree internally: appends go to an in-memory memtable, which is flushed to immutable SSTables and compacted periodically. Reads may need to merge multiple SSTables. This makes writes extremely fast but reads more expensive than in B-tree databases unless they align with the partition key.
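
A minimal illustration of that write path (class and parameter names are mine; real compaction, bloom filters, and tombstones are omitted):

```python
class ToyLSM:
    """Illustrative LSM sketch: writes land in a memtable and are flushed
    to immutable sorted runs ("SSTables"); reads check the memtable first,
    then runs from newest to oldest."""
    def __init__(self, memtable_limit=2):
        self.memtable = {}
        self.sstables = []           # immutable sorted runs, newest last
        self.limit = memtable_limit

    def put(self, key, value):
        # Writes are cheap: an in-memory update, occasionally a flush.
        self.memtable[key] = value
        if len(self.memtable) >= self.limit:
            self.sstables.append(dict(sorted(self.memtable.items())))
            self.memtable = {}

    def get(self, key):
        # Reads may consult several structures, newest first.
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):
            if key in table:
                return table[key]
        return None

db = ToyLSM()
db.put("a", 1); db.put("b", 2)       # second put triggers a flush
db.put("a", 3)                       # newer value shadows the flushed one
```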

5. Caching Strategies

Caching is one of the highest-leverage performance interventions available, but it introduces a consistency problem that is frequently underestimated. Every caching layer is a replica that may lag behind the source of truth. The complexity is not in reading from the cache — it is in deciding when the cache is stale.

Core Cache Patterns

| Pattern | How it works | Best for | Cache consistency |
| --- | --- | --- | --- |
| Cache-aside (lazy) | App checks cache; on miss, reads DB, populates cache, returns | Read-heavy, tolerates brief staleness | Cache may be stale until TTL expires |
| Read-through | Cache library handles miss; fetches from DB transparently | Same as cache-aside but with cleaner app code | Same as cache-aside |
| Write-through | Every write goes to cache AND DB synchronously | Write + read on same data, consistency critical | Cache always current; write latency doubled |
| Write-behind (write-back) | Write to cache, async flush to DB | Write-heavy, eventual persistence acceptable | Risk of data loss if cache crashes before flush |
| Refresh-ahead | Cache proactively refreshes popular entries before TTL expires | Known hot keys (top products, trending content) | Near-current; complex to implement |
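
As a concrete anchor for the table, here is cache-aside in miniature, with plain dicts standing in for Redis and the database (TTL handling omitted, and the instrumentation list is mine):

```python
db = {"user:1": {"name": "Ada"}}      # stand-in for the source of truth
cache = {}                            # stand-in for Redis
db_reads = []                         # instrumentation for illustration

def get_user(key):
    if key in cache:                  # 1. check cache
        return cache[key]
    value = db[key]                   # 2. on miss, read the DB
    db_reads.append(key)
    cache[key] = value                # 3. populate cache
    return value                      # 4. return

first = get_user("user:1")            # miss: hits the DB
second = get_user("user:1")           # hit: served from cache
```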

Cache Invalidation — The Hard Problem

Phil Karlton's observation that "there are only two hard things in computer science: cache invalidation and naming things" understates the problem in distributed systems. Options for invalidation:

Redis vs. Memcached

| Dimension | Redis | Memcached |
| --- | --- | --- |
| Data structures | Strings, hashes, lists, sets, sorted sets, streams, geospatial, HyperLogLog | Strings only |
| Persistence | RDB snapshots + AOF write-ahead log | None — purely ephemeral |
| Replication | Built-in primary/replica + Redis Cluster for sharding | Manual client-side sharding only |
| Lua scripting | Yes — atomic multi-step operations | No |
| Memory efficiency | Slightly higher overhead per key | More memory-efficient for pure string caching |

Use Redis for almost everything — rate limiting, session store, distributed lock, leaderboard, pub/sub, job queue. Use Memcached for a pure string cache where maximum memory efficiency matters and you need multi-threaded scaling across many CPUs.

Thundering Herd and Cache Stampede

When a popular cache entry expires, thousands of concurrent requests may all miss the cache and hit the database at once. Mitigations:
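
One widely used mitigation is TTL jitter: randomize each entry's expiry so hot keys do not all lapse at the same instant. A sketch, where the function name and the 10% default spread are my choices:

```python
import random

def jittered_ttl(base_seconds, jitter_fraction=0.1, rng=random.random):
    """Return the base TTL plus a uniform random offset of up to
    +/- jitter_fraction of the base, spreading out recomputation."""
    spread = base_seconds * jitter_fraction
    return base_seconds + (rng() * 2 - 1) * spread

ttl = jittered_ttl(300)   # somewhere in roughly [270, 330) seconds
```

Other mitigations (per-key locking so only one request recomputes, serving stale data while refreshing) follow the same goal: collapse many simultaneous recomputations into one.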

When Caching Hurts
Caching is not free. Avoid caching when: data changes very frequently (cache hit rate will be low), data must be strongly consistent (cache introduces lag), cache is larger than the working set (you're caching cold data), or you're caching the result of a bug (a cache obscures a slow query that should be fixed). Always measure cache hit rate. A hit rate below 70-80% means the cache may be adding latency for cache misses with little benefit.

6. Data Pipeline Architecture

Batch vs. Streaming vs. Micro-batch

| Approach | Latency | Throughput | Complexity | Use When |
| --- | --- | --- | --- | --- |
| Batch | Minutes to hours | Highest (amortize overhead) | Low | Daily reporting, ML training, data warehouse loads, non-time-sensitive aggregations |
| Micro-batch (Spark Streaming) | Seconds to minutes | High | Medium | Near-real-time dashboards, fraud detection where seconds are acceptable |
| Streaming (Flink, Kafka Streams) | Milliseconds | Lower per event | High | Real-time fraud, live recommendations, alerting on metrics, order processing pipelines |

Lambda vs. Kappa Architecture

# Lambda Architecture (two processing paths)
#
# Raw Events ──┬──▶ Batch Layer (Spark/Hadoop) ──▶ Batch Views
#              │                                         │
#              └──▶ Speed Layer (Flink/Storm)  ──▶ RT Views ──▶ Serving Layer
#                                                         │           (merges views)
#
# Problem: two codebases doing same logic; hard to keep in sync

# Kappa Architecture (stream-only)
#
# Raw Events ──▶ Event Log (Kafka) ──▶ Stream Processor ──▶ Serving Store
#                    │                  (Flink / Spark)
#                    └── replay for backfill / recompute
#
# Benefit: one codebase, replay for correctness
# Cost: streaming semantics are harder; exactly-once is complex
Decision Framework: Lambda vs. Kappa
Lambda is not "wrong" — it emerged because streaming was hard and batch was reliable. Today, Flink and Spark Structured Streaming have matured enough that Kappa is the better default for new systems. Use Lambda only when you need the batch layer for historical reprocessing at massive scale where streaming would be too slow or expensive. Most teams choosing Lambda today are over-engineering.

ELT vs. ETL in the Modern Stack

The shift from ETL (Extract, Transform, Load) to ELT (Extract, Load, Transform) was enabled by cheap columnar storage and SQL-centric warehouses like Snowflake and BigQuery. The transformation now happens inside the warehouse using SQL (dbt) rather than in an external compute layer.

Medallion Architecture (Bronze / Silver / Gold)

# Medallion / Lakehouse layering
#
# Bronze (Raw)   — exactly as received from source; append-only
#                  format: JSON/Avro, partitioned by ingestion date
#                  retention: long (years) — source of truth for replay
#
# Silver (Curated) — cleaned, deduplicated, validated, typed
#                    format: Parquet with schema
#                    joins: reference data merged in; nulls handled
#
# Gold (Aggregated) — business-level aggregates, ready for BI/ML
#                     format: wide denormalized tables
#                     audience: analysts, dashboards, feature stores

| Tool | Use For | Avoid When |
| --- | --- | --- |
| Simple SQL / dbt | Batch transformations in a warehouse, business logic expressed cleanly in SQL, ELT pipelines | Complex stateful logic, streaming, data outside the warehouse |
| Spark | Large-scale batch processing, ML training data prep, joining massive datasets, Python/Scala ecosystem | Low-latency streaming (<1s), simple transformations that SQL handles |
| Flink | True streaming with event time semantics, stateful stream processing, exactly-once guarantees, low latency | Batch-only workloads, teams without streaming expertise |
| Kafka Streams / ksqlDB | Simple stream transformations, filtering, enrichment within Kafka ecosystem | Complex stateful joins across multiple streams, ML inference |

7. Storage & Object Store Patterns

S3 as the Universal Storage Layer

S3 (and compatible APIs: GCS, Azure Blob, MinIO) has become the de facto storage layer for modern data systems. Its durability (11 nines), low cost (a few cents per GB-month, far less in colder tiers), and ubiquitous support make it the right default for any data that is accessed less than constantly and needs to outlive the compute that created it.

Data Lake vs. Data Warehouse vs. Lakehouse

| Concept | Storage | Query | Consistency | Cost |
| --- | --- | --- | --- | --- |
| Data Lake | S3/GCS, raw files (Parquet, Avro, JSON) | Athena, Trino, Spark — on-demand | No ACID; manual schema management | Storage cheap; query compute charged per scan |
| Data Warehouse | Proprietary columnar (Snowflake, BigQuery managed storage) | SQL, fast indexing, metadata caching | ACID within warehouse | Storage + compute bundled or separate |
| Lakehouse | S3/GCS + Delta Lake / Apache Iceberg / Apache Hudi table format | Spark, Trino, Flink — open format | ACID via transaction log on S3 | Storage at S3 rates + separate compute |

Real-World Context: Databricks and the Lakehouse
Databricks coined "Lakehouse" to describe Delta Lake — a transaction log layered over S3/GCS that provides ACID guarantees, schema enforcement, and time travel (query table as it was at any past commit). Apache Iceberg (used by Netflix, Apple) and Apache Hudi (used by Uber) solve the same problem with different trade-offs. All three are "open table formats" — the data lives in your S3 bucket in open Parquet files, and multiple query engines can read the same table.

File Format Selection

| Format | Orientation | Schema | Use For | Avoid For |
| --- | --- | --- | --- | --- |
| Parquet | Columnar | Embedded (self-describing) | Analytics, ML training, data warehouse ingestion, default choice for structured data | Row-at-a-time access (point lookups), streaming writes |
| ORC | Columnar | Embedded | Hive/Spark workloads where ORC's bloom filters provide better predicate pushdown | Non-Hadoop ecosystems; Parquet is more widely supported |
| Avro | Row | External (Schema Registry) | Kafka serialization, row-at-a-time streaming writes, schema evolution with registry | Analytical queries (columnar is far faster) |
| JSON/JSONL | Row | None (schemaless) | Bronze/raw layer in lake, API payloads, debugging | Any analytical workload (50-100x slower than Parquet at scale) |
| CSV | Row | None | Human inspection, one-time exports, small files | Production data pipelines (no types, fragile to escaping, slow) |

Hot / Warm / Cold Tiering

Storage tiering dramatically reduces cost by moving data to cheaper storage as it ages. S3 supports tiering natively with lifecycle policies:

# S3 Lifecycle policy example
Rules:
  - ID: "tier-old-data"
    Status: Enabled
    Transitions:
      # Hot: S3 Standard (immediate access, ~$0.023/GB/month)
      - Days: 30
        StorageClass: STANDARD_IA   # Warm: infrequent access (~$0.0125/GB/month)
      - Days: 90
        StorageClass: GLACIER_IR    # Cold: instant retrieval (~$0.004/GB/month)
      - Days: 365
        StorageClass: DEEP_ARCHIVE  # Archive: ~$0.00099/GB/month, 12hr retrieval

Search Engine Comparison

| Engine | Strengths | Weaknesses | Best For |
| --- | --- | --- | --- |
| Elasticsearch / OpenSearch | Mature, full-featured, distributed, excellent aggregations, observability ecosystem (ELK/EFK) | Operationally complex, memory-hungry, eventual consistency, cluster management is non-trivial | Log analytics, application search with complex facets, enterprise search at scale |
| Typesense | Typo-tolerant, fast, easy to operate (single binary), instant search UX | Smaller feature set, no native distributed mode (Typesense Cloud for HA) | E-commerce product search, SaaS app search, developer-friendly deployments |
| Meilisearch | Typo-tolerant, zero-config, instant results, great developer UX | Not designed for >100M documents, limited aggregations | Small to medium datasets, developer tools, documentation search |
| Postgres Full-Text | Already in your stack, GIN indexes, tsvector/tsquery, no extra service | No typo tolerance, weaker relevance ranking, no semantic search | Internal search where basic FTS is sufficient, avoiding extra infrastructure |
| Algolia | Fastest time-to-market, excellent hosted experience, SDKs for every platform | Expensive at scale, vendor lock-in, limited customization | Early-stage products where search UX matters but engineering capacity is limited |

Full-Text vs. Vector Search

Traditional full-text search relies on an inverted index: a mapping from terms to document IDs. It is fast, exact, and deterministic. Vector (semantic) search embeds documents and queries into a high-dimensional space and retrieves nearest neighbors. It understands synonyms and paraphrases but is slower and less interpretable.

| Dimension | Full-Text (BM25) | Vector (Semantic) |
| --- | --- | --- |
| Precision on exact terms | Excellent — matches exact keywords | Weaker — may return thematically related but term-distant results |
| Synonym handling | Requires manual synonym dictionaries | Handles synonyms automatically via embedding |
| Query speed | Milliseconds — inverted index lookup | Tens to hundreds of ms — ANN scan over vectors |
| Infrastructure | Elasticsearch, Postgres FTS | pgvector, Pinecone, Weaviate, Qdrant, Milvus |
| Best for | Code search, legal documents, exact product SKUs | Q&A, RAG (retrieval-augmented generation), semantic similarity |

Decision Framework: Hybrid Search
For AI applications (RAG pipelines, semantic search), the state of the art is hybrid search: combine BM25 keyword scores with vector similarity scores using Reciprocal Rank Fusion (RRF). This outperforms either approach alone. Elasticsearch 8.x, OpenSearch, and pgvector with a custom RRF query all support this pattern. Start with Postgres + pgvector for apps under a few million documents — you almost certainly don't need a dedicated vector database until you've proven the use case.
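
Reciprocal Rank Fusion itself is only a few lines. In this sketch the document IDs are invented; each document's fused score is the sum of 1/(k + rank) across the result lists that contain it, with k = 60 as the conventional constant:

```python
def rrf(result_lists, k=60):
    """Fuse ranked result lists via Reciprocal Rank Fusion."""
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["doc_a", "doc_b", "doc_c"]      # keyword ranking
vector_hits = ["doc_b", "doc_d", "doc_a"]    # semantic ranking
fused = rrf([bm25_hits, vector_hits])
```

Here doc_b wins because it ranks high in both lists, even though neither ranker put it first and last at once; that agreement bonus is why RRF outperforms either signal alone.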

Inverted Index Internals (Why It Matters)

Understanding the inverted index explains Elasticsearch's behavior in production. When you index a document, the analyzer chain runs: tokenization → lowercasing → stop word removal → stemming → tokens stored in the inverted index with position and frequency metadata.
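
The chain can be sketched end to end. The stop list and the suffix-stripping "stemmer" below are deliberate toys, but the resulting structure, term → {doc_id: positions}, is the real shape of an inverted index:

```python
STOP_WORDS = {"the", "a", "an", "of", "to"}

def analyze(text):
    """Toy analyzer chain: tokenize, lowercase, drop stop words, stem."""
    tokens = []
    for pos, raw in enumerate(text.split()):
        tok = raw.lower().strip(".,!?")
        if tok in STOP_WORDS:
            continue
        if tok.endswith("ing"):      # crude stand-in for a real stemmer
            tok = tok[:-3]
        tokens.append((tok, pos))
    return tokens

def index_docs(docs):
    """Build term -> {doc_id: [positions]} from {doc_id: text}."""
    inverted = {}
    for doc_id, text in docs.items():
        for term, pos in analyze(text):
            inverted.setdefault(term, {}).setdefault(doc_id, []).append(pos)
    return inverted

idx = index_docs({1: "The cat is sleeping", 2: "A sleeping dog"})
```

Because "sleeping" stems to "sleep" in both documents, a query for either form finds both; that normalization at index time is what the analyzer chain buys you.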

9. API Design Decisions

REST Maturity Model (Richardson)

Most "REST" APIs are Level 2. True Level 3 (HATEOAS) is rarely implemented in practice. Know the model to know where to invest.

| Level | Characteristics | Example |
| --- | --- | --- |
| Level 0 — POX | HTTP as transport; single endpoint, all via POST | POST /api {"action": "getUser", "id": 1} |
| Level 1 — Resources | Multiple URIs, one per resource concept | POST /users/1 for everything about user 1 |
| Level 2 — HTTP Verbs | Uses GET/POST/PUT/DELETE correctly, HTTP status codes | GET /users/1, DELETE /users/1 — industry default |
| Level 3 — HATEOAS | Responses include links to related actions (self-describing API) | Response body contains {"links": {"delete": "/users/1"}} |

API Versioning Strategies

Versioning is about managing change for clients you do not control. For internal APIs, you can break the contract and coordinate across teams. For external/public APIs, you must version.

Real-World Context: Stripe's API Versioning
Stripe versions its API by date (e.g., 2024-06-20). When you create an API key, it is pinned to the current API version. Your integration never breaks unless you explicitly upgrade. Stripe maintains backward compatibility for every historical version in production — an enormous investment, but it is why developers trust Stripe's API. For most teams: URL path versioning is sufficient. Start with /v1/ even if you're not sure you'll need v2.

Pagination Patterns

| Pattern | How It Works | Pros | Cons |
| --- | --- | --- | --- |
| Offset / Limit | ?offset=20&limit=10 | Simple, SQL-native, supports jumping to page N | Skips/duplicates on concurrent inserts; slow at large offsets (DB must scan and discard) |
| Cursor-based | ?after=eyJpZCI6MTAwfQ== (opaque encoded cursor) | Stable pagination under concurrent writes; O(1) page fetch | Cannot jump to arbitrary page; cursor encodes position (opaque to client) |
| Keyset pagination | ?after_id=100&after_created=2024-01-01 | Performant with index on sort key; stable | Multi-column sort keys add complexity |
| Page token (Google style) | ?pageToken=<server-generated> | Server-side encoding hides internals; supports cursor | Cannot bookmark specific page; tokens may expire |

Decision Framework: Pagination
Use cursor-based pagination for any collection that can receive concurrent writes. Offset pagination is fine for read-heavy, low-churn datasets (e.g., a static product catalog). For internal data engineering APIs, keyset pagination on a monotonic ID or timestamp is performant and simple. Never use offset pagination on collections with millions of rows — OFFSET 1000000 LIMIT 20 requires the database to scan and discard 1 million rows.
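
Keyset pagination reduces to a predicate on the sort key. A sketch over an in-memory, id-sorted dataset (the data and function name are mine; in SQL this would be WHERE id > :after_id ORDER BY id LIMIT :limit against an index on id):

```python
rows = [{"id": i, "name": f"item-{i}"} for i in range(1, 101)]

def fetch_page(after_id=0, limit=10):
    """Return one page plus the cursor for the next page."""
    # The filter is what the index serves in O(log n); no rows are
    # scanned and discarded the way OFFSET forces the database to.
    page = [r for r in rows if r["id"] > after_id][:limit]
    next_cursor = page[-1]["id"] if page else None
    return page, next_cursor

page1, cursor = fetch_page()
page2, _ = fetch_page(after_id=cursor)
```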

Idempotency and Retry Safety

Any API that triggers a state change should support idempotency keys. The client generates a unique key per logical operation; the server deduplicates using that key. This enables safe retries in the presence of network failures — critical for payment APIs, order creation, and any operation with real-world side effects.

# Idempotent POST: client generates idempotency key
POST /v1/payments
Idempotency-Key: a4e9-4b27-8f31-...

# Server behavior:
# 1. Check if key seen before → return cached response
# 2. If not → process, store result against key
# 3. Return result

# Idempotency window: typically 24 hours
# Storage: Redis with TTL or a database table indexed by key
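
The server-side steps in the comment block can be sketched as follows; a dict stands in for the Redis/DB key store, and `charge_card` is a stub side effect (both names are mine):

```python
seen = {}                      # idempotency key -> cached response
charges = []

def charge_card(amount):
    charges.append(amount)     # the side effect we must not repeat
    return {"status": "ok", "amount": amount}

def create_payment(idempotency_key, amount):
    if idempotency_key in seen:            # 1. key seen -> cached response
        return seen[idempotency_key]
    result = charge_card(amount)           # 2. process once,
    seen[idempotency_key] = result         #    store result against key
    return result                          # 3. return

first = create_payment("a4e9-4b27", 50)
retry = create_payment("a4e9-4b27", 50)    # a network retry is now safe
```

In production the `seen` store needs the TTL and durability properties the comment block describes; the dict only illustrates the dedup logic.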

Rate Limiting Strategies

10. Authentication & Authorization at Scale

OAuth2 / OIDC Flow Selection

| Flow | Use For | Key Characteristic |
| --- | --- | --- |
| Authorization Code + PKCE | Web apps, mobile apps, any user-facing client | Gold standard; short-lived auth code exchanged for tokens; PKCE prevents code interception |
| Client Credentials | Service-to-service (machine-to-machine) | No user involved; service authenticates with client_id + secret directly |
| Device Code | Smart TVs, CLI tools, headless devices | User completes auth on separate device; polling for token |
| Implicit (deprecated) | Avoid — replaced by Auth Code + PKCE | Tokens in URL fragment; vulnerable to history/referrer leakage |
| Resource Owner Password | Avoid — legacy; breaks trust model | Client receives user's credentials; violates zero-trust |


JWT vs. Session Tokens

| Dimension | JWT (stateless) | Session Token (stateful) |
| --- | --- | --- |
| Server state | None — token is self-contained | Session store (Redis/DB) required |
| Revocation | Hard — token is valid until expiry; needs blocklist | Easy — delete session from store |
| Payload size | Large — claims embedded (200-2000 bytes) | Small — opaque random token (~32 bytes) |
| Horizontal scaling | Stateless — any replica validates signature | Requires shared session store (Redis) |
| Clock skew sensitivity | High — exp check requires synchronized clocks | None — server controls expiry |

Recommendation: use JWTs for service-to-service, short-lived tokens (<15 min). Use session tokens for user sessions where revocation matters. Never store sensitive data in the JWT payload (it's only Base64 encoded, not encrypted unless you use JWE).
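
The warning about JWT payloads being readable is easy to demonstrate: the payload segment is just base64url-encoded JSON, decodable without any key. The token fragment below is constructed locally for illustration:

```python
import base64, json

payload = {"sub": "user-123", "role": "admin", "exp": 1735689600}
# JWTs base64url-encode each segment and strip the '=' padding.
segment = base64.urlsafe_b64encode(json.dumps(payload).encode()).rstrip(b"=")

def decode_jwt_payload(seg: bytes) -> dict:
    """Decode a JWT payload segment; no secret or key is needed."""
    padded = seg + b"=" * (-len(seg) % 4)     # restore stripped padding
    return json.loads(base64.urlsafe_b64decode(padded))

decoded = decode_jwt_payload(segment)
```

The signature (the third segment of a real token) proves integrity and origin, not confidentiality; anyone holding the token can read the claims exactly as shown.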

RBAC vs. ABAC vs. ReBAC

| Model | Policy basis | Example | Use When |
| --- | --- | --- | --- |
| RBAC (Role-Based) | User's roles | User has "admin" role → can do everything | Simple, coarse-grained permissions; most B2B SaaS |
| ABAC (Attribute-Based) | User + resource + environment attributes | User in "engineering" department can access resources tagged "internal" before 5pm | Complex policies with context (time, location, resource attributes) |
| ReBAC (Relationship-Based) | Graph of relationships between entities | User can edit document only if user is member of the document's owning team | Google Drive, GitHub permissions — hierarchical resource ownership |

Real-World Context: Google Zanzibar
Google's Zanzibar paper (2019) describes the authorization system that powers Google Drive, YouTube, and Google Maps. It uses a relationship tuple model: (object, relation, user) — for example, "user:alice is a viewer of document:doc123" is one tuple. Authorization is a graph traversal. Open-source implementations include OpenFGA (by Auth0/Okta) and SpiceDB (by Authzed). This model scales to billions of objects and handles complex inheritance rules — but it is a significant operational investment. Use RBAC until it hurts.
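
A toy version of the tuple model (the tuples, the `IMPLIES` rewrite table, and its single rule are invented for illustration): alice can view the document because she is a member of the team that owns it.

```python
# Relationship tuples: (object, relation, user-or-subject).
tuples = {
    ("doc:readme", "owner", "team:eng"),
    ("team:eng", "member", "user:alice"),
}

# Rewrite rules: "viewer" on an object is implied by being related
# (via nested_rel) to whatever holds via_rel on that object.
IMPLIES = {"viewer": [("owner", "member")]}   # owning team's members can view

def check(obj, relation, user):
    """Authorization as graph traversal over relationship tuples."""
    if (obj, relation, user) in tuples:
        return True
    for via_rel, nested_rel in IMPLIES.get(relation, []):
        for (o, r, subject) in tuples:
            if o == obj and r == via_rel:
                # subject is an intermediate object (a team here);
                # recurse: is `user` related to it by nested_rel?
                if check(subject, nested_rel, user):
                    return True
    return False

allowed = check("doc:readme", "viewer", "user:alice")
```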

Service-to-Service Authentication

11. Scaling Patterns

Vertical vs. Horizontal Scaling

| Dimension | Vertical (Scale Up) | Horizontal (Scale Out) |
| --- | --- | --- |
| Mechanism | Bigger machine (more CPU/RAM) | More machines of same size |
| Simplicity | Simple — no code changes needed | Complex — requires stateless design or distributed state management |
| Cost curve | Linear to exponential (largest instances have poor price/performance) | Linear — commodity machines |
| Failure domain | Single point of failure | Partial failure possible; one node failing doesn't kill all capacity |
| Best for | Databases (Postgres scales surprisingly far vertically), early-stage systems | Stateless app servers, caches, workers that can run on any node |

Database Sharding Strategies

Sharding is the last resort for database scaling, not the first response. Before sharding, exhaust the alternatives: read replicas, caching, query optimization, vertical scaling, and table partitioning.

Cross-Shard Queries Are Expensive
After sharding, any query that cannot be routed to a single shard must fan out to all shards and merge results. This makes aggregate queries, joins across shard boundaries, and ordered pagination extremely expensive. Design your shard key around your most frequent queries. If every important query requires hitting all shards, you have the wrong shard key.

Connection Pooling — The Real Bottleneck

Database connection limits are frequently the actual scaling bottleneck, hit well before raw query throughput becomes one. Postgres handles ~100-300 concurrent connections before performance degrades (each connection is backed by a dedicated server process, so per-connection overhead is high). At 1,000 application servers with 10 connections each, you exceed Postgres's limit immediately.

# Connection math — common failure mode
#
# 200 app pods × 10 connections each = 2,000 connections to Postgres
# Postgres max_connections default = 100  →  immediate failure
# Even with max_connections=1000  →  severe performance degradation
#
# Solution: PgBouncer (connection pooler) in transaction mode
# 200 app pods → PgBouncer (20 server connections) → Postgres
#
# PgBouncer transaction mode:
# - Client connection is a logical connection only
# - Physical DB connection held only during active transaction
# - 2,000 client connections → 20 actual Postgres connections

Load Balancing Algorithms

AlgorithmHow It WorksBest For
Round RobinRequests distributed sequentially to each serverHomogeneous servers, similar request cost
Least ConnectionsRoute to server with fewest active connectionsLong-lived connections (WebSockets, gRPC streams)
Least Response TimeRoute to server with lowest latency + fewest connectionsHeterogeneous servers, latency-sensitive workloads
IP Hash / Sticky SessionsHash client IP to consistently route to same serverStateful servers (avoid with stateless design)
Random with two choicesPick two servers randomly; send to the less loadedProven near-optimal distribution with minimal coordination cost
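The "random with two choices" row is worth seeing in action, because the result is surprisingly strong for such a simple rule. A small simulation, with server names and request counts invented for illustration:

```python
import random

def pick_server(loads: dict, rng: random.Random) -> str:
    """Power of two choices: sample two servers, route to the less loaded."""
    a, b = rng.sample(list(loads), 2)
    return a if loads[a] <= loads[b] else b

def simulate(n_servers: int = 10, n_requests: int = 10_000, seed: int = 42) -> dict:
    rng = random.Random(seed)
    loads = {f"s{i}": 0 for i in range(n_servers)}
    for _ in range(n_requests):
        loads[pick_server(loads, rng)] += 1
    return loads

loads = simulate()
# With 10k requests over 10 servers, per-server load stays very close to the
# 1,000-request mean -- far tighter than purely random assignment.
```

The only coordination cost is reading two load counters per request, which is why this algorithm shows up in so many production load balancers.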

12. Reliability & Resilience

Circuit Breakers, Retries, and Timeouts

Reliability patterns are not optional in distributed systems — they are the difference between isolated failures and cascading outages. These patterns must be implemented together; any one alone is insufficient.

PatternPurposeWithout It
TimeoutBound the maximum time waiting for a responseThreads block indefinitely; thread pool exhaustion cascades into full service outage
Retry with backoffRecover from transient failures automaticallyTransient network errors cause permanent failures; users see errors that would have resolved in 100ms
Circuit breakerStop calling a failing service to allow it to recoverAvalanche of retries overwhelms a struggling service, preventing recovery (retry storm)
BulkheadIsolate failure domains with separate thread pools / resource poolsOne slow dependency exhausts all threads, taking down unrelated features
# Circuit breaker state machine
#
# CLOSED (normal) ──[failure threshold exceeded]──► OPEN (rejecting all calls)
#                                                        │
#                                          [after timeout]
#                                                        ▼
#                                               HALF-OPEN (probe calls)
#                                              /                      \
#                                    [success]                    [failure]
#                                        ▼                            ▼
#                                   CLOSED                          OPEN
#
# Configuration example:
# failure_threshold = 50%        # open after 50% of calls fail in window
# window_size = 10 calls         # evaluate over last N calls
# recovery_timeout = 30s         # wait before trying HALF-OPEN
# probe_success_count = 3        # N successes to close circuit
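The state machine above maps directly onto a small class. This is a minimal sketch: parameter names mirror the configuration comments, and the clock is injectable so the behavior can be exercised without real waits:

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker over a sliding window of call outcomes."""

    def __init__(self, failure_threshold=0.5, window_size=10,
                 recovery_timeout=30.0, probe_success_count=3, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.window_size = window_size
        self.recovery_timeout = recovery_timeout
        self.probe_success_count = probe_success_count
        self.clock = clock
        self.state = "CLOSED"
        self.window = []          # recent outcomes: True = success
        self.opened_at = None
        self.probe_successes = 0

    def allow_request(self) -> bool:
        if self.state == "OPEN":
            if self.clock() - self.opened_at >= self.recovery_timeout:
                self.state = "HALF_OPEN"       # let probe calls through
                self.probe_successes = 0
                return True
            return False
        return True                            # CLOSED or HALF_OPEN

    def record(self, success: bool) -> None:
        if self.state == "HALF_OPEN":
            if success:
                self.probe_successes += 1
                if self.probe_successes >= self.probe_success_count:
                    self.state, self.window = "CLOSED", []
            else:
                self._trip()                   # probe failed: back to OPEN
            return
        self.window = (self.window + [success])[-self.window_size:]
        failures = self.window.count(False)
        if len(self.window) == self.window_size and \
                failures / self.window_size >= self.failure_threshold:
            self._trip()

    def _trip(self):
        self.state, self.opened_at = "OPEN", self.clock()
```

Production libraries add per-call timeouts, metrics, and concurrency safety on top of this skeleton, but the three states and their transitions are exactly these.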

Retry Strategy Details

SLOs as Architecture Contracts

An SLO (Service Level Objective) is not a monitoring concern — it is an architectural contract that shapes design decisions. When you define "99.9% of requests complete in <200ms," you have made concrete decisions about caching, database indexing, async processing, and circuit breaker thresholds.
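The arithmetic behind an availability SLO is worth internalizing, because the error budget it implies is concrete and small:

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Downtime (or bad-request time) allowed by an availability SLO."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1 - slo)

# 99.9% over 30 days -> 43.2 minutes of budget for the whole month
assert round(error_budget_minutes(0.999), 1) == 43.2
# 99.99% -> 4.32 minutes: roughly one short incident per month, total
assert round(error_budget_minutes(0.9999), 2) == 4.32
```

The budget, not the raw percentage, is what shapes design: a 43-minute monthly budget permits ordinary failover and deploys; a 4-minute budget forces architectural choices like redundant regions.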

Multi-Region: Active-Active vs. Active-Passive

TopologyHow It WorksRPO/RTOCostComplexity
Active-PassiveOne region serves traffic; passive region is hot standby. Failover promotes passive to active.RPO: seconds to minutes (replication lag). RTO: minutes (DNS propagation, warm-up)Medium — passive region is mostly idleLower — no conflict resolution needed
Active-ActiveBoth regions serve traffic; writes may occur in both simultaneously.RPO: near-zero (synchronous replication) or seconds (async). RTO: near-zero (traffic already flowing)High — full capacity in both regionsVery high — requires conflict resolution, distributed transactions, or eventual consistency design
Active-Active Is Not Automatically Better
Active-active is only worth its complexity if your availability requirement is above 99.99% or your latency requirement demands routing to the nearest region. The distributed write coordination required for true active-active is one of the hardest problems in systems engineering. Most teams are better served by active-passive with automated failover and a well-tested runbook. The question to ask: "Would our users notice a 3-minute outage once a year?" If no, active-passive is sufficient.

13. Data Engineering at Scale

Schema Registry and Schema Evolution

In event-driven systems, schema management is a first-class concern. A message schema changes; consumers break. A schema registry centralizes schema versions and enforces compatibility rules before a producer can publish a new schema.

// Avro schema evolution example — backward compatible change
// Old schema
{ "type": "record", "name": "User",
  "fields": [
    {"name": "id", "type": "string"},
    {"name": "email", "type": "string"}
  ]
}

// New schema — adds optional field with default (backward compatible)
{ "type": "record", "name": "User",
  "fields": [
    {"name": "id", "type": "string"},
    {"name": "email", "type": "string"},
    {"name": "phone", "type": ["null", "string"], "default": null}
  ]
}

Change Data Capture (CDC)

CDC is the pattern of capturing every row-level change in a database (insert, update, delete) and streaming those changes to downstream systems. This enables real-time data pipelines without expensive polling queries, and is the foundation for the "database as event log" architecture.
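A CDC change event is typically an operation type plus a row image. The sketch below shows a downstream consumer applying such events to a materialized view; the event shape is loosely Debezium-like but simplified, and the field names are assumptions for illustration:

```python
# Simplified change event: {"op": "c"|"u"|"d", "key": ..., "after": row or None}
def apply_change(event: dict, store: dict) -> None:
    """Apply one row-level change to a downstream materialized view."""
    if event["op"] == "d":                 # delete
        store.pop(event["key"], None)
    else:                                  # create or update: upsert the row image
        store[event["key"]] = event["after"]

store = {}
events = [
    {"op": "c", "key": 1, "after": {"id": 1, "email": "a@example.com"}},
    {"op": "u", "key": 1, "after": {"id": 1, "email": "a@new.example.com"}},
    {"op": "d", "key": 1, "after": None},
]
for e in events:
    apply_change(e, store)
# store is empty again: replaying the change stream tracks the source table exactly
```

Because the stream is ordered per key, replaying it from the beginning rebuilds the downstream state, which is what makes CDC the foundation of the "database as event log" architecture.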

Exactly-Once Semantics

At-least-once delivery (retries on failure) is easy. Exactly-once processing is hard because it requires coordination between the consumer, the processing logic, and the output store. The usual pragmatic approach is "at-least-once delivery + idempotent processing = effectively exactly-once."

Delivery GuaranteeRiskHow to Achieve
At-most-onceData loss possible — message consumed, crash before ackAck before processing (simplest, worst for data pipelines)
At-least-onceDuplicate processing — message processed, crash before ack → retryAck only after successful processing; idempotent consumers handle duplicates
Exactly-onceNeither loss nor duplicatesKafka transactions + transactional producers/consumers; or idempotent upserts in sink
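The "at-least-once delivery + idempotent processing" recipe looks like this in miniature; the message IDs and the in-memory stores stand in for a real broker and database:

```python
def process(msg: dict, balances: dict, seen_ids: set) -> None:
    """Idempotent consumer: duplicates of the same message have no extra effect."""
    if msg["id"] in seen_ids:          # duplicate delivery -> skip
        return
    balances[msg["account"]] = balances.get(msg["account"], 0) + msg["amount"]
    seen_ids.add(msg["id"])            # in production: same transaction as the write

balances, seen = {}, set()
msg = {"id": "m-1", "account": "acct-9", "amount": 50}
process(msg, balances, seen)
process(msg, balances, seen)           # redelivery after a lost ack
# the account is credited exactly once despite two deliveries
```

The critical detail is the comment on the last line of `process`: the dedupe record and the state change must commit atomically, otherwise a crash between them reintroduces the duplicate.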

Orchestration: Airflow vs. Dagster vs. Prefect

ToolParadigmStrengthsWeaknessesChoose When
AirflowDAG-based, Python operators, scheduler-centricMature, large ecosystem, battle-tested at scale, rich UIDynamic DAGs are painful, no native data-awareness, scaling scheduler is hard, slow iterationTeam has Airflow experience, large existing DAG library, GCP (Cloud Composer)
DagsterAsset-centric, data-aware, software-defined assetsStrong type system, lineage tracking, testing-friendly, asset graph UI, built-in sensorsSteeper learning curve, smaller community than AirflowNew data platform build, want strong testing + lineage from day 1, dbt + Spark integration
PrefectFlow-centric, Python-native, easy to startMinimal boilerplate, hybrid execution model, fast onboardingAsset-awareness less mature than DagsterTeam wants fastest time-to-value, mixed Python/cloud workloads

Data Quality Frameworks

Data quality failures are production incidents. The pipeline runs successfully but produces wrong numbers. Prevention requires validation at every layer.
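A minimal sketch of one such validation layer, run against a batch before it is published downstream; the column name and the thresholds are illustrative assumptions:

```python
def validate_batch(rows: list, expected_min_rows: int = 100,
                   max_null_rate: float = 0.01) -> list:
    """Return a list of data-quality violations; an empty list means the batch passes."""
    problems = []
    if len(rows) < expected_min_rows:
        problems.append(f"row count {len(rows)} below minimum {expected_min_rows}")
    if rows:
        null_emails = sum(1 for r in rows if r.get("email") is None)
        if null_emails / len(rows) > max_null_rate:
            problems.append(f"null rate for email: {null_emails / len(rows):.1%}")
    return problems
```

The essential property is that a non-empty result blocks publication: a batch that fails validation should halt the pipeline loudly rather than feed wrong numbers to dashboards. Frameworks like Great Expectations generalize this pattern with declarative expectation suites.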

14. ML System Design

Training vs. Serving Architecture

# ML system high-level architecture
#
# ┌────────────────────────────────────────────────────────────┐
# │  Offline (Training)                                        │
# │  Feature Store ──▶ Training Pipeline ──▶ Model Registry   │
# │  (batch features)    (Spark/Python)       (versioned)      │
# └────────────────────────────────────────────────────────────┘
#          │                                       │
#          │  Feature pipeline populates           │ Deploy
#          ▼                                       ▼
# ┌────────────────────────────────────────────────────────────┐
# │  Online (Serving)                                          │
# │  Request ──▶ Feature Store ──▶ Model Server ──▶ Response   │
# │              (online features,  (TF Serving,              │
# │               low-latency)       TorchServe)              │
# └────────────────────────────────────────────────────────────┘

Feature Stores

A feature store solves the "train-serve skew" problem: training features are computed differently from serving features, leading to degraded model performance in production. The store provides a single definition for each feature that is used in both training (batch reads) and serving (low-latency online reads).

DimensionOffline StoreOnline Store
PurposeHistorical feature retrieval for trainingLow-latency feature retrieval for serving
StorageData warehouse (BigQuery, Snowflake, Parquet on S3)Redis, DynamoDB, Bigtable, Cassandra
Latency targetBatch (minutes to hours)< 10ms P99
OperationsPoint-in-time correct joins, training dataset generationLookup by entity key (user_id, item_id)
ExamplesFeast, Tecton, Hopsworks, Vertex AI Feature Store, SageMaker Feature Store

Batch vs. Real-Time Inference

TypeLatencyUse ForExample
Batch inferenceMinutes to hoursPre-compute predictions for all entities; results stored in a lookup tableDaily churn prediction for all users; nightly recommendation pre-computation
Near-real-timeSecondsScoring on recent events via streaming pipelineFraud scoring on completed transaction via Kafka + Flink
Online inferenceMillisecondsScore per request in the user's critical pathSearch ranking, content recommendation, ad CTR prediction
Decision Framework: When to Use Online Inference
Online (real-time) inference is the most expensive option in latency budget, GPU cost, and complexity. Choose it only when the prediction must be personalized to the current request context (e.g., "given this exact search query + user context right now"). For recommendations that depend only on user history, batch inference to a lookup table is simpler and cheaper. For fraud, the transaction doesn't exist until the request arrives, so online inference is required.

Model Monitoring and Drift Detection

A model that passes all offline tests can degrade silently in production. Production ML monitoring is a different discipline from application monitoring: the inputs drift even when the code does not change.
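One standard drift signal is the Population Stability Index (PSI) between a feature's training-time distribution and its live distribution. A common rule of thumb treats PSI above 0.2 as significant drift; the bucket proportions below are invented for illustration:

```python
import math

def psi(expected: list, actual: list) -> float:
    """Population Stability Index between two bucketed distributions (proportions)."""
    eps = 1e-6                      # guard against empty buckets
    total = 0.0
    for e, a in zip(expected, actual):
        e, a = max(e, eps), max(a, eps)
        total += (a - e) * math.log(a / e)
    return total

train_dist = [0.25, 0.25, 0.25, 0.25]      # feature histogram at training time
live_dist  = [0.10, 0.20, 0.30, 0.40]      # same buckets, production traffic
# psi(train_dist, train_dist) is 0; the shifted live distribution scores above 0.2
```

PSI covers input drift only; label drift and prediction-quality drift need their own signals, usually with delayed ground-truth joins.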

15. Observability-Driven Design

The Three Pillars — and What They Actually Mean

PillarWhat It Gives YouKey ToolingDesign Implication
MetricsNumeric time-series aggregates — request rate, latency percentiles, error rate, saturationPrometheus, Datadog, CloudWatch, GrafanaInstrument every service boundary with rate/error/duration (RED) or utilization/saturation/errors (USE)
LogsDiscrete events with context — what happened at a specific momentELK/EFK, Datadog Logs, CloudWatch Logs, LokiStructure logs as JSON from day 1; include trace_id, user_id, request_id in every log line
TracesEnd-to-end request path across services — latency breakdown by hopJaeger, Zipkin, Datadog APM, AWS X-Ray, OpenTelemetryInstrument at service entry/exit with span propagation; traces become your architecture documentation

Structured Logging from Day 1

Unstructured log lines (ERROR: failed to process order 12345 for user 67890) require regex parsing to query. Structured JSON logs are queryable immediately with any log platform. The ROI compounds: every grep replaced by a Kibana filter saves minutes per incident.

// Bad: unstructured string
"ERROR: failed to process order 12345 for user 67890 in 2.3s"

// Good: structured JSON — every field queryable without parsing
{
  "level": "error",
  "timestamp": "2024-01-15T10:23:45.123Z",
  "service": "order-service",
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "span_id": "00f067aa0ba902b7",
  "user_id": "67890",
  "order_id": "12345",
  "duration_ms": 2300,
  "error": "payment_service_timeout",
  "message": "order processing failed"
}

SLO-Based Alerting

Alerting on raw metrics (CPU > 80%, error rate > 1%) produces alert fatigue without actionability. Alerting on SLO burn rate gives you signal proportional to user impact.
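Burn rate is the ratio of the observed error rate to the rate the SLO allows. The fast-burn threshold below follows the widely used multi-window convention popularized by Google's SRE material; treat the exact multiplier as an assumption to tune for your own SLO window:

```python
def burn_rate(observed_error_rate: float, slo: float) -> float:
    """How many times faster than 'budget-neutral' we are consuming error budget."""
    return observed_error_rate / (1 - slo)

# SLO 99.9% -> allowed steady-state error rate 0.1%
# A 1.5% error rate over the last hour burns budget ~15x too fast:
rate = burn_rate(0.015, 0.999)
page = rate >= 14.4      # fast-burn page: ~2% of a 30-day budget gone in one hour
```

The payoff over raw-metric alerts: a 0.2% error rate on a 99% SLO is quiet (burn rate 0.2), while the same 0.2% on a 99.9% SLO pages (burn rate 2), because the alert scales with actual budget consumption.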

Cardinality Management in Metrics

High-cardinality labels in metrics systems cause memory explosions. Prometheus stores a time series per unique combination of label values. Adding user_id as a label to a request metric creates a new time series for every user — potentially millions.

High-Cardinality Labels Will Crash Your Metrics Store
Never use user_id, order_id, trace_id, IP address, or any unbounded identifier as a Prometheus label. Use low-cardinality labels only: service name, region, endpoint name, HTTP method, status code bucket. For high-cardinality analysis (per-user, per-request), use logs or traces — they are designed for it. Metrics are for aggregate trends, not individual events.
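The explosion is simple multiplication: the number of time series is the product of distinct values across all labels, so one unbounded label multiplies everything else:

```python
# Time series count = product of distinct values per label
services, endpoints, methods, status_buckets = 20, 50, 4, 5
safe = services * endpoints * methods * status_buckets
# safe == 20_000 series: comfortable for a metrics store

users = 1_000_000
exploded = safe * users      # adding user_id as a label
# exploded == 20_000_000_000 series: memory blow-up; per-user data belongs in logs/traces
```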

16. Migration & Evolution Strategies

Strangler Fig Pattern

The strangler fig is the canonical pattern for migrating off a legacy system without a big-bang rewrite. The name comes from the strangler fig, a vine that grows around a host tree and eventually replaces it. The key insight: you never turn off the old system until the new system fully handles its workload.

# Strangler Fig implementation via reverse proxy
#
# Phase 1: New system handles ONE capability; proxy routes that path to new
# Client ──▶ Reverse Proxy ──┬──▶ Legacy System (all paths)
#                             └──▶ New System (/api/orders only)
#
# Phase 2: Migrate more capabilities; adjust proxy routing
# Phase 3: Legacy system handles only remaining capabilities
# Phase 4: All capabilities migrated; decommission legacy
#
# Key principle: at every phase, both systems are running in production.
# You can pause migration, roll back routing, and observe real traffic.

Dual-Write Pitfalls

During a database migration, teams often "dual-write" to both the old and new system to keep them in sync. This pattern is fragile because writes to two systems are not atomic — one may succeed and the other fail, leaving them inconsistent.
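The standard mitigation is the transactional outbox: write the entity change and an event row in one local transaction, then let a single relay publish the event to the second system. A sqlite sketch of the atomic half, with illustrative table names:

```python
import sqlite3, json

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT);
    CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT,
                         payload TEXT, published INTEGER DEFAULT 0);
""")

def create_order(order_id: int) -> None:
    """Entity write and event write commit atomically, or not at all."""
    with conn:  # one transaction: both rows land, or neither does
        conn.execute("INSERT INTO orders VALUES (?, ?)", (order_id, "created"))
        conn.execute("INSERT INTO outbox (payload) VALUES (?)",
                     (json.dumps({"event": "order_created", "id": order_id}),))

create_order(1)
# A separate relay polls `outbox WHERE published = 0`, publishes to the second
# system, and marks rows published -- at-least-once delivery, never lost.
```

Because the relay retries, the downstream consumer must be idempotent, which connects this pattern back to the exactly-once discussion above.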

Blue-Green vs. Canary Deployments

StrategyHow It WorksProsCons
Blue-GreenMaintain two identical environments; cut over traffic all-at-once from blue to greenInstant rollback (flip traffic back), clean environment for each deployDoubles infrastructure cost; all-or-nothing means bugs affect all users immediately after cutover
CanaryRoute a small percentage (1-5%) of traffic to new version; expand graduallyLimits blast radius; catches real-world issues before full rollout; no idle infrastructureLonger rollout window; requires good metrics to decide when to expand
Feature flagsCode for both old and new behavior ships; flag controls which path is active per userDecouple deploy from release; target specific users/cohorts; instant kill switchTechnical debt if flags aren't cleaned up; code complexity with many flags active simultaneously

Database Schema Migration Without Downtime

Schema migrations on live databases require careful sequencing to avoid locking tables or breaking running application instances.

Adding a Column (Safe)

-- Step 1: Add column as nullable (instant, no lock)
ALTER TABLE orders ADD COLUMN shipped_at TIMESTAMPTZ;

-- Step 2: Backfill in batches to avoid long locks
UPDATE orders SET shipped_at = created_at
WHERE id BETWEEN 1 AND 10000 AND shipped_at IS NULL;
-- Repeat in batches...

-- Step 3: After backfill, set a DEFAULT for new rows, then enforce NOT NULL
-- (SET DEFAULT is metadata-only -- no table rewrite or scan)
ALTER TABLE orders ALTER COLUMN shipped_at SET DEFAULT now();
-- Adding NOT NULL on fully populated column — use NOT VALID trick:
ALTER TABLE orders ADD CONSTRAINT orders_shipped_at_not_null
  CHECK (shipped_at IS NOT NULL) NOT VALID;
ALTER TABLE orders VALIDATE CONSTRAINT orders_shipped_at_not_null;
-- VALIDATE acquires ShareUpdateExclusive lock (non-blocking for reads/writes)
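The "repeat in batches" step above is usually driven by a small loop keyed on the primary key, committing between batches so no single transaction holds locks for long. A sketch of that driver, using sqlite in place of Postgres and an illustrative table and batch size:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, "
             "created_at TEXT, shipped_at TEXT)")
conn.executemany("INSERT INTO orders (id, created_at) VALUES (?, ?)",
                 [(i, "2024-01-01") for i in range(1, 25_001)])
conn.commit()

BATCH = 10_000
max_id = conn.execute("SELECT MAX(id) FROM orders").fetchone()[0]
for start in range(0, max_id, BATCH):
    with conn:  # short transaction per batch keeps lock windows small
        conn.execute(
            "UPDATE orders SET shipped_at = created_at "
            "WHERE id > ? AND id <= ? AND shipped_at IS NULL",
            (start, start + BATCH))
```

In production you would also sleep briefly between batches and watch replication lag; the structure of the loop is the important part.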

Renaming a Column (Expand-Contract)

-- NEVER: ALTER TABLE orders RENAME COLUMN user_id TO customer_id;
-- This instantly breaks any running app that uses the old name.
--
-- INSTEAD (expand-contract over 3 deploys):
-- Deploy 1: Add new column, dual-write in app code
ALTER TABLE orders ADD COLUMN customer_id BIGINT;

-- Deploy 2: Backfill, read from new column, still write to both
UPDATE orders SET customer_id = user_id WHERE customer_id IS NULL;

-- Deploy 3: App reads and writes only new column; drop old
ALTER TABLE orders DROP COLUMN user_id;

Managing Technical Debt at Scale

Technical debt is not a failure — it is a natural consequence of building under uncertainty. The failure is not tracking it, not paying it down strategically, and letting it compound to the point where velocity collapses.

The Parallel Architecture Change Pattern
When you need to change a core architectural component (swap database, rewrite a service), run old and new in parallel with shadowed traffic rather than doing a flag day. Shadow the new system with a copy of production traffic; compare outputs. Only when the new system handles 100% of shadow traffic correctly do you start shifting real traffic. This removes the pressure of a hard cutover deadline and lets you find edge cases safely.