Distributed Systems Refresher
Consensus, replication, partitioning, consistency models, and practical patterns for modern software and data engineering
Setup & Environment
3-Node etcd Cluster (Consensus Playground)
etcd is the backing store for Kubernetes and a canonical Raft implementation. Spin up three nodes locally to watch leader election and log replication in action.
# docker-compose.yml — 3-node etcd cluster
version: "3.8"
services:
etcd0:
image: quay.io/coreos/etcd:v3.5.9
environment:
ETCD_NAME: etcd0
ETCD_DATA_DIR: /etcd-data
ETCD_LISTEN_CLIENT_URLS: http://0.0.0.0:2379
ETCD_ADVERTISE_CLIENT_URLS: http://etcd0:2379
ETCD_LISTEN_PEER_URLS: http://0.0.0.0:2380
ETCD_INITIAL_ADVERTISE_PEER_URLS: http://etcd0:2380
ETCD_INITIAL_CLUSTER: etcd0=http://etcd0:2380,etcd1=http://etcd1:2380,etcd2=http://etcd2:2380
ETCD_INITIAL_CLUSTER_STATE: new
ports: ["2379:2379"]
etcd1:
image: quay.io/coreos/etcd:v3.5.9
environment:
ETCD_NAME: etcd1
ETCD_DATA_DIR: /etcd-data
ETCD_LISTEN_CLIENT_URLS: http://0.0.0.0:2379
ETCD_ADVERTISE_CLIENT_URLS: http://etcd1:2379
ETCD_LISTEN_PEER_URLS: http://0.0.0.0:2380
ETCD_INITIAL_ADVERTISE_PEER_URLS: http://etcd1:2380
ETCD_INITIAL_CLUSTER: etcd0=http://etcd0:2380,etcd1=http://etcd1:2380,etcd2=http://etcd2:2380
      ETCD_INITIAL_CLUSTER_STATE: new
    ports: ["23791:2379"]  # publish client port so etcdctl can reach etcd1 from the host
etcd2:
image: quay.io/coreos/etcd:v3.5.9
environment:
ETCD_NAME: etcd2
ETCD_DATA_DIR: /etcd-data
ETCD_LISTEN_CLIENT_URLS: http://0.0.0.0:2379
ETCD_ADVERTISE_CLIENT_URLS: http://etcd2:2379
ETCD_LISTEN_PEER_URLS: http://0.0.0.0:2380
ETCD_INITIAL_ADVERTISE_PEER_URLS: http://etcd2:2380
ETCD_INITIAL_CLUSTER: etcd0=http://etcd0:2380,etcd1=http://etcd1:2380,etcd2=http://etcd2:2380
      ETCD_INITIAL_CLUSTER_STATE: new
    ports: ["23792:2379"]  # publish client port so etcdctl can reach etcd2 from the host
# Start the cluster
docker compose up -d
# Watch leader election
etcdctl --endpoints=http://localhost:2379 endpoint status --write-out=table
# Put/get a key — observe that any node serves reads
etcdctl --endpoints=http://localhost:2379 put /config/feature_flag "enabled"
etcdctl --endpoints=http://localhost:2379 get /config/feature_flag
# Simulate leader failure — stop the leader container and watch re-election
docker compose stop etcd0
# Query a surviving node (assumes etcd1's client port is published on host port 23791)
etcdctl --endpoints=http://localhost:23791 endpoint status --write-out=table
Tooling
# CLI tools for hands-on practice
brew install etcd # etcdctl for consensus experiments
brew install redis # redis-cli for replication / cache demos
brew install kafka # kafka-topics, kafka-console-producer/consumer
# Python clients
pip install grpcio redis kafka-python confluent-kafka requests
# Chaos engineering (simulate partitions)
# Toxiproxy — inject latency/drops between containers
docker pull ghcr.io/shopify/toxiproxy   # image moved from Docker Hub to GHCR
Why Distributed Systems
Single-Machine Limits
Every computer has a ceiling. At some point you hit one of these walls:
- CPU: A single socket maxes out at ~96–128 cores (2025). Compute-heavy workloads (ML training, video encoding) exceed this.
- Memory: A single machine typically supports up to ~6 TB RAM. Some in-memory databases exceed this.
- Disk I/O: Even NVMe SSDs saturate at ~7 GB/s. A 10 PB data lake does not fit on one machine.
- Availability: Any single machine will fail (hardware fault, OS crash, maintenance). Uptime > 99.99% requires redundancy.
- Geographic latency: Serving users in Tokyo from a single US data center adds 150 ms of round-trip time — unavoidable with one location.
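The geographic-latency point has a hard physical floor worth checking. A back-of-envelope sketch in Python — the distance and fiber speed are rough assumptions, not measurements:

```python
# Back-of-envelope: why one US data center cannot serve Tokyo quickly.
# Assumptions: ~8,300 km great-circle distance, light in fiber ~ 2/3 c.
DISTANCE_KM = 8_300
LIGHT_IN_FIBER_KM_S = 200_000  # roughly 2/3 the speed of light in vacuum

one_way_ms = DISTANCE_KM / LIGHT_IN_FIBER_KM_S * 1000
rtt_ms = 2 * one_way_ms
print(f"theoretical minimum RTT: {rtt_ms:.0f} ms")
# Real paths add routing, queuing, and TLS handshakes — ~150 ms in practice.
```

Even a perfect network cannot beat ~80 ms round-trip; only a local presence can.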
Horizontal vs Vertical Scaling
| Dimension | Vertical (Scale Up) | Horizontal (Scale Out) |
|---|---|---|
| Approach | Bigger machine (more RAM, cores) | More machines |
| Cost | Superlinear — doubling specs costs 4-8x | Linear — commodity hardware |
| Downtime | Usually requires reboot | Rolling updates, no downtime |
| Ceiling | Hard limit of largest available instance | Effectively unlimited |
| Complexity | Low — single process | High — distributed coordination needed |
| When to use | Stateful, hard-to-partition workloads | Stateless services, large data volumes |
The 8 Fallacies of Distributed Computing
L. Peter Deutsch (Sun Microsystems, 1994) listed seven assumptions developers incorrectly make about networks; James Gosling added the eighth in 1997. Every one of these will eventually bite you in production:
- The network is reliable. Packets are dropped. NICs fail. Cables are unplugged.
- Latency is zero. A local function call takes nanoseconds. A network call takes milliseconds — 6 orders of magnitude slower.
- Bandwidth is infinite. Serializing a 1 GB object and sending it over a 1 Gbps link takes 8 seconds.
- The network is secure. Data traverses shared infrastructure. TLS, mTLS, and zero-trust are non-optional.
- Topology doesn't change. Load balancers reconfigure, containers restart, IPs change. Hard-coding addresses breaks things.
- There is one administrator. Multiple teams own different services. Coordination and contracts (APIs) matter.
- Transport cost is zero. Serialization (JSON, Protobuf), compression, and network I/O have real CPU and time costs.
- The network is homogeneous. Different services may run on different OS versions, hardware, and network segments with different MTUs.
CAP Theorem & PACELC
CAP
Eric Brewer's CAP theorem (conjectured in 2000, proved by Gilbert & Lynch in 2002) states that a distributed system can guarantee at most two of the following three properties simultaneously. Since real networks do partition, the practical choice is between consistency and availability while a partition lasts:
- Consistency (C): Every read receives the most recent write or an error. All nodes see the same data at the same time.
- Availability (A): Every request receives a non-error response (but it may not contain the most recent write).
- Partition Tolerance (P): The system continues operating even when network messages between nodes are lost or delayed.
# ASCII: the CAP trade-off during a network partition
#
# Node A ----X---- Node B (partition: messages dropped)
#
# CP choice: refuse to serve reads/writes until partition heals
# (correct data, but some requests fail)
#
# AP choice: serve reads from local state, accept writes locally
# (system stays up, but nodes may diverge)
Real-World Examples
| System | Default behavior | Rationale |
|---|---|---|
| etcd / ZooKeeper | CP | Leader-based; refuses writes without quorum. Used for config/coordination where correctness is paramount. |
| Postgres (single primary) | CP | Synchronous standbys block on partition; async standbys can diverge (AP-like). |
| Cassandra / DynamoDB | AP (tunable) | Leaderless; always accepts writes, reconciles later. Tunable consistency per operation. |
| Redis Sentinel | Neither strictly | Promotes a new primary on failure but can lose acknowledged writes during the promotion window. |
| Kafka (producer acks=all) | CP | Waits for ISR to acknowledge before confirming write. |
PACELC Extension
Daniel Abadi (2010) extended CAP with PACELC: even when there is no partition, you still must trade off Latency (L) against Consistency (C). Stronger consistency requires coordination — coordination takes time.
- PA/EL: During partition choose Availability; else choose Low latency. (Cassandra default, DynamoDB)
- PC/EC: During partition choose Consistency; else choose Consistency. (etcd, Google Spanner)
- PA/EC: During partition choose Availability; else choose Consistency. (MongoDB)
Spanner achieves PC/EC globally via GPS + atomic clocks (TrueTime) — consistency without sacrificing much latency, but only by spending heavily on infrastructure.
Consistency Models
Consistency models define the contract between a distributed system and its clients: what can a client observe after a write? Ordered from strongest to weakest:
Strong Consistency (Linearizability)
Linearizability is the gold standard. Every operation appears to take effect instantaneously at some single point in time, and all clients observe a single, consistent global order. Once a write completes, all subsequent reads — from any client, any node — see that write.
- Requires coordination (consensus) on every write.
- Examples: etcd, ZooKeeper, Google Spanner, CockroachDB.
- Cost: higher latency, lower throughput.
Sequential Consistency
Weaker than linearizability. Operations appear in a consistent order across all processes, but that order does not have to match real wall-clock time. All clients see the same sequence, but it may lag reality.
- Examples: Early CPU memory models, some distributed databases in single-DC mode.
Causal Consistency
If operation A causally precedes operation B (A happened before B, and B observed A's effects), then all nodes see A before B. Operations with no causal relationship may be seen in any order.
- Implemented via vector clocks or causal tokens.
- Examples: MongoDB (causal sessions), some Cassandra patterns.
- Good middle ground — much cheaper than linearizability, stronger than eventual.
Eventual Consistency
If no new writes are issued, all replicas will eventually converge to the same value. No guarantee on when, no guarantee on ordering.
- Examples: Cassandra (default), DynamoDB (default), DNS, S3.
- Cheapest option — great for shopping carts, social feeds, analytics.
- Bad for anything that requires "read your own writes" by default.
Session Guarantees
Practical consistency properties for a single client session:
- Read-your-writes: After a write, your subsequent reads reflect that write. Critical for user-facing apps (e.g., you post a comment and immediately see it).
- Monotonic reads: Once you read a value, you never read an older value. You won't "go back in time."
- Monotonic writes: Your writes are applied in the order you issued them.
- Writes follow reads: A write that follows a read is applied after the value you read.
Linearizability vs Serializability
These are often confused in interviews:
| Property | Linearizability | Serializability |
|---|---|---|
| Domain | Single operations (reads/writes) | Transactions (multiple ops) |
| Guarantee | Each op appears atomic at a point in time | Transaction outcome = some serial order |
| Respects real time? | Yes | No (serial order may not match wall clock) |
Both together are called "strict serializability" — the strongest practical guarantee, used by Spanner and FoundationDB.
Replication
Replication keeps copies of data on multiple nodes for fault tolerance (survive node failures) and read scalability (spread read load). Three main topologies:
Replication Topologies
# Single-leader (most common)
#
# Client Writes
# |
# [Leader] ──replication──> [Follower 1]
# | [Follower 2]
# └──────────────────> [Follower 3]
#
# All writes go to the leader.
# Followers replicate the leader's write-ahead log.
# Reads can be served by followers (with possible staleness).
# Multi-leader (active-active)
#
# [DC East: Leader A] <──────────> [DC West: Leader B]
# | |
# [Follower] [Follower]
#
# Each DC accepts writes. Conflicts resolved on merge.
# Used for: multi-region writes, offline clients.
# Leaderless (Dynamo-style)
#
# Client ──W──> [Node 1] (W = write quorum, e.g., 2 of 3)
# ──W──> [Node 2]
# ──W──> [Node 3]
#
# No designated leader. Client writes to W nodes, reads from R nodes.
# Consistency: W + R > N (N = total replicas)
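The quorum condition can be demonstrated with a toy in-memory sketch — the `nodes` dicts are hypothetical stand-ins for real replicas. Because W + R > N, every read set overlaps the latest write set, so the highest-versioned reply is always the newest write:

```python
import random

# Leaderless quorum sketch: N=3 replicas, write quorum W=2, read quorum R=2.
N, W, R = 3, 2, 2
nodes = [{} for _ in range(N)]  # each node maps key -> (version, value)

def write(key, value, version):
    for node in random.sample(nodes, W):        # any W of the N nodes
        node[key] = (version, value)

def read(key):
    replies = [n[key] for n in random.sample(nodes, R) if key in n]
    return max(replies)[1] if replies else None  # highest version wins

write("x", "v1", version=1)
write("x", "v2", version=2)
assert W + R > N                # the quorum-overlap condition
print(read("x"))                # always "v2" — read set overlaps latest write set
```

Drop R to 1 and the overlap guarantee disappears: a read can land entirely on a stale node.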
Synchronous vs Asynchronous Replication
| Synchronous | Asynchronous | |
|---|---|---|
| Write latency | High — waits for replica ack(s) | Low — returns after leader writes |
| Durability | No data loss on leader failure | Possible data loss (replication lag) |
| Availability | Lower — one slow replica blocks writes | Higher — replica failures don't block |
| Use when | Financial data, audit logs | User activity feeds, analytics events |
Replication Lag
Async replication means followers can be seconds (or minutes) behind the leader. Effects:
- Read-your-writes broken: User posts, is redirected to follower, doesn't see their post.
- Monotonic reads broken: Two requests hit different followers; second shows older data.
- Causality violations: User sees a reply before the original message.
Solutions: sticky sessions (always read from same replica), read from primary for sensitive operations, use causal consistency tokens.
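The first two solutions amount to a routing rule: pin a user's reads to the primary for a window longer than the worst replication lag. A minimal sketch — the lag bound and helper names are made up for illustration:

```python
import time

# Read-your-writes routing sketch: after a user writes, route that user's
# reads to the primary until replication has certainly caught up.
MAX_REPLICA_LAG_S = 5.0            # assumed worst-case replication lag
last_write_at: dict[str, float] = {}

def record_write(user_id: str) -> None:
    last_write_at[user_id] = time.monotonic()

def choose_target(user_id: str) -> str:
    since_write = time.monotonic() - last_write_at.get(user_id, float("-inf"))
    return "primary" if since_write < MAX_REPLICA_LAG_S else "replica"

record_write("alice")
print(choose_target("alice"))  # primary — she just wrote
print(choose_target("bob"))    # replica — no recent write
```

This preserves read-your-writes for the writer while keeping the read fan-out on replicas for everyone else.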
Conflict Resolution
- Last-write-wins (LWW): Highest timestamp wins. Simple but lossy — when two clients write concurrently, one write is silently dropped.
- CRDTs (Conflict-free Replicated Data Types): Data structures that merge deterministically. G-counters, OR-Sets, LWW-Register. Used in Redis, Riak, collaborative editors.
- Application-level: Expose conflicts to the application (Amazon shopping cart). Application decides how to merge.
- Operational transformation: Used in Google Docs for real-time collaborative editing.
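The simplest CRDT mentioned above, a G-counter, fits in a few lines. A sketch, not tied to any particular library:

```python
# G-counter CRDT: each replica increments only its own slot; merge is
# element-wise max, which is commutative, associative, and idempotent —
# so replicas converge no matter how merges are ordered or repeated.
class GCounter:
    def __init__(self, replica_id: str):
        self.replica_id = replica_id
        self.counts: dict[str, int] = {}

    def increment(self, n: int = 1) -> None:
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + n

    def merge(self, other: "GCounter") -> None:
        for rid, c in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), c)

    def value(self) -> int:
        return sum(self.counts.values())

a, b = GCounter("a"), GCounter("b")
a.increment(3); b.increment(2)     # concurrent updates on two replicas
a.merge(b); b.merge(a)             # exchange state in any order
print(a.value(), b.value())        # both converge to 5
```

An OR-Set and LWW-Register follow the same pattern: state plus a deterministic, conflict-free merge.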
Practical: Postgres & Redis Replication
# Postgres streaming replication setup
# In postgresql.conf on primary:
wal_level = replica
max_wal_senders = 5
wal_keep_size = 256MB # keep 256MB of WAL for replicas
# In pg_hba.conf on primary (allow replica user):
# host replication replicator replica_ip/32 md5
# On replica — pg_basebackup then recovery.conf (PG < 12) or
# standby.signal + postgresql.conf (PG >= 12):
primary_conninfo = 'host=primary_ip port=5432 user=replicator'
# touch standby.signal
# Check replication lag (run on primary):
SELECT client_addr,
       state,
       pg_wal_lsn_diff(sent_lsn, write_lsn)  AS write_lag_bytes,
       pg_wal_lsn_diff(sent_lsn, flush_lsn)  AS flush_lag_bytes,
       pg_wal_lsn_diff(sent_lsn, replay_lsn) AS replay_lag_bytes
FROM pg_stat_replication;
# Redis replication — one-liner on replica
redis-cli REPLICAOF primary_host 6379
# Check replication status
redis-cli INFO replication
# look for master_replid, master_repl_offset, and per-replica lag (seconds)
# Kafka replication — ISR (In-Sync Replicas)
# A partition is replicated to ISR nodes.
# Producer with acks=all waits for all ISR to write.
# min.insync.replicas=2 — writes fail unless at least 2 replicas are in sync (guards against losing acknowledged writes)
kafka-topics.sh --describe --topic orders --bootstrap-server localhost:9092
# Partition: 0 Leader: 1 Replicas: 1,2,3 Isr: 1,2,3
Partitioning (Sharding)
When a dataset is too large for one node, you partition it across multiple nodes. Each partition owns a subset of the data. Also called sharding.
Partitioning Strategies
Range-Based Partitioning
Assign contiguous key ranges to partitions: keys A–M go to Shard 1, N–Z go to Shard 2. Natural for time-series data (partition by date).
- Pro: Range queries are efficient — scan one partition.
- Con: Hotspots. If most writes go to "today's" partition, that node is overloaded while others sit idle.
Hash-Based Partitioning
Hash the partition key and assign to a bucket: partition = hash(key) % N. Spreads load evenly.
- Pro: Uniform load distribution, no hotspots for random keys.
- Con: Range queries require hitting all partitions (scatter-gather).
Consistent Hashing
Simple hash(key) % N fails when you add or remove nodes — it remaps almost all keys. Consistent hashing solves this:
# Consistent hashing ring (conceptual)
#
# 0
# |
# Node C | Node A
# |
# 270 --+-- 90
# |
# Node D | Node B
# |
# 180
#
# Keys map to the first node clockwise from hash(key).
# Adding Node E between A and B remaps only the keys
# that used to map to B in A's arc — not the entire dataset.
#
# Virtual nodes: each physical node gets multiple points on the ring
# (e.g., 150 virtual nodes per server). This evens out load
# when nodes have different capacities.
Used by: Cassandra (virtual nodes), DynamoDB, and Memcached clients. Redis Cluster uses a related scheme with 16,384 fixed hash slots rather than a ring.
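A minimal ring with virtual nodes can be sketched in Python. MD5 and 100 vnodes are arbitrary choices here; real systems use faster hashes and tune vnode counts:

```python
import bisect
import hashlib

# Consistent-hashing ring sketch with virtual nodes (illustration only).
class HashRing:
    def __init__(self, nodes, vnodes=100):
        self.ring = []                           # sorted (hash, node) points
        for node in nodes:
            for i in range(vnodes):
                h = self._hash(f"{node}#{i}")
                bisect.insort(self.ring, (h, node))

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def lookup(self, key: str) -> str:
        h = self._hash(key)
        i = bisect.bisect(self.ring, (h, "")) % len(self.ring)
        return self.ring[i][1]                   # first node clockwise

ring = HashRing(["node-a", "node-b", "node-c"])
before = {f"key{i}": ring.lookup(f"key{i}") for i in range(1000)}
ring2 = HashRing(["node-a", "node-b", "node-c", "node-d"])  # add a node
moved = sum(before[k] != ring2.lookup(k) for k in before)
print(f"{moved / 10:.0f}% of keys moved")  # roughly a quarter, not ~100% as with mod-N
```

With naive `hash(key) % N`, going from 3 to 4 nodes remaps about 75% of keys; the ring remaps roughly `1/N`.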
Hotspots & Rebalancing
Even hash partitioning can create hotspots if your key distribution is skewed. Classic example: a celebrity's user_id generates 10,000x more traffic than average users.
- Salting: Append a random suffix to the hot key (`user_123_0` through `user_123_9`), spreading writes across partitions. Reads must query all suffixes and aggregate.
- Dedicated partition: Detect hot keys and route them to their own oversized partition.
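Salting can be sketched against a plain dict standing in for a key-value store; the bucket count and helper names are illustrative:

```python
import random

# Hot-key salting sketch: writes to one hot counter are spread across
# 10 salted sub-keys; reads fan out over all sub-keys and aggregate.
SALT_BUCKETS = 10
store: dict[str, int] = {}     # stand-in for a partitioned KV store

def incr_salted(key: str, n: int = 1) -> None:
    salted = f"{key}_{random.randrange(SALT_BUCKETS)}"   # e.g. user_123_7
    store[salted] = store.get(salted, 0) + n

def read_salted(key: str) -> int:
    return sum(store.get(f"{key}_{i}", 0) for i in range(SALT_BUCKETS))

for _ in range(1000):
    incr_salted("user_123")
print(read_salted("user_123"))  # 1000 — spread across up to 10 partitions
```

The trade is explicit: writes scale out by the bucket count, reads pay a fan-out of the same factor.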
Practical Examples
# Kafka partitions — choosing a good partition key
# BAD: partition by region (only 4 regions = 4 max parallelism)
# GOOD: partition by user_id (thousands of users = high parallelism)
# Create topic with 12 partitions, replication factor 3
kafka-topics.sh --create \
--topic order-events \
--partitions 12 \
--replication-factor 3 \
--bootstrap-server localhost:9092
# DynamoDB partition key guidance:
# - High cardinality (many distinct values)
# - Uniform access pattern (no hot key)
# - Avoid: status (only ~5 values), country (195 values but uneven)
# - Good: user_id, device_id, order_id (UUIDs)
# Vitess (MySQL sharding) — VSchema defines sharding key
# schema.json
{
"sharded": true,
"vindexes": {
"hash": { "type": "hash" }
},
"tables": {
"orders": {
"columnVindexes": [
{ "column": "customer_id", "name": "hash" }
]
}
}
}
# All queries that include customer_id are routed to the correct shard.
# Cross-shard queries (no customer_id) fan out to all shards.
Consensus
The consensus problem: how do multiple nodes agree on a single value, even in the presence of failures? Consensus is needed for leader election, distributed locks, and replicated state machines.
Raft Algorithm
Raft (Ongaro & Ousterhout, 2014) was designed to be understandable. It decomposes consensus into three sub-problems:
1. Leader Election
- Time is divided into terms (monotonically increasing integers).
- Each term begins with an election. A node starts as a Follower.
- If a follower receives no heartbeat from a leader within an election timeout (150–300 ms), it becomes a Candidate and requests votes.
- A candidate wins if it receives votes from a majority (>N/2) of nodes.
- Only one leader per term. If two candidates split votes, a new election starts in the next term.
2. Log Replication
- All writes go to the Leader. The leader appends to its log and sends `AppendEntries` RPCs to followers.
- An entry is committed once a majority of nodes have written it to their logs.
- Committed entries are durable — they will never be lost even if the current leader fails.
- Followers apply committed entries to their state machines in order.
3. Safety Guarantee
Raft's Log Matching property: if two log entries have the same index and term, they contain the same command, and all log entries before them are identical too. Safety additionally relies on the election restriction — a node refuses to vote for a candidate whose log is less up-to-date than its own — so every elected leader already contains all committed entries.
# Watching Raft in action with our etcd cluster
# etcdctl uses the Raft protocol internally
# See who the current leader is
etcdctl --endpoints=http://localhost:2379 endpoint status \
--cluster --write-out=table
# +-------------------+------------------+---------+-----+-----------+...
# |     ENDPOINT      |        ID        | VERSION | ... | IS LEADER |
# +-------------------+------------------+---------+-----+-----------+...
# | http://etcd0:2379 | 8e9e05c52164694d | 3.5.9   | ... | true      | <- leader
# | http://etcd1:2379 | ...              | 3.5.9   | ... | false     |
# | http://etcd2:2379 | ...              | 3.5.9   | ... | false     |
# Stop the leader — watch re-election (etcd's default election timeout is 1 s)
docker compose stop etcd0
sleep 2
# Query a surviving node (assumes etcd1's client port is published on host port 23791)
etcdctl --endpoints=http://localhost:23791 endpoint status \
  --cluster --write-out=table
# A different node is now leader
Paxos (Brief)
Paxos (Lamport; circulated in 1989, published in 1998) was the original consensus algorithm. It has two phases: Prepare/Promise and Accept/Accepted. Multi-Paxos adds a leader optimization for repeated consensus. Paxos is famously difficult to understand and implement correctly; Raft was explicitly designed as a more understandable alternative, and most new systems use Raft.
When You Need Consensus
- Distributed lock / mutex
- Leader election for a stateful service
- Replicated state machine (database replication)
- Atomic broadcast (all-or-nothing delivery to all nodes)
- Configuration store (who is the current primary?)
When you do NOT need consensus: if your operation is idempotent and you can tolerate brief inconsistency, use optimistic concurrency (version numbers + CAS) instead of a full consensus round.
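Version-number CAS can be sketched with an in-memory register. The lock here merely models the per-key atomicity a real store (etcd transactions, Postgres `UPDATE ... WHERE version = ...`) provides natively:

```python
import threading

# Optimistic concurrency sketch: a versioned register with compare-and-swap.
# No consensus round — a writer that loses the race simply re-reads and retries.
class VersionedRegister:
    def __init__(self, value=None):
        self._lock = threading.Lock()   # stands in for the store's atomicity
        self.value, self.version = value, 0

    def read(self):
        with self._lock:
            return self.value, self.version

    def compare_and_swap(self, expected_version: int, new_value) -> bool:
        with self._lock:
            if self.version != expected_version:
                return False            # someone else wrote first — caller retries
            self.value, self.version = new_value, self.version + 1
            return True

reg = VersionedRegister(10)
val, ver = reg.read()
assert reg.compare_and_swap(ver, val + 1)       # succeeds: version matched
assert not reg.compare_and_swap(ver, val + 99)  # fails: stale version
print(reg.read())  # (11, 1)
```

The loser of a CAS race must re-read and reapply its change — which is exactly why the operation needs to be idempotent or re-derivable.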
Practical: etcd, ZooKeeper, Consul
# etcd: distributed config / service discovery (Kubernetes uses it)
etcdctl put /services/api/primary "10.0.0.1:8080"
etcdctl get /services/api/primary
# Distributed lock with etcd (lease-based)
# Lease expires if client crashes — no deadlocks
etcdctl lease grant 30 # 30-second lease, returns lease_id
etcdctl put /locks/job-processor "worker-42" --lease=lease_id
etcdctl lease keepalive lease_id # renew while working
# ZooKeeper: leader election recipe
# Create ephemeral sequential znode: /election/leader-0000000001
# Node with lowest sequence number is leader.
# Others watch the next-lower znode (not the leader) to avoid herd effect.
# Consul: service mesh + health checking + KV store
consul kv put feature/dark-mode enabled
consul kv get feature/dark-mode
consul lock /locks/deploy "deploy.sh" # acquires lock, runs command
Distributed Transactions
Two-Phase Commit (2PC)
2PC coordinates a transaction across multiple participants (databases, services) so that either all commit or all abort.
# 2PC Protocol
#
# Phase 1 — PREPARE
# Coordinator → all participants: "Prepare to commit"
# Each participant: write to WAL, acquire locks, reply VOTE-YES or VOTE-NO
#
# Phase 2 — COMMIT (if all voted YES) / ABORT (if any voted NO)
# Coordinator → all participants: "Commit" or "Abort"
# Each participant: apply or undo, release locks, reply ACK
#
# Problem: Coordinator failure during Phase 2
# ─────────────────────────────────────────────
# Participants voted YES and are now BLOCKED — they hold locks
# but don't know whether to commit or abort.
# Recovery requires reading the coordinator's durable log.
# System is unavailable until coordinator recovers.
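The happy path of the protocol above can be simulated in-process. Class and method names are invented for illustration; real 2PC also needs durable logs and lock management on every node:

```python
# Minimal 2PC simulation: the coordinator commits only if every
# participant votes YES in the prepare phase; any NO aborts everyone.
class Participant:
    def __init__(self, name: str, will_vote_yes: bool = True):
        self.name, self.will_vote_yes = name, will_vote_yes
        self.state = "init"

    def prepare(self) -> bool:                    # Phase 1: WAL + locks
        self.state = "prepared" if self.will_vote_yes else "aborted"
        return self.will_vote_yes

    def commit(self) -> None: self.state = "committed"   # Phase 2
    def abort(self) -> None:  self.state = "aborted"

def two_phase_commit(participants) -> str:
    if all(p.prepare() for p in participants):    # Phase 1: gather votes
        for p in participants:
            p.commit()                            # Phase 2: commit all
        return "committed"
    for p in participants:
        p.abort()                                 # any NO vote aborts all
    return "aborted"

ok = two_phase_commit([Participant("db1"), Participant("db2")])
bad = two_phase_commit([Participant("db1"), Participant("db2", will_vote_yes=False)])
print(ok, bad)  # committed aborted
```

The blocking problem is invisible in this single-process sketch: if the coordinator died between the phases, the `prepared` participants would hold their locks indefinitely.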
Saga Pattern
Instead of one atomic transaction holding locks across services, a Saga decomposes a long-running transaction into a sequence of local transactions, each paired with a compensating transaction that undoes it if a later step fails.
# Saga: E-commerce checkout
#
# Step 1: Reserve inventory Compensate: Release reservation
# Step 2: Charge payment Compensate: Refund payment
# Step 3: Schedule fulfillment Compensate: Cancel fulfillment
# Step 4: Send confirmation email (no compensation needed — idempotent)
#
# If Step 3 fails → run Step 2's compensate → Step 1's compensate
# (Step 3 never completed, so it has nothing to compensate.)
#
# Key insight: no distributed locks. Each step commits immediately.
# Failure is handled by running compensations in reverse.
Choreography vs Orchestration
| | Choreography | Orchestration |
|---|---|---|
| How | Services react to events (no central coordinator) | Central orchestrator tells each service what to do |
| Coupling | Loose — services only know about events | Tighter — orchestrator knows all services |
| Visibility | Hard to trace the full saga flow | Easy — all state in orchestrator |
| Tools | Kafka, RabbitMQ events | Temporal, AWS Step Functions, Cadence |
| When to use | Simple flows, loose coupling desired | Complex flows, need full observability |
# Temporal workflow orchestration (Python SDK)
# Temporal handles retries, timeouts, and state durability automatically.
# (inventory_service / payment_service below are hypothetical clients.)
from datetime import timedelta

from temporalio import workflow, activity
from temporalio.client import Client
from temporalio.worker import Worker
@activity.defn
async def reserve_inventory(order_id: str, sku: str) -> bool:
# Call inventory service — if this crashes, Temporal retries it
return await inventory_service.reserve(order_id, sku)
@activity.defn
async def charge_payment(order_id: str, amount: float) -> str:
return await payment_service.charge(order_id, amount)
@activity.defn
async def release_inventory(order_id: str, sku: str) -> None:
await inventory_service.release(order_id, sku)
@workflow.defn
class CheckoutWorkflow:
@workflow.run
async def run(self, order_id: str, sku: str, amount: float) -> str:
        reserved = await workflow.execute_activity(
            reserve_inventory, args=[order_id, sku],   # multiple args go via args=
            start_to_close_timeout=timedelta(seconds=10)
        )
if not reserved:
return "out_of_stock"
try:
            charge_id = await workflow.execute_activity(
                charge_payment, args=[order_id, amount],
                start_to_close_timeout=timedelta(seconds=30)
            )
return charge_id
except Exception:
            # Compensate: release inventory (activities always need a timeout)
            await workflow.execute_activity(
                release_inventory, args=[order_id, sku],
                start_to_close_timeout=timedelta(seconds=10)
            )
raise
Clocks & Ordering
Why Wall Clocks Are Unreliable
Every node has its own hardware clock. Clocks drift — they can be off by tens to hundreds of milliseconds even with NTP synchronization. You cannot use wall-clock timestamps to determine the order of events across nodes. A write at 12:00:00.001 on Node A may actually have happened after a write at 12:00:00.005 on Node B if Node A's clock is ahead.
Lamport Timestamps
Leslie Lamport (1978) introduced logical clocks. Rule: if event A happened before event B, then timestamp(A) < timestamp(B). The converse is not guaranteed.
# Lamport clock implementation
class LamportClock:
def __init__(self):
self.time = 0
def tick(self):
"""Increment before each local event."""
self.time += 1
return self.time
def update(self, received_time: int):
"""On receiving a message, advance clock to max + 1."""
self.time = max(self.time, received_time) + 1
return self.time
# Usage: every message carries the sender's clock value.
# Receiver calls update(msg.timestamp) before processing.
# Guarantees: if A -> B (A causally precedes B), then L(A) < L(B).
# Does NOT guarantee: L(A) < L(B) implies A -> B (concurrent events
# may have any order).
Vector Clocks
Vector clocks track causality precisely. Each node maintains a vector of counters — one per node. Two events are either causally ordered or concurrent (incomparable).
# Vector clock — detect causality and concurrency
class VectorClock:
def __init__(self, node_id: str, nodes: list[str]):
self.node_id = node_id
self.clock = {n: 0 for n in nodes}
def tick(self):
self.clock[self.node_id] += 1
return dict(self.clock)
def update(self, received: dict):
for node, t in received.items():
self.clock[node] = max(self.clock.get(node, 0), t)
self.clock[self.node_id] += 1
    def happens_before(self, a: dict, b: dict) -> bool:
        """True if a causally precedes b."""
        keys = set(a) | set(b)
        return (all(a.get(n, 0) <= b.get(n, 0) for n in keys)
                and any(a.get(n, 0) < b.get(n, 0) for n in keys))
def concurrent(self, a: dict, b: dict) -> bool:
"""True if a and b are concurrent (neither precedes the other)."""
return not self.happens_before(a, b) and not self.happens_before(b, a)
# Amazon's Dynamo paper (2007) used vector clocks to detect conflicting writes;
# when concurrent writes are detected, the application must resolve them.
Google Spanner TrueTime
Spanner uses GPS receivers and atomic clocks in every data center. The TrueTime API returns a time interval [earliest, latest] bounding the true current time. Before committing, Spanner waits out the uncertainty interval — guaranteeing that the commit timestamp is strictly in the past for all subsequent readers. This delivers linearizability globally without logical clocks.
Event Ordering in Kafka
Kafka guarantees ordering only within a single partition. Events on partition 0 are ordered; events across partitions 0 and 1 have no ordering guarantee. This is why choosing a good partition key matters for event-driven systems that require causal ordering (e.g., all events for a given user_id go to the same partition).
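The routing rule can be sketched with a simplified stand-in for the default partitioner. Real Kafka clients hash keys with murmur2; MD5 here just illustrates the hash-then-modulo idea:

```python
import hashlib

# Simplified stand-in for Kafka's key-based partitioner.
NUM_PARTITIONS = 12

def partition_for(key: str) -> int:
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS

# All events for one user hash to one partition — so they stay ordered.
events = ["cart_add", "checkout", "payment", "shipped"]
partitions = {partition_for("user_123") for _ in events}
print(partitions)  # a single partition id
```

Events for different users may interleave across partitions, which is fine: the ordering that matters is per-user, and that is exactly what the key preserves.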
Failure Modes & Fault Tolerance
Types of Failures
- Crash failure: Node stops responding. Simplest to handle — node is either up or down.
- Network partition: Nodes are alive but cannot communicate. Most insidious — each side thinks the other is down.
- Split-brain: Two partitioned groups each elect a leader and start accepting writes independently. Reconciliation is painful.
- Byzantine failure: Node behaves arbitrarily (sends incorrect or malicious data). Requires Byzantine fault-tolerant (BFT) algorithms — used in blockchain, not typical enterprise systems.
- Slow node / grey failure: Node is alive but extremely slow. Timeouts are the only detection mechanism, but they must be tuned carefully.
Failure Detection
# Heartbeat: node A sends periodic "I'm alive" messages to node B.
# If B receives no heartbeat within timeout, it marks A as failed.
#
# Timeout tuning dilemma:
# - Too short: false positives (slow node marked as failed)
# - Too long: slow recovery time after real failures
# - Rule of thumb: timeout = 99th-percentile RTT * 3 + safety margin
#
# Phi Accrual Failure Detector (Cassandra, Akka):
# Instead of binary up/down, computes a "suspicion level" phi(t).
# phi increases as the time since last heartbeat grows.
# Application sets threshold: phi > 8 → treat as failed.
# Adapts to network conditions dynamically.
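A simplified phi accrual detector is short to write if we assume exponentially distributed heartbeat intervals; Cassandra's detector uses a normal distribution, but the shape of the idea is the same:

```python
import math
from collections import deque

# Simplified phi accrual failure detector: model heartbeat inter-arrival
# times as exponential with the observed mean, then
#   phi(t) = -log10( P(next heartbeat takes longer than elapsed) )
class PhiAccrualDetector:
    def __init__(self, window: int = 100):
        self.intervals = deque(maxlen=window)   # recent inter-arrival times
        self.last_heartbeat = None

    def heartbeat(self, now: float) -> None:
        if self.last_heartbeat is not None:
            self.intervals.append(now - self.last_heartbeat)
        self.last_heartbeat = now

    def phi(self, now: float) -> float:
        if not self.intervals:
            return 0.0
        mean = sum(self.intervals) / len(self.intervals)
        elapsed = now - self.last_heartbeat
        # -log10(exp(-elapsed/mean)) for the exponential model
        return (elapsed / mean) * math.log10(math.e)

d = PhiAccrualDetector()
for t in range(10):              # heartbeats every 1.0 s
    d.heartbeat(float(t))
print(d.phi(10.5))               # recent heartbeat → low suspicion
print(d.phi(30.0))               # 21 s of silence → phi > 8, treat as failed
```

Because phi is derived from observed intervals, a node on a congested link accumulates suspicion more slowly than one on a quiet link — the adaptivity the comment block describes.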
Circuit Breaker Pattern
Prevent cascading failures when a downstream service is struggling. The circuit has three states:
import time
from enum import Enum
class State(Enum):
CLOSED = "closed" # Normal — requests flow through
OPEN = "open" # Failing — requests blocked immediately
HALF_OPEN = "half_open" # Testing — one probe request allowed
class CircuitBreaker:
def __init__(self, failure_threshold=5, timeout=60):
self.failure_threshold = failure_threshold
self.timeout = timeout
self.failures = 0
self.state = State.CLOSED
self.opened_at = None
def call(self, func, *args, **kwargs):
if self.state == State.OPEN:
if time.time() - self.opened_at > self.timeout:
self.state = State.HALF_OPEN
else:
raise Exception("Circuit OPEN — request blocked")
try:
result = func(*args, **kwargs)
self._on_success()
return result
except Exception as e:
self._on_failure()
raise
def _on_success(self):
self.failures = 0
self.state = State.CLOSED
    def _on_failure(self):
        self.failures += 1
        # a failed half-open probe re-opens immediately; otherwise open
        # only after the failure threshold is reached
        if self.state == State.HALF_OPEN or self.failures >= self.failure_threshold:
            self.state = State.OPEN
            self.opened_at = time.time()
Retry with Exponential Backoff + Jitter
import random
import time
def retry_with_backoff(func, max_retries=5, base_delay=0.1, max_delay=30.0):
"""
Retry with full jitter: delay = random(0, min(max_delay, base * 2^attempt))
Full jitter prevents thundering herd — all retrying clients don't
hammer the recovering server at the same moment.
"""
for attempt in range(max_retries):
try:
return func()
except Exception as e:
if attempt == max_retries - 1:
raise
cap = min(max_delay, base_delay * (2 ** attempt))
delay = random.uniform(0, cap) # full jitter
time.sleep(delay)
Practical: Kubernetes Probes
# Kubernetes liveness + readiness probes
# livenessProbe: is the container alive? If not, restart it.
# readinessProbe: is the container ready for traffic? If not, remove from LB.
# startupProbe: is the container done initializing? (prevents premature liveness kills)
spec:
containers:
- name: api
image: myapp:latest
livenessProbe:
httpGet:
path: /healthz
port: 8080
initialDelaySeconds: 10
periodSeconds: 10
failureThreshold: 3 # restart after 3 consecutive failures
readinessProbe:
httpGet:
path: /ready
port: 8080
initialDelaySeconds: 5
periodSeconds: 5
failureThreshold: 2 # remove from LB after 2 consecutive failures
startupProbe:
httpGet:
path: /healthz
port: 8080
failureThreshold: 30 # up to 30 * 10s = 5 min for slow startup
periodSeconds: 10
Message Queues & Event Streaming
Delivery Semantics
| Semantic | Guarantee | Implication | Use when |
|---|---|---|---|
| At-most-once | Message delivered 0 or 1 times | Possible data loss | Metrics, telemetry — occasional loss OK |
| At-least-once | Message delivered 1 or more times | Possible duplicates — consumer must be idempotent | Most event-driven systems |
| Exactly-once | Delivered exactly 1 time | No loss, no duplicates — expensive | Financial transactions, billing |
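At-least-once plus an idempotent consumer is the workhorse combination. A dedupe sketch — in production the seen-set must live in the same transactional store as the side effects, or the dedupe itself can be lost:

```python
# Idempotent consumer sketch for at-least-once delivery: remember processed
# message ids so redelivered duplicates are skipped instead of re-applied.
processed_ids: set[str] = set()
balance = 0

def handle(message: dict) -> bool:
    """Apply a message once; return False if it was a duplicate."""
    global balance
    if message["id"] in processed_ids:
        return False                     # duplicate — already applied
    balance += message["amount"]
    processed_ids.add(message["id"])
    return True

msg = {"id": "evt-42", "amount": 100}
handle(msg)
handle(msg)                              # redelivery — no double charge
print(balance)  # 100
```

This turns at-least-once transport into effectively-once processing, which is what most "exactly-once" systems actually deliver end to end.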
Kafka Deep Dive
# Kafka: distributed, partitioned, replicated commit log
#
# Producer → [Partition 0] [Partition 1] [Partition 2] → Consumer Group
# offset 0 |
# offset 1 ← consumer tracks its offset |
# offset 2 |
# ┌──────┴──────┐
# Consumer A Consumer B
# (partition 0) (partition 1)
#
# Consumer group: each partition is consumed by exactly one consumer in the group.
# Multiple consumer groups can independently consume the same topic (event replay).
# Idempotent console producer (broker dedupes retried sends; not full transactions)
kafka-console-producer.sh \
--bootstrap-server localhost:9092 \
--topic orders \
--producer-property enable.idempotence=true \
--producer-property acks=all
# Consumer group management
kafka-consumer-groups.sh --bootstrap-server localhost:9092 --list
kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
--describe --group payment-processor
# Shows LAG (messages behind) per partition — key operational metric
# Kafka producer with transactions (exactly-once semantics)
# Note: kafka-python does not support transactions — use confluent-kafka.
from confluent_kafka import Producer
import json

producer = Producer({
    'bootstrap.servers': 'localhost:9092',
    'enable.idempotence': True,                 # idempotent delivery (deduplication)
    'acks': 'all',                              # wait for all ISR to ack
    'transactional.id': 'payment-processor-1'   # enables transactions
})

producer.init_transactions()
try:
    producer.begin_transaction()
    producer.produce('payments', json.dumps({'order_id': '123', 'amount': 99.99}).encode())
    producer.produce('order-events', json.dumps({'order_id': '123', 'status': 'paid'}).encode())
    producer.commit_transaction()    # both messages become visible atomically
except Exception:
    producer.abort_transaction()
    raise
SQS & RabbitMQ
# SQS: visibility timeout pattern
# When a consumer reads a message, it becomes invisible for `visibility_timeout`.
# If consumer crashes before deleting, message reappears — another consumer retries.
# Dead Letter Queue (DLQ): after N failed retries, message moved to DLQ for inspection.
import boto3
sqs = boto3.client('sqs', region_name='us-east-1')
queue_url = 'https://sqs.us-east-1.amazonaws.com/123456789/orders'
# Receive (message invisible for 30s)
response = sqs.receive_message(
QueueUrl=queue_url,
MaxNumberOfMessages=10,
VisibilityTimeout=30,
WaitTimeSeconds=20 # long polling — reduces empty responses
)
for msg in response.get('Messages', []):
try:
process(msg['Body'])
# Delete only after successful processing
sqs.delete_message(
QueueUrl=queue_url,
ReceiptHandle=msg['ReceiptHandle']
)
except Exception:
# Don't delete — message will reappear after visibility timeout
pass
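Wiring up the DLQ mentioned above is a queue attribute, not consumer code. A sketch that builds the `RedrivePolicy` payload (queue URL and ARN are placeholders; the boto3 call is shown for context):

```python
import json

def redrive_policy(dlq_arn: str, max_receives: int) -> dict:
    """Build the SQS attribute payload that routes poison messages to a DLQ."""
    return {
        "RedrivePolicy": json.dumps({
            "deadLetterTargetArn": dlq_arn,         # where failed messages land
            "maxReceiveCount": str(max_receives),   # receives before giving up
        })
    }

# Applied with boto3 (network call, not run here):
# sqs.set_queue_attributes(QueueUrl=queue_url,
#                          Attributes=redrive_policy(dlq_arn, 5))
```

After `maxReceiveCount` failed receives, SQS moves the message to the DLQ automatically.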
Distributed Storage
Storage System Taxonomy
| System | Type | Consistency | Best for |
|---|---|---|---|
| Amazon S3 / GCS | Object store | Strong (since 2020) | Data lake, backups, static assets |
| HDFS | Distributed FS | Sequential | Large batch files (Spark, Hive) |
| Cassandra | Wide-column DB | Tunable (AP default) | Time-series, high write throughput |
| CockroachDB | Distributed SQL | Serializable | OLTP with geo-distribution |
| Google Spanner | Distributed SQL | External consistency | Global OLTP (finance, inventory) |
| Delta Lake / Iceberg | Table format (on S3) | ACID via OCC | Data lakehouse (batch + stream) |
LSM Trees vs B-Trees
Most distributed databases use one of these two storage engines under the hood:
# B-Tree: in-place updates, great for reads
# Pages (typically 4-16 KB) form a tree.
# Read: O(log n) page reads from root to leaf.
# Write: finds the leaf page, updates in-place. May require page split.
# Write amplification: ~10x (one logical write → ~10 disk writes with WAL + CoW)
# Good for: read-heavy OLTP, PostgreSQL, MySQL InnoDB
# LSM Tree (Log-Structured Merge): sequential writes, background compaction
# Writes go to in-memory memtable first (fast).
# Memtable flushed to immutable SSTable files (sorted, sequential I/O).
# Reads must check memtable + potentially many SSTables (bloom filters help).
# Background compaction merges SSTables — removes deleted keys, reduces file count.
# Write amplification: often ~30x across compaction rounds, higher than a B-tree,
#   but the I/O is sequential, so sustained write throughput is usually still higher
# Read amplification: worse without compaction, better with full compaction
# Good for: write-heavy workloads, Cassandra, RocksDB, LevelDB, HBase
# Practical implication for Cassandra:
# - Excellent for: IoT sensor writes, event logs, activity streams
# - Painful for: many small updates to same key (compaction pressure)
# - Use TTL to expire old data — keeps SSTables manageable
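The LSM read/write path above can be modeled in a few lines. A toy sketch with an in-memory memtable, immutable sorted runs, and naive full compaction (no WAL, no bloom filters; purely illustrative):

```python
class TinyLSM:
    """Memtable + immutable sorted runs; reads check newest data first."""
    def __init__(self, memtable_limit: int = 4):
        self.memtable: dict = {}
        self.sstables: list = []          # newest run first
        self.limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value        # fast in-memory write
        if len(self.memtable) >= self.limit:
            self.flush()

    def flush(self):
        # memtable becomes an immutable, sorted run (an "SSTable")
        self.sstables.insert(0, dict(sorted(self.memtable.items())))
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:          # newest data wins
            return self.memtable[key]
        for run in self.sstables:         # then each run, newest first
            if key in run:
                return run[key]
        return None

    def compact(self):
        # merge all runs; newer values shadow older ones
        merged = {}
        for run in reversed(self.sstables):   # apply oldest first
            merged.update(run)
        self.sstables = [dict(sorted(merged.items()))]
```

Overwriting a key leaves the stale copy in an older run until compaction removes it, which is the "compaction pressure" Cassandra suffers under many small updates to the same key.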
Choosing Storage for Data Pipelines
- Raw ingestion: S3/GCS (cheap, durable, scale-unlimited). Partition by date: s3://bucket/events/year=2026/month=02/day=23/
- Intermediate processing: Delta Lake or Iceberg tables on S3 — ACID transactions, schema evolution, time travel.
- Serving layer: Redshift / BigQuery / Snowflake for analytics; CockroachDB / Spanner for real-time serving; Redis for hot data.
- Stream state: Kafka (retain all events), RocksDB (local state in Flink/Kafka Streams).
Caching in Distributed Systems
Cache Patterns
# Cache-Aside (Lazy Loading) — most common pattern
# Application manages the cache explicitly.
# Pro: only requested data is cached, resilient to cache failure.
# Con: cache miss penalty (3 trips: cache miss → DB read → cache write).
def get_user(user_id: str) -> dict:
cached = redis.get(f"user:{user_id}")
if cached:
return json.loads(cached)
user = db.query("SELECT * FROM users WHERE id = %s", user_id)
redis.setex(f"user:{user_id}", 300, json.dumps(user)) # TTL 5 min
return user
# Write-Through — write to cache and DB synchronously
# Pro: cache always consistent with DB.
# Con: write latency doubles; caches data that may never be read.
def update_user(user_id: str, data: dict):
db.execute("UPDATE users SET ... WHERE id = %s", user_id)
redis.setex(f"user:{user_id}", 300, json.dumps(data))
# Write-Behind (Write-Back) — write to cache, async flush to DB
# Pro: very low write latency.
# Con: data loss if cache crashes before flush; complex implementation.
# Used by: database buffer pools (dirty pages), some write-heavy caches.
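A minimal write-behind sketch, assuming a single process and an injectable `db_writer` callback (a real implementation would batch, retry, and survive crashes via a persistent queue):

```python
import queue
import threading

class WriteBehindCache:
    """Writes land in memory immediately; a worker flushes to the DB later."""
    def __init__(self, db_writer):
        self.cache: dict = {}
        self.dirty: queue.Queue = queue.Queue()
        self.db_writer = db_writer            # e.g. lambda k, v: db.upsert(k, v)
        threading.Thread(target=self._flush_loop, daemon=True).start()

    def put(self, key, value):
        self.cache[key] = value               # low-latency write path
        self.dirty.put((key, value))          # queued for async persistence

    def _flush_loop(self):
        while True:
            key, value = self.dirty.get()     # blocks until work arrives
            self.db_writer(key, value)        # the actual DB write
            self.dirty.task_done()
```

Anything still sitting in `dirty` when the process dies is lost, which is exactly the trade-off described above.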
Cache Invalidation
Phil Karlton famously said there are only two hard problems in computer science: cache invalidation and naming things. The challenges:
- Stale reads: Cache entry is outdated. TTL mitigates but doesn't eliminate.
- Thundering herd (cache stampede): A popular key expires. Thousands of requests simultaneously miss the cache and hit the DB, overwhelming it.
- Cache inconsistency: DB updated but cache not invalidated (or vice versa). Race conditions in cache-aside are common.
# Cache stampede prevention: probabilistic early expiration
# (Optimal algorithm by Vattani et al. — no locking needed)
import math, random, time
def fetch_with_early_expiry(key: str, ttl: int, beta: float = 1.0):
"""
Recompute the cache entry slightly before it expires.
Higher beta = more aggressive early recomputation.
Distributes the recompute load over time instead of spiking at expiry.
"""
cached = redis.get(key)
if cached:
data, expiry, delta = json.loads(cached)
# Decide whether to recompute early
if time.time() - beta * delta * math.log(random.random()) < expiry:
return data
# Cache miss or early recompute decision
start = time.time()
data = compute_expensive_value(key)
delta = time.time() - start
expiry = time.time() + ttl
redis.setex(key, ttl, json.dumps([data, expiry, delta]))
return data
# Locking approach: only one request recomputes, others wait
# (in-process only; across instances you need a distributed lock, e.g. Redis SET NX)
import threading
_locks = {}
def fetch_with_lock(key: str, ttl: int):
cached = redis.get(key)
if cached:
return json.loads(cached)
lock = _locks.setdefault(key, threading.Lock())
with lock:
cached = redis.get(key) # double-check after acquiring lock
if cached:
return json.loads(cached)
data = compute_expensive_value(key)
redis.setex(key, ttl, json.dumps(data))
return data
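The in-process lock above doesn't help once the service runs on several instances. The usual fix is a Redis lock: `SET key token NX EX ttl` to acquire, compare-and-delete (a small Lua script) to release. The sketch below models those two primitives with an in-memory stand-in (`MiniKV`, purely illustrative) so the algorithm is visible without a server:

```python
import time
import uuid

class MiniKV:
    """In-memory stand-in for Redis SET NX EX and compare-and-delete."""
    def __init__(self):
        self.data = {}                        # key -> (value, expiry)

    def set_nx_ex(self, key, value, ttl):
        now = time.time()
        cur = self.data.get(key)
        if cur and cur[1] > now:              # key exists and hasn't expired
            return False
        self.data[key] = (value, now + ttl)
        return True

    def delete_if_value(self, key, value):
        cur = self.data.get(key)
        if cur and cur[0] == value:           # only the lock holder may release
            del self.data[key]
            return True
        return False

def acquire_lock(kv, resource: str, ttl: int = 10):
    """Returns a token on success, None if someone else holds the lock."""
    token = str(uuid.uuid4())                 # unique per holder
    return token if kv.set_nx_ex(f"lock:{resource}", token, ttl) else None
```

The unique token matters: it stops a holder whose lock already expired from deleting someone else's lock.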
Redis Cluster
# Redis Cluster: 16,384 hash slots distributed across nodes
# Keys map to slots: slot = CRC16(key) % 16384
# Each master node owns a range of slots (e.g., 0-5460, 5461-10922, 10923-16383)
# Each master has 1+ replicas.
# Create a 3-master, 3-replica cluster
redis-cli --cluster create \
127.0.0.1:7000 127.0.0.1:7001 127.0.0.1:7002 \
127.0.0.1:7003 127.0.0.1:7004 127.0.0.1:7005 \
--cluster-replicas 1
# Check cluster info
redis-cli -p 7000 cluster info
redis-cli -p 7000 cluster nodes
# Hash tags: {user_id}.profile and {user_id}.preferences
# map to the same slot (same node) — needed for multi-key ops
redis-cli -c -p 7000 set "{user_123}.profile" '{"name":"Alice"}'
redis-cli -c -p 7000 set "{user_123}.preferences" '{"theme":"dark"}'
Service Discovery & Load Balancing
Discovery Patterns
| Pattern | How it works | Examples |
|---|---|---|
| DNS-based | Service name resolves to VIP or pod IPs. K8s Services use this. | Kubernetes ClusterIP, AWS ALB, Route 53 |
| Client-side registry | Client queries registry for instance list, picks one, calls directly. | Eureka (Netflix OSS), Consul with client SDK |
| Server-side proxy | Client calls proxy/LB; proxy queries registry and routes. | Envoy, Nginx, AWS ALB, Istio |
How Kubernetes Service Discovery Works
# K8s Service: stable DNS name + VIP for a set of pods
apiVersion: v1
kind: Service
metadata:
name: payment-service
namespace: production
spec:
selector:
app: payment # routes to pods with this label
ports:
- port: 80
targetPort: 8080
type: ClusterIP # internal-only VIP
# DNS resolution inside the cluster:
# payment-service.production.svc.cluster.local → 10.96.43.21 (VIP)
# kube-proxy maintains iptables rules: VIP → random pod IP (DNAT)
# Headless Service (for stateful sets, direct pod addressing):
spec:
clusterIP: None # no VIP — DNS returns all pod IPs
# Useful for: Cassandra, Kafka, Redis Cluster (clients need all node IPs)
Load Balancing Algorithms
- Round-robin: Rotate through instances. Simple, ignores load.
- Least connections: Route to instance with fewest active requests. Better for variable-duration requests.
- Weighted round-robin: Higher-capacity nodes get proportionally more traffic.
- Random with two choices (Power of Two Choices): Pick two random instances, route to the less loaded. Near-optimal distribution with minimal coordination. Used by Nginx, Envoy.
- Consistent hash: Same client always hits same instance (sticky sessions without cookies). Useful for connection reuse, caching.
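Power of two choices is nearly a one-liner once you have per-backend load counts (the `active` map here is assumed to be maintained elsewhere, e.g. by the proxy):

```python
import random

def pick_backend(active: dict) -> str:
    """Power of two choices: sample two backends, route to the less loaded."""
    a, b = random.sample(list(active), 2)
    return a if active[a] <= active[b] else b
```

Sampling two and taking the minimum avoids both the herding of "always pick the global minimum" and the imbalance of pure random choice.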
gRPC Load Balancing Challenge
HTTP/1.1 load balancers route per-request. gRPC uses HTTP/2 which multiplexes many requests over a single long-lived TCP connection. A naive L4 load balancer will route all gRPC traffic for one client to one backend, bypassing load distribution entirely.
Solutions: (1) L7 load balancer that understands HTTP/2 (Envoy, Nginx with grpc_pass), (2) client-side load balancing with service discovery (grpc-go supports this natively), (3) headless Kubernetes service + client-side DNS round-robin.
Observability in Distributed Systems
The Three Pillars
- Logs: Discrete events with context. Good for debugging specific requests. Poor for aggregation (slow, expensive at scale).
- Metrics: Numerical time-series aggregates. Cheap, fast, great for alerting and dashboards. No per-request detail.
- Traces: End-to-end journey of a single request through multiple services. The only way to understand latency distribution across service boundaries.
Distributed Tracing with OpenTelemetry
# OpenTelemetry Python — automatic + manual instrumentation
# pip install opentelemetry-sdk opentelemetry-exporter-otlp
# pip install opentelemetry-instrumentation-fastapi
# pip install opentelemetry-instrumentation-httpx
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.instrumentation.httpx import HTTPXClientInstrumentor
# Setup — typically in app startup
provider = TracerProvider()
exporter = OTLPSpanExporter(endpoint="http://jaeger:4317")
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)
# Auto-instrument FastAPI (injects trace headers on all requests)
from fastapi import FastAPI
app = FastAPI()
FastAPIInstrumentor.instrument_app(app)
HTTPXClientInstrumentor().instrument() # auto-propagates headers on outbound calls
# Manual span for business operations
tracer = trace.get_tracer(__name__)
async def process_order(order_id: str):
with tracer.start_as_current_span("process_order") as span:
span.set_attribute("order.id", order_id)
span.set_attribute("order.source", "web")
with tracer.start_as_current_span("validate_inventory"):
await inventory_service.check(order_id) # child span
with tracer.start_as_current_span("charge_payment") as payment_span:
try:
charge_id = await payment_service.charge(order_id)
payment_span.set_attribute("payment.charge_id", charge_id)
except Exception as e:
payment_span.record_exception(e)
payment_span.set_status(trace.Status(trace.StatusCode.ERROR))
raise
Correlation IDs & Context Propagation
# W3C Trace Context — standard HTTP headers for trace propagation
# traceparent: 00-{trace_id}-{span_id}-{flags}
# tracestate: vendor-specific state
# Manual propagation if not using auto-instrumentation:
import httpx
from opentelemetry.propagate import inject, extract
# Outbound call — inject trace context into headers
def call_downstream(url: str):
headers = {}
inject(headers) # adds traceparent header
return httpx.get(url, headers=headers)
# Inbound call — extract and restore context
def handle_request(request_headers: dict):
ctx = extract(request_headers) # reads traceparent from headers
with trace.get_tracer(__name__).start_as_current_span(
"handle_request", context=ctx
):
pass # this span is now a child of the upstream span
# Correlation ID for logs (simpler than full tracing):
import uuid
CORRELATION_ID_KEY = "correlation_id"
def middleware(request, call_next):
correlation_id = request.headers.get("X-Correlation-ID", str(uuid.uuid4()))
# Store in context var, include in all log records
with log_context(correlation_id=correlation_id):
response = call_next(request)
response.headers["X-Correlation-ID"] = correlation_id
return response
Modern Data Engineering Patterns
Lambda vs Kappa Architecture
| | Lambda | Kappa |
|---|---|---|
| Batch layer | Yes — reprocesses all historical data periodically | No — stream only |
| Speed layer | Yes — real-time view of recent data | Yes — single stream pipeline |
| Serving layer | Merges batch + speed views | Single materialized view |
| Complexity | High — two code paths (batch & stream) to maintain | Lower — one code path |
| Reprocessing | Natural (batch layer rewrites) | Replay Kafka topic from start |
| Trend | Declining | Preferred for new systems |
Change Data Capture (CDC) with Debezium
CDC captures every row-level change in a database (insert/update/delete) and streams it as an event. The source database's transaction log is the authoritative event source — no application changes needed.
# CDC pipeline: Postgres → Debezium → Kafka → Data Warehouse
#
# [Postgres WAL] → [Debezium Connector] → [Kafka topic: prod.public.orders]
# |
# [Kafka Connect Sink]
# |
# [Snowflake / BigQuery]
#
# Debezium reads the Postgres WAL (write-ahead log) via logical replication.
# Each change becomes a Kafka message with before/after state, published to a topic
# named {database.server.name}.{schema}.{table} (here: prod.public.orders).
# 1. Enable logical replication in Postgres
# postgresql.conf:
wal_level = logical
max_replication_slots = 4
max_wal_senders = 4
# 2. Debezium connector config (Kafka Connect REST API)
curl -X POST http://localhost:8083/connectors -H 'Content-Type: application/json' -d '{
"name": "postgres-cdc",
"config": {
"connector.class": "io.debezium.connector.postgresql.PostgresConnector",
"database.hostname": "postgres",
"database.port": "5432",
"database.user": "debezium",
"database.password": "secret",
"database.dbname": "production",
"database.server.name": "prod",
"table.include.list": "public.orders,public.payments",
"plugin.name": "pgoutput",
"slot.name": "debezium_slot"
}
}'
# 3. Consume CDC events
kafka-console-consumer.sh \
--bootstrap-server localhost:9092 \
--topic prod.public.orders \
--from-beginning
# Each message: {"before": {...}, "after": {...}, "op": "u", "ts_ms": 1708723200000}
# op: c=create, u=update, d=delete, r=read (snapshot)
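Downstream, materializing current table state from this stream is a fold over the envelope's `op`/`before`/`after` fields (an `id` primary key is assumed for illustration):

```python
def apply_cdc(state: dict, event: dict) -> dict:
    """Fold one Debezium change event into a keyed snapshot of the table."""
    op = event["op"]
    if op in ("c", "r", "u"):            # create, snapshot read, update
        row = event["after"]
        state[row["id"]] = row           # upsert by primary key
    elif op == "d":                      # delete: 'after' is null
        state.pop(event["before"]["id"], None)
    return state
```

Replaying the topic from the beginning through this fold rebuilds the table, which is why CDC topics are often log-compacted by key.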
Event Sourcing + CQRS
Event Sourcing: Instead of storing current state, store the sequence of events that produced it. State is derived by replaying events.
- Audit trail: Perfect, complete history by design.
- Time travel: Replay to any point in time.
- Event-driven: Other services react to events.
- Complexity: Schema evolution is hard — old events must remain interpretable.
CQRS (Command Query Responsibility Segregation): Separate the write model (commands that change state) from the read model (queries that read materialized views). Naturally pairs with event sourcing: commands produce events; read models are projections built from events.
# CQRS + Event Sourcing pattern
from dataclasses import dataclass
from datetime import datetime
from typing import List
@dataclass
class OrderCreated:
order_id: str
user_id: str
items: list
created_at: datetime
@dataclass
class PaymentProcessed:
order_id: str
charge_id: str
amount: float
processed_at: datetime
@dataclass
class OrderShipped:
order_id: str
tracking_number: str
shipped_at: datetime
class Order:
"""Write model — derived from events."""
def __init__(self):
self.status = None
self.items = []
self.charge_id = None
@classmethod
def from_events(cls, events: List) -> 'Order':
order = cls()
for event in events:
order.apply(event)
return order
def apply(self, event):
if isinstance(event, OrderCreated):
self.status = "pending"
self.items = event.items
elif isinstance(event, PaymentProcessed):
self.status = "paid"
self.charge_id = event.charge_id
elif isinstance(event, OrderShipped):
self.status = "shipped"
# Read model projection (separate, denormalized, optimized for queries)
class OrderSummaryProjection:
"""Rebuilt by replaying all events from the event store."""
def handle(self, event):
if isinstance(event, OrderCreated):
db.upsert("order_summaries", {
"order_id": event.order_id,
"user_id": event.user_id,
"status": "pending",
"item_count": len(event.items)
})
elif isinstance(event, PaymentProcessed):
db.update("order_summaries",
{"status": "paid", "amount": event.amount},
where={"order_id": event.order_id})
Data Mesh
Data mesh (Zhamak Dehghani, 2019) is an organizational and architectural approach treating data as a product. Four principles:
- Domain-oriented ownership: The team that generates the data owns and serves it. No central data engineering team bottleneck.
- Data as a product: Each domain exposes discoverable, addressable, trustworthy, and self-describing data products.
- Self-serve data platform: Infrastructure abstractions so domain teams can build/deploy data products without platform expertise.
- Federated computational governance: Global policies (security, privacy) enforced across all products, implemented locally.
System Design Patterns Summary
Quick-reference for design interviews and architecture decisions:
| Pattern | Problem it solves | Key trade-off | When to reach for it |
|---|---|---|---|
| Sharding | Data too large for one node | Cross-shard queries become expensive | Table > ~1TB, write throughput saturated |
| Replication | Fault tolerance + read scaling | Consistency vs availability | Always — every production DB should have replicas |
| Consensus (Raft) | Nodes agree on a single value | Latency — requires quorum round-trip | Leader election, config store, distributed lock |
| Message Queue | Decouple producers from consumers, buffer load spikes | At-least-once delivery requires idempotency | Async workflows, fan-out, load leveling |
| Cache-Aside | Reduce DB load, lower latency | Cache invalidation, cold start | Hot read data with acceptable staleness |
| Saga | Distributed transaction across services | No isolation between steps — partial failures visible | Multi-service workflows needing rollback |
| CQRS | Read/write models differ significantly | Complexity, eventual consistency between models | High-traffic reads with complex denormalization needs |
| Event Sourcing | Need full audit trail, time travel | Schema evolution complexity, query difficulty | Financial ledgers, audit-critical domains |
| Circuit Breaker | Prevent cascading failures | False positives under transient load | Any synchronous service-to-service call |
| Bulkhead | Isolate failure domains | Resource overhead (separate thread pools) | Critical services must not be starved by non-critical |
| Sidecar | Add cross-cutting concerns (tracing, auth) without changing app code | Extra network hop, operational complexity | Service mesh (Envoy), secret injection |
| CDC (Debezium) | Sync DB state to other systems without polling | Requires WAL access, schema coupling | Real-time data pipelines, cache invalidation at scale |
Interview Quick-Reference
How to answer "design a distributed key-value store"
- Requirements: Read/write ratio? Consistency requirement (linearizable or eventual)? Scale (data size, QPS)?
- Partitioning: Consistent hashing, virtual nodes. Partition key = raw key hash.
- Replication: Leaderless (Dynamo-style) for availability, OR single-leader per partition for stronger consistency. Replication factor = 3.
- Consistency: Quorum reads/writes. W + R > N. Tunable per request.
- Conflict resolution: LWW with server-side timestamps (simple, but concurrent writes silently lose data), or vector clocks (detect concurrent writes so conflicts can be surfaced and merged).
- Failure detection: Gossip protocol for membership, phi-accrual for failure detection.
- Storage: LSM tree (RocksDB) for high write throughput.
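The partitioning bullet deserves a concrete shape. A sketch of consistent hashing with virtual nodes (MD5 here is an arbitrary choice for even spread, not a security decision):

```python
import bisect
import hashlib

class HashRing:
    """Consistent hashing with virtual nodes to smooth key distribution."""
    def __init__(self, nodes, vnodes: int = 100):
        points = []
        for node in nodes:
            for i in range(vnodes):               # each node appears vnodes times
                points.append((self._hash(f"{node}#{i}"), node))
        points.sort()
        self._hashes = [h for h, _ in points]     # sorted ring positions
        self._nodes = [n for _, n in points]

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def node_for(self, key: str) -> str:
        """Walk clockwise from the key's position to the first node point."""
        i = bisect.bisect_right(self._hashes, self._hash(key)) % len(self._hashes)
        return self._nodes[i]
```

Adding a node moves only roughly 1/N of the keys; with `hash(key) % N` placement, nearly all of them would move.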
How to answer "design a URL shortener at scale"
- Write path: Generate short code (base62 of a counter from a distributed ID generator — Snowflake), store in DB partitioned by short code.
- Read path: Dominant traffic. Cache-aside with Redis (TTL 24h). Cache hit rate > 99%.
- DB: Cassandra (hash-partitioned by short code, high read throughput) or Postgres with read replicas.
- Replication: Multi-region read replicas for geographic latency.
- Rate limiting: Token bucket per user_id stored in Redis.
- Analytics: Click events → Kafka → Spark/Flink → warehouse. Not in the hot path.
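The short-code step in the write path is plain base62 over the counter. A sketch:

```python
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

def encode62(n: int) -> str:
    """Base62-encode a non-negative counter into a short code."""
    if n == 0:
        return ALPHABET[0]
    out = []
    while n:
        n, r = divmod(n, 62)
        out.append(ALPHABET[r])
    return "".join(reversed(out))

def decode62(code: str) -> int:
    """Inverse of encode62: short code back to the counter value."""
    n = 0
    for ch in code:
        n = n * 62 + ALPHABET.index(ch)
    return n
```

Six characters cover 62^6, about 56 billion codes, which is why most designs cap codes at 6 or 7 characters.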
Consistency model cheat sheet for interviews
| If the interviewer asks about... | Answer |
|---|---|
| "User posts a comment and immediately reloads" | Read-your-writes guarantee (sticky to primary or use causal consistency) |
| "Two users update the same record simultaneously" | Optimistic concurrency (version numbers), or serializable transactions |
| "Counter increments (likes, views)" | Eventual consistency + idempotent deduplication (exact count not needed) |
| "Bank account balance" | Serializable (or strict serializable). No shortcuts. |
| "Shopping cart" | AP (eventual). Users can add items offline; merge on sync. Amazon Dynamo paper. |
| "Leader election" | Consensus (Raft/etcd). Must be CP. |