1. Data Contracts & Schema Governance

Schema Registry enforces structural compatibility. It does not protect you against the class of breakage that actually causes the most production incidents.

The Three Failure Modes

| Mode | Example | Caught by Schema Registry? |
|------|---------|----------------------------|
| Field rename | user_id → userId | Yes — breaks backward compatibility check |
| Field drop | Remove amount from payload | Yes — breaks backward compatibility check |
| Field repurpose | amount switches from dollars to cents; status gains 40 new enum values | No. Schema is still structurally valid. Downstream gets silently wrong numbers. |
The hardest breaks aren't structural — they're semantic
Semantic drift is the one most likely to corrupt production analytics quietly for days before anyone notices. A dashboard showing revenue 100x off because someone changed a units convention is harder to catch than a connector throwing a deserialization exception.

The Four-Layer Defense

Layer 1: Schema Registry with FULL compatibility

Use FULL compatibility mode — not BACKWARD alone. FULL means new schemas must be both backward-compatible (consumers using the new schema can still read data written with the old one) and forward-compatible (consumers still on the old schema can read data written with the new one). It catches renames and drops at producer deploy time, before a single message is published.

Use Avro or Protobuf — not raw JSON. JSON has no enforced schema at the wire level. If you must ingest third-party JSON, add a validation layer that routes non-conforming records to a dead-letter topic.

# Check compatibility before registering a new schema version
# Run this in producer CI — fail the build here, not in production
curl -X POST http://schema-registry:8081/compatibility/subjects/order-events-value/versions/latest \
  -H "Content-Type: application/vnd.schemaregistry.v1+json" \
  -d '{"schema": "..."}'
# {"is_compatible": false}  ← build fails, upstream team is notified

Layer 2: Consumer-Driven Contracts

Each downstream team declares which fields they depend on and what invariants they expect. Declarations live in a shared repo. The upstream team's CI validates against all registered consumer contracts before deploying. A breaking change fails the producer's pipeline — not the consumer's pipeline in production at 2am.

# contracts/analytics-platform.yaml
consumer: analytics-platform
depends_on:
  - topic: order-events
    fields:
      - name: order_id
        type: string
        required: true
      - name: total_usd
        type: number
        required: true
        constraints: { min: 0, max: 1000000 }
      - name: status
        type: string
        allowed_values: [pending, confirmed, shipped, cancelled]
        # Any new status value must be announced before adding to producer
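What the producer-side CI check could look like is sketched below, assuming the contracts are parsed from YAML files like the one above into dicts. The helper name and the violation-message format are illustrative, not a real tool:

```python
# Hypothetical CI gate: validate the producer's current schema against every
# registered consumer contract. Fails the producer's build, not the consumer's.

def validate_contract(producer_fields: dict, contract: dict) -> list:
    """Return a list of violations; an empty list means the contract holds."""
    violations = []
    for dep in contract["depends_on"]:
        for field in dep["fields"]:
            name = field["name"]
            if name not in producer_fields:
                violations.append(f"{contract['consumer']}: missing field '{name}'")
            elif producer_fields[name] != field["type"]:
                violations.append(
                    f"{contract['consumer']}: '{name}' is "
                    f"{producer_fields[name]}, contract expects {field['type']}"
                )
    return violations

# Producer's current schema (normally extracted from the Avro/Protobuf definition)
producer_fields = {"order_id": "string", "total_usd": "number", "status": "string"}

contract = {
    "consumer": "analytics-platform",
    "depends_on": [{
        "topic": "order-events",
        "fields": [
            {"name": "order_id", "type": "string"},
            {"name": "total_usd", "type": "number"},
            {"name": "status", "type": "string"},
        ],
    }],
}

assert validate_contract(producer_fields, contract) == []

# Dropping a field the contract depends on fails the producer's build:
del producer_fields["total_usd"]
assert validate_contract(producer_fields, contract) == [
    "analytics-platform: missing field 'total_usd'"
]
```

In practice the loop runs over every file in the contracts repo, so the blast radius of a change is visible before deploy.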

Layer 3: Statistical Distribution Monitoring

The only defense against semantic drift. Track value distributions over time: cardinality, null rate, min/max, value frequency. Use Population Stability Index (PSI) to detect when a distribution shifts. A PSI > 0.2 on a field signals a meaningful change — investigate before your dashboards notice.

# Daily field profile check — run as scheduled Spark job
from pyspark.sql import functions as F

def field_profile(df, field: str) -> dict:
    return df.agg(
        F.count(F.lit(1)).alias("count"),  # total rows, including nulls
        F.countDistinct(field).alias("cardinality"),
        F.sum(F.when(F.col(field).isNull(), 1).otherwise(0)).alias("null_count"),
        F.min(field).alias("min_val"),
        F.max(field).alias("max_val"),
    ).collect()[0].asDict()

yesterday = spark.read.parquet("s3://lake/order-events/date=2024-01-14/")
today     = spark.read.parquet("s3://lake/order-events/date=2024-01-15/")

for field in ["total_usd", "status", "customer_id"]:
    prev = field_profile(yesterday, field)
    curr = field_profile(today, field)

    cardinality_ratio = abs(curr["cardinality"] - prev["cardinality"]) / max(prev["cardinality"], 1)
    null_delta = abs(curr["null_count"] / max(curr["count"], 1) - prev["null_count"] / max(prev["count"], 1))

    if cardinality_ratio > 0.5:
        alert(f"WARN: {field} cardinality changed {prev['cardinality']} -> {curr['cardinality']}")
    if null_delta > 0.05:
        alert(f"WARN: {field} null rate shifted by {null_delta:.1%}")
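The PSI itself is a few lines of arithmetic. Here is a minimal sketch, assuming you have already binned the field's values for both periods; the bin edges and the 0.2 threshold are conventions, not hard rules:

```python
import math

def psi(expected_freqs, actual_freqs, eps=1e-4):
    """Population Stability Index between two binned distributions.

    Inputs are raw counts per bin (same bin edges for both periods).
    Rough convention: < 0.1 stable, 0.1-0.2 moderate shift, > 0.2 investigate.
    """
    exp_total = sum(expected_freqs)
    act_total = sum(actual_freqs)
    score = 0.0
    for e, a in zip(expected_freqs, actual_freqs):
        e_pct = max(e / exp_total, eps)   # clamp to avoid log(0) on empty bins
        a_pct = max(a / act_total, eps)
        score += (a_pct - e_pct) * math.log(a_pct / e_pct)
    return score

# Identical shapes, different volumes: PSI stays near zero
assert psi([100, 200, 300], [10, 20, 30]) < 0.01

# A units change (dollars -> cents) pushes mass into the top bins: PSI explodes
assert psi([500, 400, 100], [50, 100, 850]) > 0.2
```

The second assertion is exactly the dollars-to-cents scenario from the failure-modes table: structurally valid, semantically broken, and PSI catches it.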

Layer 4: Organizational Policy

Technology alone is not enough. The organizational layer is what actually prevents incidents: a 4-week deprecation window before any field removal, a dedicated #data-schema-changes Slack channel with mandatory announcements, declared topic ownership in the schema registry, and a dependency registry so producers can estimate blast radius before making any change.

2. Kafka in Production

Self-Managed vs MSK: How to Decide

| Choice | Best Fit | The Real Trade-off |
|--------|----------|--------------------|
| Self-managed (EC2 + Docker) | Cost-sensitive at scale (>10 TB/month), need custom tuning, strong infra team | You own upgrades, disk management, and 3am pages |
| Amazon MSK | AWS-native teams, want managed broker lifecycle | Less config flexibility; storage auto-scaling has lag — don't rely on it alone |
| Confluent Cloud | Schema Registry + ksqlDB + Connect as managed services, time-to-value matters | Most expensive; vendor lock-in on Confluent-specific features |
| Kubernetes (Strimzi) | Existing K8s infra, GitOps-driven config | Persistent volume management on K8s is genuinely painful; network complexity |

EBS Volume Management

Kafka is I/O-bound. The disk is the bottleneck in almost every production issue. Get this right upfront.

War Story: The Disk-Full Cascade

A high-volume topic on broker 3 hit 92% disk. The broker stopped accepting writes. Producers, configured with retries, immediately began retrying to brokers 1 and 2. Within 4 minutes, brokers 1 and 2 hit their own I/O saturation limits — not disk, but IOPS — and began experiencing increased latency. Consumer lag on all topics spiked. The monitoring alert for "disk > 75%" had been firing for 6 hours but was in a silenced maintenance window that nobody had cleared.

Fix: Reduced retention on the offending topic to 1 day (from 7), which freed 180 GB immediately. Then expanded the EBS volume online (xfs_growfs requires no downtime). Added an explicit alert that fires even during maintenance windows for any disk > 90%.

# Immediate response when disk is filling fast:

# 1. Reduce retention on high-volume topics (buys time, takes effect within minutes)
kafka-configs.sh --bootstrap-server broker1:9092 \
  --entity-type topics --entity-name high-volume-events \
  --alter --add-config retention.ms=86400000    # drop to 1 day

# 2. Expand EBS volume online — no downtime required on XFS
aws ec2 modify-volume --volume-id vol-abc123 --size 1000
sudo xfs_growfs /data/kafka    # resize the live filesystem

# 3. Check which topics are consuming the most space
du -sh /data/kafka/*/  | sort -rh | head -20

Rolling Upgrades

Never restart all brokers simultaneously. The procedure: restart one broker, wait for it to rejoin the ISR and for under-replicated partitions to return to zero, then move to the next broker. With min.insync.replicas=2 and replication.factor=3, the cluster tolerates one broker down with no data loss and no write stall.

for BROKER_HOST in broker1 broker2 broker3; do
  ssh $BROKER_HOST "docker compose stop kafka"
  ssh $BROKER_HOST "docker compose up -d kafka"
  sleep 30    # give the broker time to start rejoining the ISR

  echo "Waiting for under-replicated partitions to clear before the next broker..."
  while true; do
    URP=$(kafka-topics.sh --bootstrap-server broker1:9092 \
      --describe --under-replicated-partitions 2>/dev/null | wc -l)
    [ "$URP" -eq 0 ] && break
    echo "  $URP URP remaining..." && sleep 10
  done
done

# Rebalance leadership after all restarts
kafka-leader-election.sh --bootstrap-server broker1:9092 \
  --election-type PREFERRED --all-topic-partitions

Consumer Lag: Monitor the Trend, Not the Number

An absolute lag of 50,000 messages is fine for a batch job processing daily snapshots. The same lag on a payment processing consumer is a crisis. What matters is whether lag is growing. A lag that stays flat at 10,000 is stable. A lag that grew from 100 to 1,000 over 15 minutes needs investigation now.

# Quick CLI check
kafka-consumer-groups.sh --bootstrap-server broker1:9092 \
  --describe --group my-consumer-group
# Focus on: is LAG stable? or growing?

# Alert rule (Prometheus/Grafana)
# Alert when: rate of change of consumer lag > 0 over 15-minute window
# Not: when absolute lag > N (this threshold is workload-dependent)
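The trend rule can be made concrete with a small slope check over sampled lag values. This is a sketch with illustrative sampling intervals and no particular metrics backend assumed:

```python
# "Alert on trend, not absolute value": fit the slope of (timestamp, lag)
# samples and alert when lag is consistently growing.

def lag_slope(samples):
    """Least-squares slope of lag over time, in messages per second of growth."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_l = sum(l for _, l in samples) / n
    num = sum((t - mean_t) * (l - mean_l) for t, l in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    return num / den

# Lag flat at 10,000 over 15 minutes (one sample per minute): stable, no alert
flat = [(t, 10_000) for t in range(0, 900, 60)]
assert abs(lag_slope(flat)) < 1e-9

# Lag grew 100 -> 1,000 over the same window: positive slope, alert now
growing = [(t, 100 + t) for t in range(0, 900, 60)]
assert lag_slope(growing) > 0
```

The same idea expressed as a Prometheus rule is a `deriv()` or `rate()` over the lag metric; the threshold on the slope, unlike the threshold on absolute lag, transfers across workloads.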

Debezium: The Snapshot Bottleneck

When Debezium starts on a new table — or when you add a table to an existing connector — it runs a full SELECT * snapshot. On a 500 GB table, this runs for hours and floods the cluster. The default Kafka Connect producer settings are tuned for low-latency streaming, not bulk throughput. You must override them for the snapshot phase.
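One hedged sketch of what those overrides could look like, using Kafka Connect's per-connector `producer.override.*` mechanism. The values are illustrative starting points, not tuned numbers, and the Connect worker must permit overrides via connector.client.config.override.policy=All:

```python
# Connector config with snapshot-friendly producer settings (illustrative values).
# Requires connector.client.config.override.policy=All on the Connect worker.

snapshot_tuned_config = {
    "name": "postgres-cdc",
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "tasks.max": "1",                                # Debezium Postgres is always 1 task
        # Bulk-throughput producer settings for the snapshot phase:
        "producer.override.batch.size": "524288",        # 512 KB batches
        "producer.override.linger.ms": "100",            # wait to fill batches
        "producer.override.compression.type": "lz4",
        "producer.override.buffer.memory": "134217728",  # 128 MB
    },
}

# Submitted via the Connect REST API, e.g.:
# curl -X PUT http://connect:8083/connectors/postgres-cdc/config -d '...'
```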

Things that will bite you with Debezium on Postgres
  • WAL retention from abandoned slots: An idle Debezium connector holds a replication slot open. Postgres retains WAL indefinitely for that slot. If the slot lags far enough, it can fill the disk and make Postgres go read-only. Set max_slot_wal_keep_size = '50GB' as a safety cap and monitor pg_replication_slots.
  • TOAST columns show as NULL on UPDATE: Postgres doesn't include unchanged TOAST (large text/jsonb) columns in WAL. Set REPLICA IDENTITY FULL on tables with TOAST columns you need in your CDC stream.
  • Always 1 task: Debezium Postgres reads a single WAL stream. tasks.max must be 1. Setting it higher does nothing except confuse you.
  • Run the snapshot against a replica: The initial SELECT * can saturate the primary's I/O. Point Debezium at a read replica for the snapshot phase.
# Monitor replication slot health — run this daily
psql -c "
  SELECT slot_name,
         active,
         pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS wal_lag
  FROM pg_replication_slots;
"
# wal_lag should stay well under 1 GB. If it's growing, Debezium is lagging or dead.

MirrorMaker 2: Cross-Team Isolation

MM2's most underrated use case is not disaster recovery — it's team isolation. When Team B needs a subset of Team A's topics, direct cross-cluster consumers create tight operational coupling. A misbehaving consumer in Team B's service can affect Team A's cluster performance. MM2 mirrors specific topics into Team B's own cluster. Team B consumers talk only to their own cluster. The blast radius of Team B's incidents stays within Team B.

Topic naming: do not use IdentityReplicationPolicy in bidirectional setups
The default policy prefixes mirrored topics with the source cluster alias (source.team-a.events). This prevents infinite replication loops. IdentityReplicationPolicy removes the prefix but causes infinite loops in any active-active configuration. Only use it for one-way migrations where you're certain replication will never become bidirectional.
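A sketch of topic-scoped mirroring for the team-isolation use case, in MM2's connector mode. Cluster aliases, bootstrap servers, and topic names are all illustrative:

```python
# MirrorSourceConnector config mirroring only the topics Team B needs,
# keeping the default replication policy (and therefore the alias prefix).

mm2_config = {
    "name": "mirror-team-a-to-team-b",
    "config": {
        "connector.class": "org.apache.kafka.connect.mirror.MirrorSourceConnector",
        "source.cluster.alias": "team-a",
        "target.cluster.alias": "team-b",
        "source.cluster.bootstrap.servers": "team-a-broker1:9092",
        "target.cluster.bootstrap.servers": "team-b-broker1:9092",
        # Mirror only what Team B actually consumes:
        "topics": "orders,shipments",
        # Default policy kept: topics land as team-a.orders, team-a.shipments,
        # which is what prevents replication loops if mirroring ever goes both ways.
    },
}
```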

3. Spark in Production

Platform Decision Matrix

| Platform | Best For | Avoid If |
|----------|----------|----------|
| EMR (AWS) | S3-heavy pipelines, AWS-native teams, spot instances for 60–80% cost savings | Jobs need <5 min startup; many small frequent jobs (spin-up cost dominates) |
| Databricks | Fastest time-to-value, Delta Lake, Unity Catalog, notebook-first teams | Tight budget; need full infra control; already have a Spark cluster |
| Kubernetes | Multi-cloud, containerized workloads, existing K8s infra | Small team without dedicated K8s expertise; PV management will cost you time |
| YARN on-prem | Existing Hadoop cluster, data that can't leave your data center | Starting fresh — Hadoop's operational burden is high for the marginal control gained |

Packaging Pitfalls

The most common reason a Spark job works locally and fails on the cluster is a packaging problem.

Never bundle Spark's own JARs in your uber-jar
Declare spark-sql, spark-core, etc. as <scope>provided</scope> in Maven (or compileOnly in Gradle). The cluster already has them. Bundling them causes classloader conflicts that produce cryptic errors at runtime, not at build time.

The Small Files Problem

Every Spark task writes one output file per partition. A job with 1,000 partitions that each write 512 KB produces 1,000 files totaling 500 MB. The next job reading that output spawns 1,000 tasks for 500 MB of data — the task scheduling overhead dominates actual work. This compounds over time: a daily job creating 1,000 small files means 365,000 files after a year.

# Coalesce before write — fewer output files, no full shuffle
# Target: 128-512 MB per output file (for the 500 MB example above, that's 1-4 files)
df.coalesce(4).write.mode("overwrite").parquet("s3://bucket/output/")

# Do NOT use repartition() here — it triggers a full shuffle just to reduce file count
# coalesce() merges existing partitions without a shuffle (but it also caps the
# parallelism of the writing stage, so don't coalesce further than you need)

# For Delta Lake / Iceberg: run OPTIMIZE periodically instead of coalescing
# OPTIMIZE rewrites small files into larger ones without reprocessing the data
# spark.sql("OPTIMIZE delta.`s3://bucket/output/`")
Adaptive Query Execution handles this automatically in Spark 3.x
Enable spark.sql.adaptive.coalescePartitions.enabled=true and AQE will merge small shuffle partitions at runtime. For file output, you still need to coalesce explicitly — AQE does not control the number of output files.

Broadcast Join for Small Dimension Tables

A sort-merge join on a large fact table against a small dimension table (country codes, product catalog, user segments) shuffles the large table across the network unnecessarily. If the dimension table fits in memory (under 100–200 MB), broadcast it. Each executor gets a local copy; no shuffle needed on the fact table side.

from pyspark.sql.functions import broadcast

# Force a broadcast join for the small dimension table
result = df_orders.join(broadcast(df_country_codes), "country_id")

# Raise the auto-broadcast threshold if your dimension tables are a bit larger
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 100 * 1024 * 1024)  # 100 MB

# SQL hint syntax
spark.sql("""
    SELECT /*+ BROADCAST(c) */ o.*, c.country_name
    FROM orders o JOIN country_codes c ON o.country_id = c.id
""")

Memory Tuning

OOM errors in Spark almost always have one of three causes: too many partitions per executor, a broadcast variable that's too large for the driver, or a collect() that pulls everything to the driver JVM.

| Parameter | Rule of Thumb | Common Mistake |
|-----------|---------------|----------------|
| spark.executor.memory | 4–18 GB. 5 cores per executor is a sweet spot for parallelism vs overhead. | Setting it too high leaves no memory for the OS and causes container eviction |
| spark.executor.memoryOverhead | 10% of executor memory, min 384 MB. Increase for PySpark (Python process runs outside JVM) | Ignoring it entirely. PySpark OOMs are often overhead exhaustion, not JVM heap |
| spark.driver.memory | 4–8 GB. Only needs to be large if you collect() data or broadcast large tables | Setting driver memory the same as executor memory "just to be safe" — wasteful |
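The rules of thumb above reduce to back-of-envelope arithmetic. A sketch for a hypothetical 64 GB / 16-core worker node:

```python
# Executor sizing from node specs: 5 cores per executor, ~10% memoryOverhead,
# with a slice reserved for the OS and node daemons (all numbers illustrative).

node_mem_gb, node_cores = 64, 16
os_reserved_gb, os_reserved_cores = 4, 1

executors_per_node = (node_cores - os_reserved_cores) // 5          # 3 executors
mem_per_executor_gb = (node_mem_gb - os_reserved_gb) / executors_per_node  # 20 GB each
heap_gb = int(mem_per_executor_gb / 1.10)       # ~10% of the slice goes to overhead
overhead_gb = int(mem_per_executor_gb) - heap_gb

print(f"--executor-cores 5 --executor-memory {heap_gb}g "
      f"--conf spark.executor.memoryOverhead={overhead_gb}g")
# 3 executors per node, each with an 18 GB heap and 2 GB overhead
```

Skipping the OS reservation or the overhead slice is exactly how you end up with the "container eviction" failure from the table.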

4. Airflow in Production

DAG Design Patterns That Hold Up

Anti-Pattern: The Data Request Team

In a previous platform, the data engineering team manually fielded every analytics request by writing a new DAG. Within 6 months, the team had 400+ DAGs with half owned by people who had left. Backfill requests took 2 weeks because someone had to track down the DAG author. The scheduler was overloaded with zombie DAGs that nobody dared delete.

Lesson: Build self-service data access early — parameterized DAG templates, a simple UI to trigger backfills, and a clear ownership model. Every DAG needs an owner in the code. Ownerless DAGs should not be schedulable.
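One way to make "ownerless DAGs are not schedulable" concrete is a small factory gate, sketched here in pure Python. The team roster and naming are project conventions, not Airflow built-ins; the returned kwargs would be passed to airflow.DAG:

```python
# Ownership gate for DAG creation: raising during module import makes the DAG
# file unparseable, so an ownerless DAG never reaches the scheduler.

TEAM_ROSTER = {"data-platform", "analytics", "growth"}  # hypothetical active teams

def dag_kwargs(dag_id: str, owner: str, **kwargs) -> dict:
    """Build the kwargs for a DAG, refusing any DAG without a valid owner."""
    if owner not in TEAM_ROSTER:
        raise ValueError(f"DAG {dag_id}: owner '{owner}' is not an active team")
    default_args = dict(kwargs.pop("default_args", {}), owner=owner)
    return {"dag_id": dag_id, "default_args": default_args, **kwargs}

# In a real DAG file:
# dag = DAG(**dag_kwargs("daily_orders", owner="analytics", schedule="@daily"))
```

Checking the owner against an active roster (rather than any non-empty string) is what catches the "owned by people who had left" failure mode.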

Alerting: SLA Miss vs Task Failure

These are different failure modes requiring different responses. A task failure means something is broken — the code, an upstream API, a database connection. An SLA miss means the pipeline ran, possibly succeeded, but took longer than expected. Conflating them means your on-call team gets paged for SLA misses on low-priority reporting DAGs and ignores them, then misses a real task failure on a critical pipeline.
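A sketch of routing the two failure modes to different channels. The channel names and tiers are illustrative; in Airflow this logic would hang off each task's on_failure_callback and the DAG's sla_miss_callback:

```python
# Task failures and SLA misses must not share a pager. Route by event type
# and DAG criticality tier (both dimensions are project conventions).

ROUTING = {
    ("task_failure", "critical"): "pagerduty",   # wake someone up
    ("task_failure", "low"):      "slack",       # fix during business hours
    ("sla_miss",     "critical"): "slack",       # it ran, but investigate today
    ("sla_miss",     "low"):      "digest",      # daily summary email
}

def route_alert(event: str, dag_tier: str) -> str:
    return ROUTING[(event, dag_tier)]

assert route_alert("task_failure", "critical") == "pagerduty"
assert route_alert("sla_miss", "critical") == "slack"
```

The table is the point: only one of the four cells pages anyone, which is what keeps the on-call team from tuning out alerts.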

Backfill Strategy

Design for backfill from day one. When a bug is fixed or a new field is added, you will need to re-process historical data. If your DAG reads from Kafka and the data is no longer retained, backfill is impossible. If your task hardcodes TODAY instead of the execution date, every historical run produces today's data.

# Trigger a backfill for a date range
airflow dags backfill --start-date 2024-01-01 --end-date 2024-01-31 my_pipeline_dag

# Backfill runs are marked as such — they don't trigger downstream sensors
# Run with --dry-run first to see what will execute
airflow dags backfill --start-date 2024-01-01 --end-date 2024-01-31 --dry-run my_pipeline_dag

# If your tasks are idempotent, backfill is safe to run anytime
# If they're not — fix them before you need to backfill under pressure
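The "hardcoded TODAY" bug, sketched outside Airflow. In a real DAG the logical date would come from the {{ ds }} template or the task context rather than being passed in directly; paths and names are illustrative:

```python
from datetime import date

def output_path_broken() -> str:
    # Every historical re-run writes today's partition — backfill is impossible
    return f"s3://lake/orders/date={date.today().isoformat()}/"

def output_path(logical_date: date) -> str:
    # The run for Jan 5 always reads and writes Jan 5, no matter when it executes
    return f"s3://lake/orders/date={logical_date.isoformat()}/"

assert output_path(date(2024, 1, 5)) == "s3://lake/orders/date=2024-01-05/"
```

Combined with mode("overwrite") on the partition, this makes every run idempotent: re-running January 5 replaces January 5 and touches nothing else.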

5. Data Quality & Observability

Nobody knows data is bad until a business user spots a wrong number in a dashboard. By then, the problem is usually days old. Build observability into the pipeline, not as an afterthought.

The Minimum Viable Check Set

| Check | What It Catches | Alert Threshold |
|-------|-----------------|-----------------|
| Freshness | Source data is stale — upstream job failed silently or stopped producing | Latest record timestamp > expected interval + buffer |
| Row count | Complete data loss, partial load, truncation, dedup failure | Today's count < 80% or > 200% of 7-day average |
| Null rate | Required field going null — upstream schema change, ETL bug | Null rate for required field > 1% |
| Referential integrity | Orphaned records — order with no matching customer | Unmatched rate > 0.1% (tune to your data's real orphan rate) |
| Distribution drift (PSI) | Semantic changes, unit switches, enum expansion | PSI > 0.2 on key fields |
Monitor the monitor
Your data quality checks are themselves a pipeline. They can fail silently — a check job that errors out without alerting is worse than no check at all, because it gives false confidence. Track whether each check ran as a separate metric. Alert if a check hasn't produced a result in its expected window.
-- Freshness check: has the source updated in the last 2 hours?
SELECT
  MAX(event_time) AS latest_event,
  CASE
    WHEN MAX(event_time) < NOW() - INTERVAL '2 hours' THEN 'STALE'
    ELSE 'OK'
  END AS freshness_status
FROM order_events;

-- Row count check: compare today vs 7-day average
WITH daily_counts AS (
  SELECT DATE(event_time) AS dt, COUNT(*) AS row_count
  FROM order_events
  WHERE event_time >= NOW() - INTERVAL '8 days'
  GROUP BY 1
),
baseline AS (
  SELECT AVG(row_count) AS avg_count
  FROM daily_counts WHERE dt < CURRENT_DATE
)
SELECT
  d.dt,
  d.row_count,
  b.avg_count,
  d.row_count / NULLIF(b.avg_count, 0) AS ratio,
  CASE
    WHEN d.row_count / NULLIF(b.avg_count, 0) < 0.8 THEN 'LOW'
    WHEN d.row_count / NULLIF(b.avg_count, 0) > 2.0 THEN 'HIGH'
    ELSE 'OK'
  END AS status
FROM daily_counts d, baseline b
WHERE d.dt = CURRENT_DATE;
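The "monitor the monitor" idea sketched in a few lines: each check writes a heartbeat when it completes, and a separate alarm lists checks that have gone silent. Check names and windows are illustrative:

```python
import time

# Expected reporting window per check, in seconds (illustrative)
EXPECTED_WINDOW_S = {"freshness": 3_600, "row_count": 86_400, "psi": 86_400}
last_run = {}

def record_heartbeat(check: str) -> None:
    """Called at the end of every successful check run."""
    last_run[check] = time.time()

def silent_checks(now=None) -> list:
    """Checks that have not reported within their window — including never."""
    now = now or time.time()
    return [c for c, window in EXPECTED_WINDOW_S.items()
            if now - last_run.get(c, 0.0) > window]

record_heartbeat("freshness")
record_heartbeat("row_count")
# "psi" never ran — it must surface as silent, not be silently missing
assert silent_checks() == ["psi"]
```

The crucial property is that a check that has *never* run shows up, which is exactly the case where a broken check job gives false confidence.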

6. Storage & Table Formats

The Raw Parquet Trap

Starting with raw Parquet files seems simple. After 12 months of production, you typically have these problems:
  • No atomic writes — a failed or concurrent job leaves readers seeing partial output
  • No schema evolution — adding or renaming a column means rewriting files or breaking readers
  • No row-level deletes or updates — a single deletion request means rewriting whole partitions
  • Thousands of small files from streaming and incremental writers, with no built-in compaction
  • No time travel — you cannot reproduce last week's report against last week's data

Adopt Iceberg or Delta Lake from day one
These are not premature optimization — they solve problems you will encounter within the first year of production. Migrating from raw Parquet to Iceberg later is a significant project; adopting it on day one costs almost nothing.

Partition Strategy

Partition on fields that appear in WHERE clauses of your most common queries. Date is almost always one of them. For multi-tenant systems, date + tenant_id lets you efficiently answer "show me all events for tenant X in January" without scanning all tenants.
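What that layout buys you, sketched without a cluster: the partitionBy call is shown in a comment, and the function below just mimics the directory-name pruning a query engine performs (paths and names are illustrative):

```python
# df.write.partitionBy("event_date", "tenant_id").parquet("s3://lake/events/")
# produces paths like s3://lake/events/event_date=2024-01-15/tenant_id=tenant-x/

def pruned_paths(partitions, date_prefix, tenant):
    """Partitions a 'tenant X in month Y' query actually has to read."""
    return [p for p in partitions
            if p["event_date"].startswith(date_prefix) and p["tenant_id"] == tenant]

partitions = [
    {"event_date": "2024-01-15", "tenant_id": "tenant-x"},
    {"event_date": "2024-01-15", "tenant_id": "tenant-y"},
    {"event_date": "2024-02-01", "tenant_id": "tenant-x"},
]

# "All events for tenant-x in January" touches one directory, not three
assert pruned_paths(partitions, "2024-01", "tenant-x") == [
    {"event_date": "2024-01-15", "tenant_id": "tenant-x"}
]
```

The engine does this pruning from the paths alone, before opening a single file — which is why partition columns must match your WHERE clauses to pay off.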

Compaction

Streaming jobs write many small files — one per micro-batch per partition. Without a compaction job, read performance degrades over time as the file count grows. Run a weekly or daily compaction job that rewrites small files into larger ones (target: 128–512 MB per file).

-- Delta Lake: OPTIMIZE rewrites small files, Z-ORDER clusters related data
OPTIMIZE delta.`s3://bucket/events/`
ZORDER BY (user_id, event_type);

-- Iceberg: rewrite data files smaller than 128 MB
CALL catalog.system.rewrite_data_files(
  table => 'db.events',
  options => map('target-file-size-bytes', '134217728',
                 'min-file-size-bytes', '33554432')
);

7. Operational Playbooks

These are the incidents that repeat. Having a written playbook — even a short one — cuts mean time to resolution by at least 50%.

Kafka Broker Disk Full (P0)
Symptom: Producers returning NOT_ENOUGH_REPLICAS or LEADER_NOT_AVAILABLE. Broker logs show disk write errors. Consumer lag spiking on all topics.
Diagnosis: SSH to the broker. Run df -h /data/kafka. Check which topics are largest: du -sh /data/kafka/*/ | sort -rh | head -10.
Fix: Immediately reduce retention on the largest topic: kafka-configs.sh --alter --add-config retention.ms=86400000. Wait 5–10 minutes for log cleanup to run. Then expand EBS: aws ec2 modify-volume --size NEW_SIZE + sudo xfs_growfs /data/kafka.
Prevention: Alert at 75%. Never silence disk alerts. Size volumes for daily_ingest_GB × retention_days × replication_factor × 1.3. Use tiered storage (Kafka 3.6+) to offload cold segments to S3.
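The sizing formula from the Prevention step, worked through for a hypothetical cluster:

```python
# daily_ingest_GB x retention_days x replication_factor x headroom
daily_ingest_gb = 100       # illustrative cluster
retention_days = 7
replication_factor = 3
headroom = 1.3              # index files, segment overhead, growth buffer

required_gb = daily_ingest_gb * retention_days * replication_factor * headroom
assert round(required_gb) == 2730   # ~2.7 TB across the cluster
```

Note the replication factor: people routinely size for the logical data volume and forget every byte is stored three times.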
Debezium Replication Slot Growing (P1)
Symptom: Postgres WAL directory growing. pg_replication_slots shows large lag. Possible Postgres disk full alert.
Diagnosis: SELECT slot_name, active, pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) FROM pg_replication_slots;. Check Kafka Connect status: curl http://connect:8083/connectors/postgres-cdc/status | jq .
Fix: If the connector is running but lagging: scale up Connect workers, check consumer lag on the downstream Kafka topic. If the connector is stopped/failed: restart it. If the slot is stale and safe to drop: SELECT pg_drop_replication_slot('slot_name'); — the connector will re-snapshot on restart.
Prevention: Set max_slot_wal_keep_size = '50GB' in Postgres config. Alert on pg_replication_slots lag > 10 GB. Never let a Debezium connector sit stopped for more than a few hours on a high-write database.
Spark Job OOM (P1)
Symptom: Executor lost with ExecutorLostFailure (executor X exited caused by one of the running tasks). YARN or K8s killed the container. Stage keeps retrying.
Diagnosis: Open the Spark UI, go to the failing stage, click a failed task. Check GC time — if it's >20% of task time, the heap is thrashing. Check whether it's a shuffle read stage (data skew) or a broadcast stage (broadcast too large).
Fix: For heap OOM: increase spark.executor.memory or reduce spark.executor.cores (fewer concurrent tasks per executor means less heap pressure). For PySpark OOM: increase spark.executor.memoryOverhead. For skew OOM: salt the join key or use AQE skew join handling.
Prevention: Enable AQE. Profile with the Spark UI before tuning blindly. Test with production-scale data — staging datasets rarely expose skew or memory issues.
Airflow DAG Stuck in Running (P1)
Symptom: DAG run shows "running" for far longer than normal. No tasks failing, but no tasks completing either. Downstream DAGs waiting on a sensor are also stuck.
Diagnosis: Check the scheduler logs for the stuck task. Check worker logs on the Celery/K8s executor. Common causes: zombie task (worker died, task still marked running), deadlocked sensor (polling a condition that will never be true), or a resource lock (slot limit reached, all slots taken by stuck tasks).
Fix: For zombie tasks: mark the task as failed in the Airflow UI, then clear and re-run. For deadlocked sensors: check the condition the sensor is polling — if the upstream data will never arrive, kill the sensor and investigate the upstream pipeline. For slot deadlocks: increase the pool size or clear stuck tasks to free slots.
Prevention: Set execution_timeout on every task. Sensors should have a timeout and mode='reschedule' (frees the worker slot while waiting). Set dagrun_timeout to kill entire DAG runs that exceed a maximum wall clock time.

8. Meta-Lessons

These are the things you only learn by shipping and running data systems in production.

Start Simple. Add Complexity When the Data Tells You To.

Every Kafka cluster that started as MSK and ended as a self-managed Strimzi cluster on K8s did so because someone estimated the scale wrong upfront. Start with the managed service. When you have actual throughput numbers, cost data, and specific limitations you've hit, then make the case for self-managing. Premature infrastructure optimization is the same mistake as premature code optimization — just more expensive.

Operational Experience Is a Superpower.

Most engineers who work with data systems have read the documentation. Few have operated them under pressure — disk full at 2am, cascade failures, trying to explain to a VP why the dashboard is wrong. The engineers who have been through production incidents systematically design better systems. They put the 75% disk alert in, they write the playbook, they design the schema to be evolvable. Operational experience is not separate from engineering skill — it is engineering skill.

Data Modeling Is the Highest-Leverage Work.

You can replace Spark with Trino, swap Kafka for Kinesis, and migrate from Parquet to Iceberg. What you cannot easily change is the data model. A poorly modeled fact table that's been in production for two years, with 30 downstream consumers, costs orders of magnitude more to fix than the same mistake caught at design time. Spend disproportionate time on the data model. Get buy-in from upstream producers and downstream consumers before writing a line of pipeline code.

The Best Architecture Is the One Your Team Can Operate.

A sophisticated Lambda architecture with exactly-once Kafka streams, a Flink stateful processor, and an Iceberg lakehouse with row-level deletes is impressive on paper. If your team has two data engineers and neither has operated Flink before, it will be a production incident waiting to happen. Choose the simplest architecture that meets your SLA. Add sophistication only when the simpler approach demonstrably fails to meet requirements.

Invest in Data Contracts Early.

Breaking schema changes cause more production incidents than infrastructure failures. Infrastructure fails loudly — you get an alert, you page someone, you fix it. Schema drift fails silently — your KPI dashboard shows numbers that are 100x off and nobody notices for a week. The cost of setting up consumer-driven contracts and distribution monitoring in month 1 is small. The cost of retrofitting data quality practices after your data has 20 upstream producers and 50 downstream consumers is enormous.

The progression every data platform goes through
  1. Ship raw Parquet to S3, everything works fine
  2. Data quality issues surface, add basic checks
  3. Schema changes break consumers, add Schema Registry
  4. Small files accumulate, add compaction
  5. Need row-level deletes, migrate to Iceberg
  6. Semantic drift causes dashboard errors, add distribution monitoring

You will go through these steps. The question is whether you do each one reactively (after an incident) or proactively (before it costs you). This page exists so you can do them proactively.