Production Insights
War stories & hard-won lessons — things no tutorial teaches you, learned from operating Kafka, Spark, Airflow, and data platforms at scale
1. Data Contracts & Schema Governance
Schema Registry enforces structural compatibility. It does not protect you against the class of breakage that actually causes the most production incidents.
The Three Failure Modes
| Mode | Example | Caught by Schema Registry? |
|---|---|---|
| Field rename | user_id → userId | Yes — breaks backward compatibility check |
| Field drop | Remove amount from payload | Yes — breaks backward compatibility check |
| Field repurpose | amount switches from dollars to cents; status gains 40 new enum values | No. Schema is still structurally valid. Downstream gets silently wrong numbers. |
The Four-Layer Defense
Layer 1: Schema Registry with FULL compatibility
Use FULL compatibility mode — not BACKWARD alone. FULL means a new schema must be both backward-compatible (consumers on the new schema can still read data written with the old one) AND forward-compatible (consumers still on the old schema can read data written with the new one). It catches renames and drops at producer deploy time, before a single message is published.
Use Avro or Protobuf — not raw JSON. JSON has no enforced schema at the wire level. If you must ingest third-party JSON, add a validation layer that routes non-conforming records to a dead-letter topic.
# Check compatibility before registering a new schema version
# Run this in producer CI — fail the build here, not in production
curl -X POST http://schema-registry:8081/compatibility/subjects/order-events-value/versions/latest \
-H "Content-Type: application/vnd.schemaregistry.v1+json" \
-d '{"schema": "..."}'
# {"is_compatible": false} ← build fails, upstream team is notified
Layer 2: Consumer-Driven Contracts
Each downstream team declares which fields they depend on and what invariants they expect. Declarations live in a shared repo. The upstream team's CI validates against all registered consumer contracts before deploying. A breaking change fails the producer's pipeline — not the consumer's pipeline in production at 2am.
# contracts/analytics-platform.yaml
consumer: analytics-platform
depends_on:
- topic: order-events
fields:
- name: order_id
type: string
required: true
- name: total_usd
type: number
required: true
constraints: { min: 0, max: 1000000 }
- name: status
type: string
allowed_values: [pending, confirmed, shipped, cancelled]
# Any new status value must be announced before adding to producer
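The CI-side validation can be sketched in a few lines, assuming the contract YAML above is parsed into a dict and the producer's current schema is available as a field-name → type map (the `validate_contract` helper and the renamed field are illustrative):

```python
def validate_contract(contract, producer_fields):
    """Return a list of violations of a consumer contract against the
    producer's current schema (a field name -> type mapping)."""
    violations = []
    for dep in contract["depends_on"]:
        for field in dep["fields"]:
            name = field["name"]
            if name not in producer_fields:
                violations.append(f"{dep['topic']}: missing field '{name}'")
            elif producer_fields[name] != field["type"]:
                violations.append(
                    f"{dep['topic']}: '{name}' is {producer_fields[name]}, "
                    f"contract expects {field['type']}"
                )
    return violations

# Producer schema after a hypothetical rename of total_usd -> total
producer_fields = {"order_id": "string", "total": "number", "status": "string"}
contract = {
    "consumer": "analytics-platform",
    "depends_on": [{
        "topic": "order-events",
        "fields": [
            {"name": "order_id", "type": "string"},
            {"name": "total_usd", "type": "number"},
            {"name": "status", "type": "string"},
        ],
    }],
}
print(validate_contract(contract, producer_fields))
# A non-empty list here should fail the producer's CI build
```

In CI the upstream team would run this loop over every contract file in the shared repo, so the rename fails the producer's pipeline, not the consumer's.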
Layer 3: Statistical Distribution Monitoring
The only defense against semantic drift. Track value distributions over time: cardinality, null rate, min/max, value frequency. Use Population Stability Index (PSI) to detect when a distribution shifts. A PSI > 0.2 on a field signals a meaningful change — investigate before your dashboards notice.
# Daily field profile check — run as scheduled Spark job
from pyspark.sql import functions as F
def field_profile(df, field: str) -> dict:
return df.agg(
F.count(field).alias("count"),
F.countDistinct(field).alias("cardinality"),
F.sum(F.when(F.col(field).isNull(), 1).otherwise(0)).alias("null_count"),
F.min(field).alias("min_val"),
F.max(field).alias("max_val"),
).collect()[0].asDict()
yesterday = spark.read.parquet("s3://lake/order-events/date=2024-01-14/")
today = spark.read.parquet("s3://lake/order-events/date=2024-01-15/")
for field in ["total_usd", "status", "customer_id"]:
prev = field_profile(yesterday, field)
curr = field_profile(today, field)
cardinality_ratio = abs(curr["cardinality"] - prev["cardinality"]) / max(prev["cardinality"], 1)
null_delta = abs(curr["null_count"] / max(curr["count"], 1) - prev["null_count"] / max(prev["count"], 1))
if cardinality_ratio > 0.5:
alert(f"WARN: {field} cardinality changed {prev['cardinality']} -> {curr['cardinality']}")
if null_delta > 0.05:
alert(f"WARN: {field} null rate shifted by {null_delta:.1%}")
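The profiles above catch cardinality and null-rate shifts; PSI itself compares value-frequency distributions between two windows. A minimal pure-Python sketch (the status counts are illustrative):

```python
import math

def psi(expected, actual, eps=1e-4):
    """Population Stability Index between two value-frequency maps.
    eps avoids log(0) for buckets present in only one distribution."""
    buckets = set(expected) | set(actual)
    exp_total = sum(expected.values()) or 1
    act_total = sum(actual.values()) or 1
    score = 0.0
    for b in buckets:
        e = max(expected.get(b, 0) / exp_total, eps)
        a = max(actual.get(b, 0) / act_total, eps)
        score += (a - e) * math.log(a / e)
    return score

# Status distribution yesterday vs today
yesterday = {"pending": 500, "confirmed": 400, "shipped": 100}
today     = {"pending": 100, "confirmed": 400, "shipped": 500}
print(f"PSI = {psi(yesterday, today):.3f}")  # well above the 0.2 alert threshold
```

Identical distributions score 0; the 0.2 threshold from the text is the conventional "meaningful shift" cutoff.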
Layer 4: Organizational Policy
Technology alone is not enough. The organizational layer is what actually prevents incidents: a 4-week deprecation window before any field removal, a dedicated #data-schema-changes Slack channel with mandatory announcements, declared topic ownership in the schema registry, and a dependency registry so producers can estimate blast radius before making any change.
2. Kafka in Production
Self-Managed vs MSK: How to Decide
| Choice | Best Fit | The Real Trade-off |
|---|---|---|
| Self-managed (EC2 + Docker) | Cost-sensitive at scale (>10 TB/month), need custom tuning, strong infra team | You own upgrades, disk management, and 3am pages |
| Amazon MSK | AWS-native teams, want managed broker lifecycle | Less config flexibility; storage auto-scaling has lag — don't rely on it alone |
| Confluent Cloud | Schema Registry + ksqlDB + Connect as managed services, time-to-value matters | Most expensive; vendor lock-in on Confluent-specific features |
| Kubernetes (Strimzi) | Existing K8s infra, GitOps-driven config | Persistent volume management on K8s is genuinely painful; network complexity |
EBS Volume Management
Kafka is I/O-bound. The disk is the bottleneck in almost every production issue. Get this right upfront.
- Volume type: gp3 — 3,000 IOPS and 125 MB/s baseline included in price. Only upgrade to io2 if you're saturating gp3 throughput.
- Filesystem: XFS. Better than ext4 for Kafka's sequential write + concurrent read pattern.
- Mount options: `noatime,nodiratime` — eliminates unnecessary metadata writes on every read.
- Alert threshold: Page at 75%. Take action before 90%. Do not wait for "it's fine, we have headroom."
- Multiple volumes: Use `log.dirs=/data1,/data2` — striping across EBS volumes increases throughput, and the broker survives a single volume failure (JBOD mode).
A high-volume topic on broker 3 hit 92% disk. The broker stopped accepting writes. Producers, configured with retries, immediately began retrying to brokers 1 and 2. Within 4 minutes, brokers 1 and 2 hit their own I/O saturation limits — not disk, but IOPS — and began experiencing increased latency. Consumer lag on all topics spiked. The monitoring alert for "disk > 75%" had been firing for 6 hours but was in a silenced maintenance window that nobody had cleared.
Fix: Reduced retention on the offending topic to 1 day (from 7), which freed 180 GB immediately. Then expanded the EBS volume online (xfs_growfs requires no downtime). Added an explicit alert that fires even during maintenance windows for any disk > 90%.
# Immediate response when disk is filling fast:
# 1. Reduce retention on high-volume topics (buys time, takes effect within minutes)
kafka-configs.sh --bootstrap-server broker1:9092 \
--entity-type topics --entity-name high-volume-events \
--alter --add-config retention.ms=86400000 # drop to 1 day
# 2. Expand EBS volume online — no downtime required on XFS
aws ec2 modify-volume --volume-id vol-abc123 --size 1000
sudo xfs_growfs /data/kafka # resize the live filesystem
# 3. Check which topics are consuming the most space
du -sh /data/kafka/*/ | sort -rh | head -20
Rolling Upgrades
Never restart all brokers simultaneously. The procedure: stop one broker, upgrade it, bring it back, wait for all under-replicated partitions to reach zero, then move to the next. With min.insync.replicas=2 and replication.factor=3, the cluster tolerates one broker down with no data loss and no write stall.
for BROKER_HOST in broker1 broker2 broker3; do
  ssh $BROKER_HOST "docker compose stop kafka"
  # ...perform the upgrade on this broker...
  ssh $BROKER_HOST "docker compose up -d kafka"
  sleep 30  # give the broker time to rejoin
  # Wait for URP to clear AFTER the restart — while the broker is down,
  # its partitions are under-replicated by definition and the count never hits zero
  echo "Waiting for under-replicated partitions to clear..."
  while true; do
    URP=$(kafka-topics.sh --bootstrap-server broker1:9092 \
      --describe --under-replicated-partitions 2>/dev/null | wc -l)
    [ "$URP" -eq 0 ] && break
    echo "  $URP URP remaining..." && sleep 10
  done
done
# Rebalance leadership after all restarts
kafka-leader-election.sh --bootstrap-server broker1:9092 \
--election-type PREFERRED --all-topic-partitions
Consumer Lag: Monitor the Trend, Not the Number
An absolute lag of 50,000 messages is fine for a batch job processing daily snapshots. The same lag on a payment processing consumer is a crisis. What matters is whether lag is growing. A lag that stays flat at 10,000 is stable. A lag that grew from 100 to 1,000 over 15 minutes needs investigation now.
# Quick CLI check
kafka-consumer-groups.sh --bootstrap-server broker1:9092 \
--describe --group my-consumer-group
# Focus on: is LAG stable? or growing?
# Alert rule (Prometheus/Grafana)
# Alert when: rate of change of consumer lag > 0 over 15-minute window
# Not: when absolute lag > N (this threshold is workload-dependent)
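The trend rule reduces to a few lines: given periodic lag samples over a window, alert on sustained growth rather than an absolute threshold (the sample data and the 2x growth criterion are illustrative):

```python
def lag_is_growing(samples, growth_ratio=2.0):
    """True when lag at the end of the window is growth_ratio times the
    start AND still rising at the end — flat or draining lag is fine."""
    if len(samples) < 2:
        return False
    start = max(samples[0], 1)  # avoid div-by-zero on a fresh consumer group
    return samples[-1] / start >= growth_ratio and samples[-1] > samples[-2]

# Lag sampled once a minute over the window
stable_lag  = [10_000] * 15                    # big but flat: fine
growing_lag = [100, 150, 230, 400, 700, 1000]  # small but climbing: investigate now
print(lag_is_growing(stable_lag), lag_is_growing(growing_lag))  # → False True
```

The same logic is what a `rate()`-based Prometheus rule expresses; the Python version just makes the workload-independence of the check explicit.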
Debezium: The Snapshot Bottleneck
When Debezium starts on a new table — or when you add a table to an existing connector — it runs a full SELECT * snapshot. On a 500 GB table, this runs for hours and floods the cluster. The default Kafka Connect producer settings are tuned for low-latency streaming, not bulk throughput. You must override them for the snapshot phase.
- WAL retention from abandoned slots: An idle Debezium connector holds a replication slot open. Postgres retains WAL indefinitely for that slot. If the slot lags far enough, it can fill the disk and make Postgres go read-only. Set `max_slot_wal_keep_size = '50GB'` as a safety cap and monitor `pg_replication_slots`.
- TOAST columns show as NULL on UPDATE: Postgres doesn't include unchanged TOAST (large text/jsonb) columns in WAL. Set `REPLICA IDENTITY FULL` on tables with TOAST columns you need in your CDC stream.
- Always 1 task: Debezium Postgres reads a single WAL stream. `tasks.max` must be 1. Setting it higher does nothing except confuse you.
- Run the snapshot against a replica: The initial `SELECT *` can saturate the primary's I/O. Point Debezium at a read replica for the snapshot phase.
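For the producer overrides themselves, Kafka Connect supports per-connector `producer.override.*` properties (the worker must permit this via `connector.client.config.override.policy=All`). A sketch of bulk-throughput settings for the snapshot phase — the connector name and values are illustrative starting points, not tuned numbers:

```python
import json

# Bulk-throughput overrides for the snapshot phase. The Connect worker's
# defaults (small batches, no compression) are tuned for low-latency
# streaming, which is exactly wrong for a multi-hour SELECT * snapshot.
snapshot_config = {
    "producer.override.batch.size": 512 * 1024,            # 512 KB batches
    "producer.override.linger.ms": 100,                    # wait to fill batches
    "producer.override.compression.type": "lz4",
    "producer.override.buffer.memory": 128 * 1024 * 1024,  # 128 MB buffer
}

# Merge these into the connector config and PUT it to the Connect REST API,
# e.g. PUT http://connect:8083/connectors/postgres-cdc/config
print(json.dumps(snapshot_config, indent=2))
```

Revert (or simply remove) the overrides after the snapshot completes; the streaming phase does not need them.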
# Monitor replication slot health — run this daily
psql -c "
SELECT slot_name,
active,
pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS wal_lag
FROM pg_replication_slots;
"
# wal_lag should stay well under 1 GB. If it's growing, Debezium is lagging or dead.
MirrorMaker 2: Cross-Team Isolation
MM2's most underrated use case is not disaster recovery — it's team isolation. When Team B needs a subset of Team A's topics, direct cross-cluster consumers create tight operational coupling. A misbehaving consumer in Team B's service can affect Team A's cluster performance. MM2 mirrors specific topics into Team B's own cluster. Team B consumers talk only to their own cluster. The blast radius of Team B's incidents stays within Team B.
Keep MM2's default replication policy, which prefixes mirrored topic names with the source cluster alias (e.g. source.team-a.events). This prevents infinite replication loops. IdentityReplicationPolicy removes the prefix but causes infinite loops in any active-active configuration. Only use it for strictly one-way migrations where you're certain the flow will never reverse.
3. Spark in Production
Platform Decision Matrix
| Platform | Best For | Avoid If |
|---|---|---|
| EMR (AWS) | S3-heavy pipelines, AWS-native teams, spot instances for 60-80% cost savings | Jobs need <5 min startup; many small frequent jobs (spin-up cost dominates) |
| Databricks | Fastest time-to-value, Delta Lake, Unity Catalog, notebook-first teams | Tight budget; need full infra control; already have a Spark cluster |
| Kubernetes | Multi-cloud, containerized workloads, existing K8s infra | Small team without dedicated K8s expertise; PV management will cost you time |
| YARN on-prem | Existing Hadoop cluster, data that can't leave your data center | Starting fresh — Hadoop's operational burden is high for the marginal control gained |
Packaging Pitfalls
The most common reason a Spark job works locally and fails on the cluster is a packaging problem.
- Mark Spark's own artifacts as provided: declare spark-sql, spark-core, etc. as `<scope>provided</scope>` in Maven (or `compileOnly` in Gradle). The cluster already has them. Bundling them causes classloader conflicts that produce cryptic errors at runtime, not at build time.
- Python: Use `--py-files deps.zip` for third-party packages. For reproducibility on EMR, use `conda-pack` to ship a complete environment archive.
- Java/Scala: Build an uber-jar with the shade/shadow/assembly plugin. All your deps go in; Spark's deps stay out (`provided` scope).
- Version alignment: The Scala version in your JAR (`2.12` vs `2.13`) must match what the cluster was built with. A version mismatch throws `ClassNotFoundException` at runtime.
The Small Files Problem
Every Spark task writes one output file per partition. A job with 1,000 partitions that each write 512 KB produces 1,000 files totaling 500 MB. The next job reading that output spawns 1,000 tasks for 500 MB of data — the task scheduling overhead dominates actual work. This compounds over time: a daily job creating 1,000 small files means 365,000 files after a year.
# Coalesce before write — fewer output files, no full shuffle
# Target: 128-512 MB per output file
df.coalesce(20).write.mode("overwrite").parquet("s3://bucket/output/")
# Do NOT use repartition() here — it does a full shuffle just to reduce file count
# coalesce() achieves the same result by merging partitions on existing tasks
# For Delta Lake / Iceberg: run OPTIMIZE periodically instead of coalescing
# OPTIMIZE rewrites small files into larger ones without reprocessing the data
# spark.sql("OPTIMIZE delta.`s3://bucket/output/`")
Set `spark.sql.adaptive.coalescePartitions.enabled=true` and AQE will merge small shuffle partitions at runtime. For file output, you still need to coalesce explicitly — AQE does not control the number of output files.
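Picking the coalesce factor is just arithmetic: estimate total output size and divide by the target file size. A sketch (the 256 MB default sits inside the 128–512 MB range above):

```python
import math

def target_partition_count(total_bytes, target_file_bytes=256 * 1024 * 1024):
    """How many partitions to coalesce to for ~target-sized output files."""
    return max(1, math.ceil(total_bytes / target_file_bytes))

# 1,000 partitions x 512 KB each = 500 MB total -> 2 output files, not 1,000
total = 1000 * 512 * 1024
n = target_partition_count(total)
print(n)  # → 2
# df.coalesce(n).write.mode("overwrite").parquet("s3://bucket/output/")
```

In practice you'd estimate `total_bytes` from a previous run's output or from `df` row count x average row width, since Spark can't tell you the size before writing.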
Broadcast Join for Small Dimension Tables
A sort-merge join on a large fact table against a small dimension table (country codes, product catalog, user segments) shuffles the large table across the network unnecessarily. If the dimension table fits in memory (under 100–200 MB), broadcast it. Each executor gets a local copy; no shuffle needed on the fact table side.
from pyspark.sql.functions import broadcast
# Force a broadcast join for the small dimension table
result = df_orders.join(broadcast(df_country_codes), "country_id")
# Raise the auto-broadcast threshold if your dimension tables are a bit larger
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 100 * 1024 * 1024) # 100 MB
# SQL hint syntax
spark.sql("""
SELECT /*+ BROADCAST(c) */ o.*, c.country_name
FROM orders o JOIN country_codes c ON o.country_id = c.id
""")
Memory Tuning
OOM errors in Spark almost always have one of three causes: too many partitions per executor, a broadcast variable that's too large for the driver, or a collect() that pulls everything to the driver JVM.
| Parameter | Rule of Thumb | Common Mistake |
|---|---|---|
| spark.executor.memory | 4–18 GB. 5 cores per executor is a sweet spot for parallelism vs overhead. | Setting it too high leaves no memory for the OS and causes container eviction |
| spark.executor.memoryOverhead | 10% of executor memory, min 384 MB. Increase for PySpark (Python process runs outside JVM) | Ignoring it entirely. PySpark OOMs are often overhead exhaustion, not JVM heap |
| spark.driver.memory | 4–8 GB. Only needs to be large if you collect() data or broadcast large tables | Setting driver memory the same as executor memory "just to be safe" — wasteful |
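The rules of thumb in the table can be turned into a quick sizing calculation: reserve resources for the OS, then divide what's left. A sketch — the 1-core/1-GB OS reservation and the 10%/384 MB overhead guideline are the assumptions:

```python
def size_executors(node_cores, node_mem_gb, cores_per_executor=5):
    """Executors per node and memory split, following the 5-cores-per-executor
    rule of thumb and the 10% / min-384 MB memoryOverhead guideline."""
    usable_cores = node_cores - 1   # leave 1 core for OS and daemons
    usable_mem = node_mem_gb - 1    # leave 1 GB for the OS
    executors = max(1, usable_cores // cores_per_executor)
    mem_per_executor = usable_mem / executors
    overhead = max(0.10 * mem_per_executor, 0.384)  # 10%, min 384 MB
    heap = mem_per_executor - overhead
    return executors, round(heap, 1), round(overhead, 1)

# A 16-core / 64 GB node
executors, heap_gb, overhead_gb = size_executors(16, 64)
print(executors, heap_gb, overhead_gb)  # 3 executors, ~18.9 GB heap, ~2.1 GB overhead
```

`heap` maps to `spark.executor.memory` and `overhead` to `spark.executor.memoryOverhead`; for PySpark you would shift more of the budget into overhead.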
4. Airflow in Production
DAG Design Patterns That Hold Up
- One DAG per logical pipeline. Don't build a DAG with 200 tasks that does everything. Small, focused DAGs are easier to monitor, backfill, and reason about when they fail.
- Idempotent tasks. Every task should be safe to re-run without side effects. Write to a staging location, then atomically swap. Use `TRUNCATE + INSERT` or `INSERT OVERWRITE`, not append.
- Use template fields for dates. Never hardcode dates in task parameters. Use `{{ ds }}`, `{{ data_interval_start }}`. This makes backfilling work correctly without code changes.
- SLA tracking per task. Set `sla` on tasks that feed dashboards or downstream pipelines. An SLA miss callback is different from a task failure alert — configure both separately.
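The staging-then-swap pattern from the idempotency bullet, sketched in plain Python with the local filesystem standing in for the warehouse (the paths and the `ds` date parameter, which would come from the `{{ ds }}` template field, are illustrative):

```python
import os
import tempfile

def write_partition_idempotently(output_dir, ds, rows):
    """Write one date partition so re-runs fully replace it — never append.
    Writes to a staging file first, then atomically swaps it into place."""
    os.makedirs(output_dir, exist_ok=True)
    final_path = os.path.join(output_dir, f"date={ds}.csv")
    # Stage in the same directory so os.replace is an atomic rename
    fd, staging_path = tempfile.mkstemp(dir=output_dir, suffix=".staging")
    with os.fdopen(fd, "w") as f:
        f.write("\n".join(rows))
    os.replace(staging_path, final_path)  # readers never see partial output
    return final_path

# Safe to re-run for the same execution date: the second run replaces,
# not appends — exactly what a backfill needs
path = write_partition_idempotently("/tmp/demo_output", "2024-01-15", ["a,1", "b,2"])
path = write_partition_idempotently("/tmp/demo_output", "2024-01-15", ["a,1", "b,2"])
print(open(path).read())
```

The SQL equivalent is `INSERT OVERWRITE` into the partition keyed by the execution date; the point is the same — re-running a task for a date must converge to the same state.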
In a previous platform, the data engineering team manually fielded every analytics request by writing a new DAG. Within 6 months, the team had 400+ DAGs with half owned by people who had left. Backfill requests took 2 weeks because someone had to track down the DAG author. The scheduler was overloaded with zombie DAGs that nobody dared delete.
Lesson: Build self-service data access early — parameterized DAG templates, a simple UI to trigger backfills, and a clear ownership model. Every DAG needs an owner in the code. Ownerless DAGs should not be schedulable.
Alerting: SLA Miss vs Task Failure
These are different failure modes requiring different responses. A task failure means something is broken — the code, an upstream API, a database connection. An SLA miss means the pipeline ran, possibly succeeded, but took longer than expected. Conflating them means your on-call team gets paged for SLA misses on low-priority reporting DAGs and ignores them, then misses a real task failure on a critical pipeline.
- Task failure: Page immediately. Something is broken. Block downstream consumers if they depend on this output.
- SLA miss on a critical pipeline: Page. The data is late, which may mean downstream dashboards are stale or dependent jobs haven't started.
- SLA miss on a low-priority reporting DAG: Ticket, not a page. Log it, review in the morning.
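The routing rules above reduce to a small decision table (a sketch; the event names and `page`/`ticket` outcomes are illustrative labels for whatever your callbacks actually do):

```python
def route_alert(event, critical_pipeline):
    """Map Airflow failure modes to a response channel: task failures
    always page; SLA misses page only on critical pipelines."""
    if event == "task_failure":
        return "page"
    if event == "sla_miss":
        return "page" if critical_pipeline else "ticket"
    return "log"

print(route_alert("task_failure", critical_pipeline=False))  # → page
print(route_alert("sla_miss", critical_pipeline=True))       # → page
print(route_alert("sla_miss", critical_pipeline=False))      # → ticket
```

In Airflow terms, `task_failure` corresponds to `on_failure_callback` and `sla_miss` to the DAG-level `sla_miss_callback` — wiring them both to the same pager is what produces the alert fatigue described above.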
Backfill Strategy
Design for backfill from day one. When a bug is fixed or a new field is added, you will need to re-process historical data. If your DAG reads from Kafka and the data is no longer retained, backfill is impossible. If your task hardcodes TODAY instead of the execution date, every historical run produces today's data.
# Trigger a backfill for a date range
airflow dags backfill --start-date 2024-01-01 --end-date 2024-01-31 my_pipeline_dag
# Backfill runs are marked as such — they don't trigger downstream sensors
# Run with --dry-run first to see what will execute
airflow dags backfill --start-date 2024-01-01 --end-date 2024-01-31 --dry-run my_pipeline_dag
# If your tasks are idempotent, backfill is safe to run anytime
# If they're not — fix them before you need to backfill under pressure
5. Data Quality & Observability
Nobody knows data is bad until a business user spots a wrong number in a dashboard. By then, the problem is usually days old. Build observability into the pipeline, not as an afterthought.
The Minimum Viable Check Set
| Check | What It Catches | Alert Threshold |
|---|---|---|
| Freshness | Source data is stale — upstream job failed silently or stopped producing | Latest record timestamp > expected interval + buffer |
| Row count | Complete data loss, partial load, truncation, dedup failure | Today's count < 80% or > 200% of 7-day average |
| Null rate | Required field going null — upstream schema change, ETL bug | Null rate for required field > 1% |
| Referential integrity | Orphaned records — order with no matching customer | Unmatched rate > 0.1% (tune to your data's real orphan rate) |
| Distribution drift (PSI) | Semantic changes, unit switches, enum expansion | PSI > 0.2 on key fields |
-- Freshness check: has the source updated in the last 2 hours?
SELECT
MAX(event_time) AS latest_event,
CASE
WHEN MAX(event_time) < NOW() - INTERVAL '2 hours' THEN 'STALE'
ELSE 'OK'
END AS freshness_status
FROM order_events;
-- Row count check: compare today vs 7-day average
WITH daily_counts AS (
SELECT DATE(event_time) AS dt, COUNT(*) AS row_count
FROM order_events
WHERE event_time >= NOW() - INTERVAL '8 days'
GROUP BY 1
),
baseline AS (
SELECT AVG(row_count) AS avg_count
FROM daily_counts WHERE dt < CURRENT_DATE
)
SELECT
d.dt,
d.row_count,
b.avg_count,
d.row_count / NULLIF(b.avg_count, 0) AS ratio,
CASE
WHEN d.row_count / NULLIF(b.avg_count, 0) < 0.8 THEN 'LOW'
WHEN d.row_count / NULLIF(b.avg_count, 0) > 2.0 THEN 'HIGH'
ELSE 'OK'
END AS status
FROM daily_counts d, baseline b
WHERE d.dt = CURRENT_DATE;
6. Storage & Table Formats
The Raw Parquet Trap
Starting with raw Parquet files seems simple. After 12 months of production, you have these problems:
- You need to delete a user's data (GDPR request). There is no `DELETE` on Parquet — you rewrite every file for every partition that contained their records.
- You need to add a column to a historical partition. Parquet files don't evolve; you have to rewrite or accept nulls from Spark's schema merge (fragile, slow).
- You have 500,000 small files because nobody ran compaction. Listing the S3 directory takes 40 seconds before any query runs.
- A failed job wrote partial output and corrupted the partition. There is no transaction log. You don't know exactly which files are good.
Partition Strategy
Partition on fields that appear in WHERE clauses of your most common queries. Date is almost always one of them. For multi-tenant systems, date + tenant_id lets you efficiently answer "show me all events for tenant X in January" without scanning all tenants.
- Don't over-partition: Partitioning on a high-cardinality field (user_id with millions of users) creates millions of directories. S3 list operations become the bottleneck.
- Don't under-partition: No partitioning means every query scans the entire dataset.
- Iceberg hidden partitioning: Iceberg's hidden partitioning (partition by `days(event_time)`) avoids the common mistake of forgetting to truncate dates in queries, which bypasses partition pruning in raw Parquet setups.
Compaction
Streaming jobs write many small files — one per micro-batch per partition. Without a compaction job, read performance degrades over time as the file count grows. Run a weekly or daily compaction job that rewrites small files into larger ones (target: 128–512 MB per file).
-- Delta Lake: OPTIMIZE rewrites small files, Z-ORDER clusters related data
OPTIMIZE delta.`s3://bucket/events/`
ZORDER BY (user_id, event_type);
-- Iceberg: rewrite data files smaller than 128 MB
CALL catalog.system.rewrite_data_files(
table => 'db.events',
options => map('target-file-size-bytes', '134217728',
'min-file-size-bytes', '33554432')
);
7. Operational Playbooks
These are the incidents that repeat. Having a written playbook — even a short one — cuts mean time to resolution by at least 50%.
Playbook: Kafka broker disk full
- Symptoms: Producers fail with NOT_ENOUGH_REPLICAS or LEADER_NOT_AVAILABLE. Broker logs show disk write errors. Consumer lag spiking on all topics.
- Diagnose: `df -h /data/kafka`. Check which topics are largest: `du -sh /data/kafka/*/ | sort -rh | head -10`.
- Mitigate: Reduce retention on the largest topics: `kafka-configs.sh --alter --add-config retention.ms=86400000`. Wait 5–10 minutes for log cleanup to run. Then expand EBS: `aws ec2 modify-volume --size NEW_SIZE` + `sudo xfs_growfs /data/kafka`.
- Prevent: Size disks as daily_ingest_GB × retention_days × replication_factor × 1.3. Use tiered storage (Kafka 3.6+) to offload cold segments to S3.
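The prevention formula as arithmetic (a sketch; the ingest and retention numbers are illustrative):

```python
def kafka_disk_gb(daily_ingest_gb, retention_days,
                  replication_factor=3, headroom=1.3):
    """Cluster-wide disk needed: ingest x retention x replication,
    plus 30% headroom so the 75% alert doesn't fire on day one."""
    return daily_ingest_gb * retention_days * replication_factor * headroom

# 50 GB/day, 7-day retention, RF=3 -> ~1.4 TB across the cluster
print(kafka_disk_gb(50, 7))  # → 1365.0
```

Divide the result by the broker count for per-broker volume size, and remember it grows linearly with any retention bump.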
Playbook: Debezium replication slot lagging
- Symptoms: `pg_replication_slots` shows large lag. Possible Postgres disk full alert.
- Diagnose: `SELECT slot_name, active, pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) FROM pg_replication_slots;`. Check Kafka Connect status: `curl http://connect:8083/connectors/postgres-cdc/status | jq .`
- Mitigate: Restart or fix the connector. If it is gone for good, drop the slot: `SELECT pg_drop_replication_slot('slot_name');` — the connector will re-snapshot on restart.
- Prevent: Set `max_slot_wal_keep_size = '50GB'` in Postgres config. Alert on `pg_replication_slots` lag > 10 GB. Never let a Debezium connector sit stopped for more than a few hours on a high-write database.
Playbook: Spark executors killed with OOM
- Symptoms: `ExecutorLostFailure (executor X exited caused by one of the running tasks)`. YARN or K8s killed the container. Stage keeps retrying.
- Fix: Increase `spark.executor.memory` or reduce `spark.executor.cores` (fewer concurrent tasks per executor = less heap pressure). For PySpark OOM: increase `spark.executor.memoryOverhead`. For skew OOM: salt the join key or use AQE skew join handling.
Playbook: Airflow tasks hanging indefinitely
- Prevent: Set `execution_timeout` on every task. Sensors should have `timeout` and `mode='reschedule'` (frees the worker slot while waiting). Set `dagrun_timeout` to kill entire DAG runs that exceed a maximum wall clock time.
8. Meta-Lessons
These are the things you only learn by shipping and running data systems in production.
Start Simple. Add Complexity When the Data Tells You To.
Every Kafka cluster that started as MSK and ended as a self-managed Strimzi cluster on K8s did so because someone estimated the scale wrong upfront. Start with the managed service. When you have actual throughput numbers, cost data, and specific limitations you've hit, then make the case for self-managing. Premature infrastructure optimization is the same mistake as premature code optimization — just more expensive.
Operational Experience Is a Superpower.
Most engineers who work with data systems have read the documentation. Few have operated them under pressure — disk full at 2am, cascade failures, trying to explain to a VP why the dashboard is wrong. The engineers who have been through production incidents systematically design better systems. They put the 75% disk alert in, they write the playbook, they design the schema to be evolvable. Operational experience is not separate from engineering skill — it is engineering skill.
Data Modeling Is the Highest-Leverage Work.
You can replace Spark with Trino, swap Kafka for Kinesis, and migrate from Parquet to Iceberg. What you cannot easily change is the data model. A poorly modeled fact table that's been in production for two years, with 30 downstream consumers, costs orders of magnitude more to fix than the same mistake caught at design time. Spend disproportionate time on the data model. Get buy-in from upstream producers and downstream consumers before writing a line of pipeline code.
The Best Architecture Is the One Your Team Can Operate.
A sophisticated Lambda architecture with exactly-once Kafka streams, a Flink stateful processor, and an Iceberg lakehouse with row-level deletes is impressive on paper. If your team has two data engineers and neither has operated Flink before, it will be a production incident waiting to happen. Choose the simplest architecture that meets your SLA. Add sophistication only when the simpler approach demonstrably fails to meet requirements.
Invest in Data Contracts Early.
Breaking schema changes cause more production incidents than infrastructure failures. Infrastructure fails loudly — you get an alert, you page someone, you fix it. Schema drift fails silently — your KPI dashboard shows numbers that are 100x off and nobody notices for a week. The cost of setting up consumer-driven contracts and distribution monitoring in month 1 is small. The cost of retrofitting data quality practices after your data has 20 upstream producers and 50 downstream consumers is enormous.
- Ship raw Parquet to S3, everything works fine
- Data quality issues surface, add basic checks
- Schema changes break consumers, add Schema Registry
- Small files accumulate, add compaction
- Need row-level deletes, migrate to Iceberg
- Semantic drift causes dashboard errors, add distribution monitoring
You will go through these steps. The question is whether you do each one reactively (after an incident) or proactively (before it costs you). This page exists so you can do them proactively.