Data Observability: Detecting Silent Pipeline Failures Before Users Do
How to catch data quality failures — null explosions, schema drift, stale tables, distribution shifts — before they reach your dashboards and erode trust in your data.
The Silent Failure Problem
Traditional monitoring catches infrastructure failures: server down, disk full, query timeout. Data pipelines have a different failure mode: the pipeline runs successfully, but the data is wrong.
A column silently becomes 80% null. A currency conversion uses yesterday's exchange rate. An upstream team changes a field name. Your pipeline keeps running. Your dashboard keeps updating. Your stakeholders keep making decisions on corrupted data.
Data observability is the practice of catching these failures before they reach your consumers.
The Four Failure Categories
1. Freshness failures — the table wasn't updated when it should have been.
-- Check: last update should be within 26 hours for a daily pipeline
SELECT
table_name,
MAX(updated_at) AS last_update,
CURRENT_TIMESTAMP - MAX(updated_at) AS staleness
FROM your_fact_table
GROUP BY table_name
HAVING CURRENT_TIMESTAMP - MAX(updated_at) > INTERVAL '26 hours';
2. Volume failures — row count is abnormally high or low.
3. Schema failures — column added, removed, or type changed.
4. Distribution failures — value distributions shifted significantly (null rate, numeric range, cardinality).
Implementing Tests with Great Expectations
Great Expectations lets you define a "data contract" that runs after every pipeline execution.
import great_expectations as gx
context = gx.get_context()
# Define expectations on your orders table
validator = context.get_validator(
datasource_name="production_db",
data_asset_name="orders",
)
# Volume: at least 100 orders per day
validator.expect_table_row_count_to_be_between(
min_value=100,
max_value=None
)
# No nulls on critical columns
validator.expect_column_values_to_not_be_null("order_id")
validator.expect_column_values_to_not_be_null("customer_id")
validator.expect_column_values_to_not_be_null("amount_cents")
# Amount must be positive
validator.expect_column_values_to_be_between(
column="amount_cents",
min_value=1,
max_value=10_000_000 # $100k max — flag anything above
)
# Status must be from known set
validator.expect_column_values_to_be_in_set(
column="status",
value_set=["pending", "confirmed", "shipped", "cancelled", "refunded"]
)
# Save and run
validator.save_expectation_suite(discard_failed_expectations=False)
results = validator.validate()
if not results.success:
# Alert the data engineering team
send_alert(f"Data quality check failed: {results.statistics}")
raise Exception("Pipeline halted due to data quality failure")
Distribution Drift Detection
Point-in-time checks catch known failures. Distribution drift catches the unknown ones.
from scipy import stats
import pandas as pd
def detect_distribution_drift(
reference: pd.Series,
current: pd.Series,
column_name: str,
threshold: float = 0.05
) -> dict:
"""
Use Kolmogorov-Smirnov test for numeric columns,
chi-square test for categorical columns.
"""
if reference.dtype in ["float64", "int64"]:
statistic, p_value = stats.ks_2samp(
reference.dropna(),
current.dropna()
)
else:
# Categorical: compare frequency distributions
ref_freq = reference.value_counts(normalize=True)
cur_freq = current.value_counts(normalize=True)
aligned = pd.concat([ref_freq, cur_freq], axis=1, keys=["ref", "cur"]).fillna(0)
statistic, p_value = stats.chisquare(aligned["cur"], aligned["ref"])
drifted = p_value < threshold
return {
"column": column_name,
"drifted": drifted,
"p_value": round(p_value, 4),
"action": "ALERT" if drifted else "OK"
}
Building the Monitoring Dashboard
Track these metrics per table in Grafana:
| Metric | Type | Alert |
|---|---|---|
rows_last_24h | Gauge | < p10 of 30-day rolling |
null_rate_{column} | Gauge | > 2× 7-day average |
time_since_last_update | Gauge | > SLA threshold |
schema_change_detected | Counter | Any change |
Write these metrics from your pipeline orchestrator (Airflow, Prefect) at the end of each DAG run.
The Mindset Shift
Data observability isn't a tool you buy — it's a practice you build. Start with freshness checks (easiest, highest impact), add volume tests, then layer in distribution monitoring as your pipeline matures.
The goal isn't perfect data. It's knowing when your data is imperfect before your stakeholders find out.