Back to Blog
Data Pipelines
MLOps

Data Observability: Detecting Silent Pipeline Failures Before Users Do

How to catch data quality failures — null explosions, schema drift, stale tables, distribution shifts — before they reach your dashboards and erode trust in your data.

7 min readMay 20, 2026Netvionix Team
Data Observability: Detecting Silent Pipeline Failures Before Users Do

The Silent Failure Problem

Traditional monitoring catches infrastructure failures: server down, disk full, query timeout. Data pipelines have a different failure mode: the pipeline runs successfully, but the data is wrong.

A column silently becomes 80% null. A currency conversion uses yesterday's exchange rate. An upstream team changes a field name. Your pipeline keeps running. Your dashboard keeps updating. Your stakeholders keep making decisions on corrupted data.

Data observability is the practice of catching these failures before they reach your consumers.


The Four Failure Categories

1. Freshness failures — the table wasn't updated when it should have been.

-- Check: last update should be within 26 hours for a daily pipeline
SELECT
    table_name,
    MAX(updated_at) AS last_update,
    CURRENT_TIMESTAMP - MAX(updated_at) AS staleness
FROM your_fact_table
GROUP BY table_name
HAVING CURRENT_TIMESTAMP - MAX(updated_at) > INTERVAL '26 hours';

2. Volume failures — row count is abnormally high or low.

3. Schema failures — column added, removed, or type changed.

4. Distribution failures — value distributions shifted significantly (null rate, numeric range, cardinality).


Implementing Tests with Great Expectations

Great Expectations lets you define a "data contract" that runs after every pipeline execution.

import great_expectations as gx

context = gx.get_context()

# Define expectations on your orders table
validator = context.get_validator(
    datasource_name="production_db",
    data_asset_name="orders",
)

# Volume: at least 100 orders per day
validator.expect_table_row_count_to_be_between(
    min_value=100,
    max_value=None
)

# No nulls on critical columns
validator.expect_column_values_to_not_be_null("order_id")
validator.expect_column_values_to_not_be_null("customer_id")
validator.expect_column_values_to_not_be_null("amount_cents")

# Amount must be positive
validator.expect_column_values_to_be_between(
    column="amount_cents",
    min_value=1,
    max_value=10_000_000  # $100k max — flag anything above
)

# Status must be from known set
validator.expect_column_values_to_be_in_set(
    column="status",
    value_set=["pending", "confirmed", "shipped", "cancelled", "refunded"]
)

# Save and run
validator.save_expectation_suite(discard_failed_expectations=False)
results = validator.validate()

if not results.success:
    # Alert the data engineering team
    send_alert(f"Data quality check failed: {results.statistics}")
    raise Exception("Pipeline halted due to data quality failure")

Distribution Drift Detection

Point-in-time checks catch known failures. Distribution drift catches the unknown ones.

from scipy import stats
import pandas as pd

def detect_distribution_drift(
    reference: pd.Series,
    current: pd.Series,
    column_name: str,
    threshold: float = 0.05
) -> dict:
    """
    Use Kolmogorov-Smirnov test for numeric columns,
    chi-square test for categorical columns.
    """
    if reference.dtype in ["float64", "int64"]:
        statistic, p_value = stats.ks_2samp(
            reference.dropna(),
            current.dropna()
        )
    else:
        # Categorical: compare frequency distributions
        ref_freq = reference.value_counts(normalize=True)
        cur_freq = current.value_counts(normalize=True)
        aligned = pd.concat([ref_freq, cur_freq], axis=1, keys=["ref", "cur"]).fillna(0)
        statistic, p_value = stats.chisquare(aligned["cur"], aligned["ref"])

    drifted = p_value < threshold

    return {
        "column": column_name,
        "drifted": drifted,
        "p_value": round(p_value, 4),
        "action": "ALERT" if drifted else "OK"
    }

Building the Monitoring Dashboard

Track these metrics per table in Grafana:

MetricTypeAlert
rows_last_24hGauge< p10 of 30-day rolling
null_rate_{column}Gauge> 2× 7-day average
time_since_last_updateGauge> SLA threshold
schema_change_detectedCounterAny change

Write these metrics from your pipeline orchestrator (Airflow, Prefect) at the end of each DAG run.


The Mindset Shift

Data observability isn't a tool you buy — it's a practice you build. Start with freshness checks (easiest, highest impact), add volume tests, then layer in distribution monitoring as your pipeline matures.

The goal isn't perfect data. It's knowing when your data is imperfect before your stakeholders find out.