
Observability for Data Pipelines: Applying SRE Principles to Data

#data-engineering #observability #data-quality #monitoring

Software engineering largely solved its reliability problem years ago with SRE practices: SLOs, SLIs, incident response, and blameless postmortems. Data engineering is only now catching up. Data downtime -- periods when data is missing, inaccurate, or stale -- costs organizations real money in bad decisions and eroded trust. This post maps SRE observability principles onto the data domain.

The Five Pillars of Data Observability

Data Observability Pillars
│
├── 1. Freshness
│   └── Is the data arriving on time?
│       ├── SLI: Time since last update
│       └── SLO: Table X updated within 2 hours of source event
│
├── 2. Volume
│   └── Is the expected amount of data present?
│       ├── SLI: Row count delta vs expected
│       └── SLO: Daily row count within 10% of 7-day moving average
│
├── 3. Schema
│   └── Has the structure changed unexpectedly?
│       ├── SLI: Schema diff count per deployment
│       └── SLO: Zero unannounced breaking schema changes
│
├── 4. Distribution
│   └── Are values within expected ranges?
│       ├── SLI: Percentage of values outside learned bounds
│       └── SLO: < 0.1% of values flagged as anomalous
│
└── 5. Lineage
    └── Can we trace data from source to consumption?
        ├── SLI: Percentage of tables with complete lineage
        └── SLO: 100% of critical path tables have lineage
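The freshness pillar is the easiest to make concrete. Here is a minimal sketch of a freshness SLI/SLO check, assuming the table exposes a `last_updated` timestamp (the two-hour SLO matches pillar 1 above; the function names are illustrative, not any vendor's API):

```python
from datetime import datetime, timedelta, timezone

# Assumed SLO from pillar 1: table updated within 2 hours of the source event.
FRESHNESS_SLO = timedelta(hours=2)

def freshness_sli(last_updated: datetime, now: datetime) -> timedelta:
    """SLI: time elapsed since the table's last successful update."""
    return now - last_updated

def freshness_slo_met(last_updated: datetime, now: datetime) -> bool:
    """SLO check: is the elapsed time within the freshness budget?"""
    return freshness_sli(last_updated, now) <= FRESHNESS_SLO

now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
fresh = datetime(2024, 1, 1, 11, 0, tzinfo=timezone.utc)  # 1h old -> within SLO
stale = datetime(2024, 1, 1, 8, 0, tzinfo=timezone.utc)   # 4h old -> violation
print(freshness_slo_met(fresh, now))  # True
print(freshness_slo_met(stale, now))  # False
```

In practice `last_updated` would come from warehouse metadata (e.g. a max ingestion timestamp), and the alert threshold would typically sit below the SLO itself so you get warning before the budget is blown.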

Data SLO Framework

| SLO Category | Example SLO | Measurement Method | Alert Threshold |
|---|---|---|---|
| Freshness | Orders table updated within 1h of event | Last modified timestamp | > 1.5h since last update |
| Completeness | < 0.5% null values in required fields | Null count / total rows | > 0.5% nulls |
| Uniqueness | 0 duplicate primary keys | Duplicate count query | > 0 duplicates |
| Accuracy | Revenue totals match source within 0.01% | Cross-system reconciliation | > 0.01% delta |
| Volume | Daily row count within 2 sigma of 30-day avg | Statistical comparison | Outside 2 sigma |
| Schema stability | Zero breaking changes without 7-day notice | Schema diff monitoring | Any breaking change |
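The volume SLO in the table is a simple statistical band. A sketch of that check, assuming you already have a trailing window of daily row counts (the numbers below are fabricated for illustration):

```python
import statistics

def volume_within_band(history: list[int], today: int, sigmas: float = 2.0) -> bool:
    """Volume SLI: is today's row count within `sigmas` standard
    deviations of the trailing-window mean?"""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    return abs(today - mean) <= sigmas * stdev

# 30 days of daily row counts (fabricated, roughly stable around 1000).
history = [1000, 1020, 980, 1010, 990, 1005, 995, 1015, 985, 1000] * 3

print(volume_within_band(history, 1008))  # True  -> within band
print(volume_within_band(history, 1500))  # False -> outside band, alert
```

A moving band like this adapts to gradual growth, but it will also quietly absorb a slow leak; pairing it with an absolute floor (e.g. "never fewer than N rows") is a common safeguard.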

Pipeline SLI Architecture

┌───────────────┐     ┌───────────────┐     ┌───────────────┐
│ Source        │────>│ Pipeline      │────>│ Destination   │
│ Systems       │     │ (ETL/ELT)     │     │ (Warehouse)   │
└───────┬───────┘     └───────┬───────┘     └───────┬───────┘
        │                     │                     │
        ▼                     ▼                     ▼
┌───────────────┐     ┌───────────────┐     ┌───────────────┐
│ Extraction    │     │ Transform     │     │ Load          │
│ SLIs:         │     │ SLIs:         │     │ SLIs:         │
│ - Latency     │     │ - Duration    │     │ - Freshness   │
│ - Error rate  │     │ - Row delta   │     │ - Volume      │
│ - Throughput  │     │ - Schema      │     │ - Distribution│
└───────────────┘     │   drift       │     │ - Quality     │
                      └───────────────┘     │   score       │
                                            └───────────────┘
                              │
                              ▼
                      ┌───────────────┐
                      │ Observability │
                      │ Platform      │
                      │ - Metrics     │
                      │ - Alerts      │
                      │ - Lineage     │
                      │ - Incidents   │
                      └───────────────┘
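The per-stage SLIs in the diagram can be captured by wrapping each stage in a small instrumentation layer. A sketch, assuming row-oriented stages (the `StageMetrics` fields and `run_stage` helper are hypothetical, not a real orchestrator API):

```python
import time
from dataclasses import dataclass

@dataclass
class StageMetrics:
    """Per-stage SLIs from the diagram (illustrative field names)."""
    stage: str
    rows_in: int = 0
    rows_out: int = 0
    errors: int = 0
    duration_s: float = 0.0

    @property
    def error_rate(self) -> float:
        return self.errors / max(self.rows_in, 1)

    @property
    def row_delta(self) -> int:
        """Transform SLI: rows lost or gained in this stage."""
        return self.rows_out - self.rows_in

def run_stage(name, rows, transform):
    """Run a stage row by row, recording duration, throughput, and errors."""
    m = StageMetrics(stage=name, rows_in=len(rows))
    start = time.perf_counter()
    out = []
    for row in rows:
        try:
            out.append(transform(row))
        except Exception:
            m.errors += 1  # bad row: count it, keep going
    m.duration_s = time.perf_counter() - start
    m.rows_out = len(out)
    return out, m

rows = [{"amount": "10"}, {"amount": "x"}, {"amount": "30"}]
out, metrics = run_stage("transform", rows, lambda r: {"amount": int(r["amount"])})
print(metrics.errors, metrics.row_delta)  # one bad row dropped
```

In a real deployment these metrics would be pushed to the observability platform at the bottom of the diagram (e.g. as Prometheus gauges or orchestrator task metadata) rather than printed.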

Tool Comparison

| Capability | Monte Carlo | Bigeye | Soda | Datafold | Great Expectations | Elementary |
|---|---|---|---|---|---|---|
| Approach | ML-based anomaly | Metric-based rules | Check-as-code | Diff-based | Test-as-code | dbt-native |
| Deployment | SaaS only | SaaS + agent | OSS + Cloud | SaaS + OSS | OSS + Cloud | OSS + Cloud |
| Warehouse support | Broad | Broad | Broad | Snowflake, BQ, Databricks | Broad | dbt projects |
| Schema monitoring | Automatic | Manual rules | Manual checks | Automatic diff | Manual assertions | Automatic |
| Anomaly detection | ML-driven | Statistical rules | Threshold-based | Diff comparison | Threshold-based | Statistical |
| Lineage | Automatic (query log) | Limited | None | Column-level | None | dbt lineage |
| Incident management | Built-in | Basic | None | None | None | Slack/email |
| Pricing model | Per table/month | Per metric | OSS + enterprise | Per repo | OSS + enterprise | OSS + cloud |
| Best for | Enterprise, hands-off | Metric-heavy orgs | dbt/code-first teams | CI/CD for data | Testing-focused teams | dbt shops |
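To make the "check-as-code" and "test-as-code" rows concrete: the core idea is that checks are declared as data and evaluated against profiled table statistics. This is a toy sketch in that spirit, not the real API of Soda or Great Expectations (the check schema and stats shape are invented):

```python
# Declarative checks: metric, column, and an upper bound (invented schema).
checks = [
    {"metric": "null_pct", "column": "email", "max": 0.5},
    {"metric": "duplicate_count", "column": "order_id", "max": 0},
]

def evaluate(stats: dict, checks: list[dict]) -> list[dict]:
    """Evaluate each declared check against profiled stats; one result per check."""
    results = []
    for c in checks:
        observed = stats[(c["metric"], c["column"])]
        results.append({**c, "observed": observed, "passed": observed <= c["max"]})
    return results

# Stats as a warehouse profiler might report them (fabricated numbers).
stats = {("null_pct", "email"): 0.2, ("duplicate_count", "order_id"): 3}

for r in evaluate(stats, checks):
    print(r["metric"], r["column"], "PASS" if r["passed"] else "FAIL")
```

The payoff of this style is that checks live in version control next to the pipeline code, so they are reviewed, diffed, and deployed like everything else.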

Incident Response for Data

| Phase | Software SRE | Data SRE Equivalent |
|---|---|---|
| Detect | Monitoring + alerting | Data quality monitors + freshness alerts |
| Triage | Severity classification | Impact assessment: which dashboards/models are affected? |
| Investigate | Logs, traces, metrics | Lineage traversal, schema diffs, volume analysis |
| Mitigate | Rollback, feature flag | Pause downstream pipelines, switch to last-known-good |
| Resolve | Fix + deploy | Fix source/transform, backfill, validate |
| Learn | Blameless postmortem | Data incident review: add monitors, update SLOs |
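The Triage and Investigate phases both reduce to a graph traversal over lineage: start at the broken table and walk downstream to find every affected model and dashboard. A minimal sketch with a toy lineage graph (all table names are hypothetical):

```python
from collections import deque

# Toy lineage graph: table -> direct downstream consumers (hypothetical names).
lineage = {
    "raw.orders": ["staging.orders"],
    "staging.orders": ["marts.revenue", "marts.customer_ltv"],
    "marts.revenue": ["dashboard.exec_kpis"],
    "marts.customer_ltv": [],
    "dashboard.exec_kpis": [],
}

def downstream_impact(root: str, graph: dict) -> set:
    """BFS from the broken table; return everything downstream of it."""
    affected, queue = set(), deque([root])
    while queue:
        node = queue.popleft()
        for child in graph.get(node, []):
            if child not in affected:
                affected.add(child)
                queue.append(child)
    return affected

print(sorted(downstream_impact("staging.orders", lineage)))
# ['dashboard.exec_kpis', 'marts.customer_ltv', 'marts.revenue']
```

Severity then falls out of the result: an incident whose blast radius includes an executive dashboard is triaged differently from one that only touches an unused staging table.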

Metric Framework: What to Measure

| Layer | Metric | Target | Tool |
|---|---|---|---|
| Infrastructure | Pipeline success rate | > 99.5% | Airflow/Dagster metrics |
| Infrastructure | Pipeline duration p95 | < 2x baseline | Orchestrator + Prometheus |
| Data quality | Freshness SLO adherence | > 99% | Data observability platform |
| Data quality | Quality score (composite) | > 95% | Custom or vendor |
| Business impact | Dashboard staleness | < 30 min | BI tool API |
| Business impact | Data-related support tickets | Trending down | Ticketing system |
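The composite quality score in the table has to be defined somewhere; one common shape is a weighted average of per-pillar SLO pass rates. A sketch, where the pillar weights and pass rates are assumptions chosen for illustration:

```python
# Assumed pillar weights for the composite score (must sum to 1.0).
WEIGHTS = {"freshness": 0.3, "completeness": 0.3, "uniqueness": 0.2, "accuracy": 0.2}

def quality_score(pass_rates: dict) -> float:
    """Weighted composite of per-pillar SLO pass rates, on a 0-100 scale."""
    return 100 * sum(WEIGHTS[p] * pass_rates[p] for p in WEIGHTS)

# Fabricated pass rates over the measurement window.
pass_rates = {"freshness": 0.99, "completeness": 1.0, "uniqueness": 1.0, "accuracy": 0.97}

print(round(quality_score(pass_rates), 1))  # 99.1 -> above the 95% target
```

Whatever the exact formula, the important property is that it is stable and documented: a score only drives behavior if people trust that a drop from 99 to 94 means the same thing every time.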

Resources