Software engineering built a mature reliability discipline years ago with SRE practices: SLIs, SLOs, incident response, and blameless postmortems. Data engineering is only now catching up. Data downtime -- periods when data is missing, inaccurate, or stale -- costs organizations dearly in bad decisions and eroded trust. This post maps SRE observability principles onto the data domain.
## The Five Pillars of Data Observability
```
Data Observability Pillars
│
├── 1. Freshness
│   └── Is the data arriving on time?
│       ├── SLI: Time since last update
│       └── SLO: Table X updated within 2 hours of source event
│
├── 2. Volume
│   └── Is the expected amount of data present?
│       ├── SLI: Row count delta vs. expected
│       └── SLO: Daily row count within 10% of 7-day moving average
│
├── 3. Schema
│   └── Has the structure changed unexpectedly?
│       ├── SLI: Schema diff count per deployment
│       └── SLO: Zero unannounced breaking schema changes
│
├── 4. Distribution
│   └── Are values within expected ranges?
│       ├── SLI: Percentage of values outside learned bounds
│       └── SLO: < 0.1% of values flagged as anomalous
│
└── 5. Lineage
    └── Can we trace data from source to consumption?
        ├── SLI: Percentage of tables with complete lineage
        └── SLO: 100% of critical-path tables have lineage
```
## Data SLO Framework
| SLO Category | Example SLO | Measurement Method | Alert Threshold |
|---|---|---|---|
| Freshness | Orders table updated within 1h of event | Last modified timestamp | > 1.5h since last update |
| Completeness | < 0.5% null values in required fields | Null count / total rows | > 0.5% nulls |
| Uniqueness | 0 duplicate primary keys | Duplicate count query | > 0 duplicates |
| Accuracy | Revenue totals match source within 0.01% | Cross-system reconciliation | > 0.01% delta |
| Volume | Daily row count within 2 sigma of 30-day avg | Statistical comparison | Outside 2 sigma |
| Schema stability | Zero breaking changes without 7-day notice | Schema diff monitoring | Any breaking change |
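The volume row above ("daily row count within 2 sigma of 30-day avg") can be sketched with the standard library alone. The row counts here are illustrative; in practice `history` would come from a metadata query against the warehouse:

```python
from statistics import mean, stdev

def volume_within_sigma(history: list[int], today: int,
                        n_sigma: float = 2.0) -> bool:
    """SLI: today's row count vs. the trailing window.
    SLO: stay within n_sigma standard deviations of the window mean."""
    mu = mean(history)
    sigma = stdev(history)
    return abs(today - mu) <= n_sigma * sigma

# Example: 30 days of row counts hovering around 100k rows/day.
history = [98_000, 100_000, 102_000, 99_000, 101_000] * 6
print(volume_within_sigma(history, 100_500))  # True: well inside 2 sigma
print(volume_within_sigma(history, 120_000))  # False: ~14 sigma out
```

Note the statistical approach degrades on low-variance tables (sigma near zero makes the band vanishingly tight), which is why many teams pair it with an absolute floor.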
## Pipeline SLI Architecture
```
┌────────────────┐     ┌────────────────┐     ┌────────────────┐
│     Source     │────>│    Pipeline    │────>│  Destination   │
│    Systems     │     │   (ETL/ELT)    │     │  (Warehouse)   │
└───────┬────────┘     └───────┬────────┘     └───────┬────────┘
        │                      │                      │
        ▼                      ▼                      ▼
┌────────────────┐     ┌────────────────┐     ┌────────────────┐
│ Extraction     │     │ Transform      │     │ Load           │
│ SLIs:          │     │ SLIs:          │     │ SLIs:          │
│ - Latency      │     │ - Duration     │     │ - Freshness    │
│ - Error rate   │     │ - Row delta    │     │ - Volume       │
│ - Throughput   │     │ - Schema drift │     │ - Distribution │
│                │     │                │     │ - Quality score│
└───────┬────────┘     └───────┬────────┘     └───────┬────────┘
        │                      │                      │
        └──────────────────────┼──────────────────────┘
                               │
                               ▼
                      ┌────────────────┐
                      │ Observability  │
                      │    Platform    │
                      │ - Metrics      │
                      │ - Alerts       │
                      │ - Lineage      │
                      │ - Incidents    │
                      └────────────────┘
```
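One way to wire the stage SLIs into the observability platform is to have each stage emit a flat metrics payload. A sketch, with illustrative stage and metric names (the `pipeline.<stage>.<metric>` naming is an assumption, not a standard):

```python
from dataclasses import dataclass

@dataclass
class StageMetrics:
    """SLIs captured for one pipeline stage (names are illustrative)."""
    stage: str          # e.g. "extract", "transform", or "load"
    duration_s: float   # wall-clock runtime of the stage
    rows_in: int
    rows_out: int
    errors: int = 0

    @property
    def error_rate(self) -> float:
        return self.errors / self.rows_in if self.rows_in else 0.0

    @property
    def row_delta(self) -> float:
        """Fractional change in row count across the stage."""
        return (self.rows_out - self.rows_in) / self.rows_in if self.rows_in else 0.0

def to_metrics_payload(m: StageMetrics) -> dict[str, float]:
    """Flatten stage SLIs into key/value metrics for the platform."""
    return {
        f"pipeline.{m.stage}.duration_s": m.duration_s,
        f"pipeline.{m.stage}.error_rate": m.error_rate,
        f"pipeline.{m.stage}.row_delta": m.row_delta,
    }

m = StageMetrics("transform", duration_s=12.5, rows_in=1000, rows_out=950, errors=10)
print(to_metrics_payload(m))
```

Keeping the payload flat makes it trivial to forward to Prometheus, StatsD, or a vendor API without the pipeline code knowing which backend is in use.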
## Tool Comparison
| Capability | Monte Carlo | Bigeye | Soda | Datafold | Great Expectations | Elementary |
|---|---|---|---|---|---|---|
| Approach | ML-based anomaly | Metric-based rules | Check-as-code | Diff-based | Test-as-code | dbt-native |
| Deployment | SaaS only | SaaS + agent | OSS + Cloud | SaaS + OSS | OSS + Cloud | OSS + Cloud |
| Warehouse support | Broad | Broad | Broad | Snowflake, BQ, Databricks | Broad | dbt projects |
| Schema monitoring | Automatic | Manual rules | Manual checks | Automatic diff | Manual assertions | Automatic |
| Anomaly detection | ML-driven | Statistical rules | Threshold-based | Diff comparison | Threshold-based | Statistical |
| Lineage | Automatic (query log) | Limited | None | Column-level | None | dbt lineage |
| Incident management | Built-in | Basic | None | None | None | Slack/email |
| Pricing model | Per table/month | Per metric | OSS + enterprise | Per repo | OSS + enterprise | OSS + cloud |
| Best for | Enterprise, hands-off | Metric-heavy orgs | dbt/code-first teams | CI/CD for data | Testing-focused teams | dbt shops |
## Incident Response for Data
| Phase | Software SRE | Data SRE Equivalent |
|---|---|---|
| Detect | Monitoring + alerting | Data quality monitors + freshness alerts |
| Triage | Severity classification | Impact assessment: which dashboards/models affected? |
| Investigate | Logs, traces, metrics | Lineage traversal, schema diffs, volume analysis |
| Mitigate | Rollback, feature flag | Pause downstream pipelines, switch to last-known-good |
| Resolve | Fix + deploy | Fix source/transform, backfill, validate |
| Learn | Blameless postmortem | Data incident review: add monitors, update SLOs |
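The investigate and triage phases lean on lineage traversal: given a broken table, walk the dependency graph downstream to find every affected asset. A minimal sketch as a breadth-first search over an adjacency-list lineage graph (table names are hypothetical):

```python
from collections import deque

def affected_downstream(lineage: dict[str, list[str]], broken: str) -> set[str]:
    """Triage helper: BFS over a table -> downstream-tables graph to find
    every asset impacted by an incident at `broken`."""
    seen: set[str] = set()
    queue = deque([broken])
    while queue:
        node = queue.popleft()
        for child in lineage.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen

# Example lineage: raw -> staging -> marts -> dashboards.
lineage = {
    "raw.orders": ["stg.orders"],
    "stg.orders": ["mart.revenue", "mart.orders_daily"],
    "mart.revenue": ["dash.exec_kpis"],
}
print(sorted(affected_downstream(lineage, "raw.orders")))
# ['dash.exec_kpis', 'mart.orders_daily', 'mart.revenue', 'stg.orders']
```

The resulting set answers the triage question directly ("which dashboards/models are affected?") and doubles as the list of pipelines to pause during mitigation.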
## Metric Framework: What to Measure
| Layer | Metric | Target | Tool |
|---|---|---|---|
| Infrastructure | Pipeline success rate | > 99.5% | Airflow/Dagster metrics |
| Infrastructure | Pipeline duration p95 | < 2x baseline | Orchestrator + Prometheus |
| Data quality | Freshness SLO adherence | > 99% | Data observability platform |
| Data quality | Quality score (composite) | > 95% | Custom or vendor |
| Business impact | Dashboard staleness | < 30 min | BI tool API |
| Business impact | Data-related support tickets | Trending down | Ticketing system |
## Resources