# Data Quality: Dimensions, Frameworks, and Observability

#data-quality #data-engineering #data-governance #testing
## Why Data Quality Is a Strategic Issue
Bad data costs money: Gartner estimates that poor data quality costs organizations an average of $12.9 million per year. But the deeper cost is lost trust. Once stakeholders stop trusting dashboards, they revert to gut-feel decisions and spreadsheets.
Data quality is not a one-time fix. It is a continuous discipline.
## The Six Dimensions of Data Quality
| Dimension | Definition | Example Check |
|---|---|---|
| Accuracy | Data correctly represents the real-world entity | Customer email matches verified source |
| Completeness | Required fields are populated | No null values in mandatory columns |
| Consistency | Same data across systems agrees | Order total in billing = order total in analytics |
| Timeliness | Data arrives when expected | Daily pipeline completes before 8am |
| Uniqueness | No unintended duplicates | One row per customer per day |
| Validity | Data conforms to defined rules/formats | Dates are in ISO 8601, status in allowed enum |
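As a concrete illustration, several of the checks in the table above can be sketched as plain Python over an in-memory batch. The rows, field names, and allowed enum below are hypothetical, chosen only to make the dimensions tangible:

```python
from datetime import date

# Hypothetical batch of customer rows used to illustrate the checks.
rows = [
    {"customer_id": 1, "email": "a@example.com", "status": "active", "signup": "2024-01-05"},
    {"customer_id": 2, "email": None, "status": "active", "signup": "2024-01-06"},
    {"customer_id": 2, "email": "b@example.com", "status": "paused", "signup": "2024-01-06"},
]

ALLOWED_STATUS = {"active", "paused", "churned"}  # validity: allowed enum

def completeness(rows, field):
    """Fraction of rows where a required field is populated."""
    return sum(r[field] is not None for r in rows) / len(rows)

def uniqueness(rows, key):
    """True when no key value appears more than once."""
    keys = [r[key] for r in rows]
    return len(keys) == len(set(keys))

def validity(rows):
    """Status must be in the enum; signup must parse as ISO 8601."""
    for r in rows:
        if r["status"] not in ALLOWED_STATUS:
            return False
        date.fromisoformat(r["signup"])  # raises ValueError if not ISO 8601
    return True

print(round(completeness(rows, "email"), 2))  # 0.67 (one null email)
print(uniqueness(rows, "customer_id"))        # False (customer 2 duplicated)
print(validity(rows))                         # True
```

In practice these assertions live in a testing framework rather than ad-hoc functions, but the underlying logic per dimension is this simple.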
## Testing vs. Observability
These are complementary, not competing approaches:
| Aspect | Data Testing | Data Observability |
|---|---|---|
| Approach | Define explicit assertions | Monitor for anomalies automatically |
| When | During pipeline execution | Continuously, including at rest |
| What it catches | Known failure modes | Unknown unknowns (drift, distribution shifts) |
| Effort | Requires writing tests | Requires configuring monitors |
| Examples | dbt tests, Great Expectations | Monte Carlo, Soda, Anomalo |
A mature data quality strategy uses both: tests for known invariants, observability for unexpected changes.
## Framework Comparison
| Tool | Type | Best For | Integration |
|---|---|---|---|
| Great Expectations | Open source testing | Python-native teams, batch pipelines | Airflow, Spark, pandas |
| Soda | Open source + commercial | SQL-first teams, simple checks | Warehouse-native, Airflow |
| dbt tests | Built into dbt | Teams already using dbt | dbt projects |
| Monte Carlo | Commercial observability | End-to-end monitoring, incident management | Broad warehouse/BI integration |
| Anomalo | Commercial observability | Automated anomaly detection | Warehouse-native |
| Elementary | Open source (dbt-native) | dbt shops wanting observability | dbt projects |
## Building a Data Quality Strategy

### Layer 1: Schema validation
- Enforce schemas at ingestion (schema registries, table formats)
- Catch structural breaks before they propagate
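A minimal ingestion-time schema check might look like the following sketch. The expected schema and the sample records are illustrative, not taken from any real registry or table format:

```python
# Minimal schema check at ingestion: expected column names and Python types.
# EXPECTED_SCHEMA and the records below are hypothetical examples.
EXPECTED_SCHEMA = {"order_id": int, "amount": float, "currency": str}

def validate_record(record, schema=EXPECTED_SCHEMA):
    """Reject records with missing, extra, or mistyped fields."""
    errors = []
    for field, typ in schema.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], typ):
            errors.append(f"bad type for {field}: {type(record[field]).__name__}")
    for field in record:
        if field not in schema:
            errors.append(f"unexpected field: {field}")
    return errors

print(validate_record({"order_id": 7, "amount": 19.99, "currency": "EUR"}))  # []
print(validate_record({"order_id": "7", "amount": 19.99, "note": "gift"}))
```

Production systems delegate this to a schema registry or table format (Avro, Protobuf, Iceberg), but the failure modes caught are the same: missing, extra, and mistyped fields.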
### Layer 2: Business rule testing
- Not-null checks on required fields
- Referential integrity (foreign keys exist in parent table)
- Range checks (age > 0, price >= 0)
- Uniqueness constraints
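The referential-integrity and range checks above can be sketched over in-memory rows like this; the table names and fields are hypothetical:

```python
# Illustrative business-rule checks; "customers" stands in for the parent table.
customers = {101, 102, 103}  # parent table keys
orders = [
    {"order_id": 1, "customer_id": 101, "price": 25.0},
    {"order_id": 2, "customer_id": 999, "price": 10.0},   # orphan foreign key
    {"order_id": 3, "customer_id": 102, "price": -5.0},   # invalid price
]

def referential_integrity(orders, parent_keys):
    """Orders whose customer_id has no row in the parent table."""
    return [o["order_id"] for o in orders if o["customer_id"] not in parent_keys]

def range_violations(orders):
    """Orders violating the rule price >= 0."""
    return [o["order_id"] for o in orders if o["price"] < 0]

print(referential_integrity(orders, customers))  # [2]
print(range_violations(orders))                  # [3]
```

In a warehouse these checks are usually expressed as SQL (an anti-join for orphans, a WHERE clause for ranges), but the assertions are identical.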
### Layer 3: Statistical monitoring
- Volume anomalies (row count deviates from expected range)
- Distribution drift (column statistics shift beyond threshold)
- Freshness monitoring (data arrives on schedule)
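A volume-anomaly monitor is little more than a z-score over recent history. This sketch uses hypothetical daily row counts and a two-standard-deviation threshold:

```python
from statistics import mean, stdev

# Hypothetical daily row counts for the trailing 30 days.
history = [1000, 1020, 980, 1010, 995, 1005, 990, 1015, 1000, 1008,
           1002, 998, 1012, 985, 1003, 1007, 996, 1011, 999, 1004,
           1001, 994, 1009, 1006, 997, 1013, 992, 1000, 1005, 998]

def volume_anomaly(today, history, threshold=2.0):
    """Flag today's row count if it deviates more than `threshold`
    standard deviations from the historical mean."""
    mu, sigma = mean(history), stdev(history)
    z = abs(today - mu) / sigma
    return z > threshold

print(volume_anomaly(1004, history))  # False: within normal range
print(volume_anomaly(600, history))   # True: major volume drop
```

Observability tools apply the same idea with smarter baselines (seasonality, trend), which is why they catch drift that a fixed assertion would miss.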
### Layer 4: Cross-system reconciliation
- Source-to-target row count matching
- Aggregate value reconciliation between systems
- End-to-end lineage-aware checks
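Source-to-target reconciliation reduces to comparing counts and aggregates within a tolerance. In this sketch both "systems" are plain dicts with made-up numbers:

```python
# Source-to-target reconciliation sketch; the metrics are hypothetical.
source = {"row_count": 5000, "revenue_sum": 124_930.50}
target = {"row_count": 4998, "revenue_sum": 123_404.00}

def reconcile(source, target, tolerance=0.001):
    """Compare counts and aggregates; allow a small relative tolerance
    for late-arriving or in-flight records."""
    report = {}
    for metric in source:
        diff = abs(source[metric] - target[metric])
        report[metric] = diff / source[metric] <= tolerance
    return report

print(reconcile(source, target))  # row counts agree, revenue does not
```

The tolerance matters: exact matching across systems produces constant false alarms, while too loose a tolerance hides real loss. Tune it per metric.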
## Data Quality SLOs
Treat data quality like service reliability. Define SLOs:
- Freshness: table updated within 2 hours of source
- Completeness: critical columns have < 0.1% null rate
- Volume: daily row count within 2 standard deviations of 30-day mean
- Schema: zero unexpected schema changes without review
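The first three SLOs above can be evaluated mechanically from table metadata. This sketch uses a hypothetical snapshot; real systems would pull these values from warehouse metadata or a monitoring store:

```python
from datetime import datetime, timedelta
from statistics import mean, stdev

# Hypothetical snapshot of table metadata used to evaluate the SLOs.
now = datetime(2024, 6, 1, 9, 0)
snapshot = {
    "last_updated": datetime(2024, 6, 1, 7, 30),
    "null_rate": 0.0005,  # 0.05% nulls in critical columns
    "row_count": 1010,
    "row_count_history": [1000, 1020, 980, 1010, 995, 1005, 990, 1015],
}

def slo_report(s, now):
    mu, sigma = mean(s["row_count_history"]), stdev(s["row_count_history"])
    return {
        "freshness": now - s["last_updated"] <= timedelta(hours=2),    # < 2h stale
        "completeness": s["null_rate"] < 0.001,                        # < 0.1% nulls
        "volume": abs(s["row_count"] - mu) <= 2 * sigma,               # within 2 SD
    }

print(slo_report(snapshot, now))
```

Emitting this report on every pipeline run gives you the time series needed to track compliance.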
Track SLO compliance over time. Report on it. Make it visible.
## Common Mistakes
- Testing only in development, not in production
- Writing hundreds of tests but never reviewing failures (alert fatigue)
- Treating data quality as the data team's problem (producers must own quality)
- Not connecting quality issues to business impact