
Data Quality: Dimensions, Frameworks, and Observability

#data-quality #data-engineering #data-governance #testing

Why Data Quality Is a Strategic Issue

Bad data costs money. Gartner estimates that poor data quality costs organizations an average of $12.9 million per year. But the real cost is lost trust: once stakeholders stop trusting dashboards, they revert to gut decisions and spreadsheets.

Data quality is not a one-time fix. It is a continuous discipline.

The Six Dimensions of Data Quality

| Dimension | Definition | Example Check |
|---|---|---|
| Accuracy | Data correctly represents the real-world entity | Customer email matches verified source |
| Completeness | Required fields are populated | No null values in mandatory columns |
| Consistency | Same data across systems agrees | Order total in billing = order total in analytics |
| Timeliness | Data arrives when expected | Daily pipeline completes before 8am |
| Uniqueness | No unintended duplicates | One row per customer per day |
| Validity | Data conforms to defined rules/formats | Dates are in ISO 8601, status in allowed enum |
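Most of these dimensions reduce to simple predicates over field values. A minimal sketch in Python for the validity dimension (the `status` enum and its members are illustrative assumptions, not from any specific system):

```python
from datetime import date

# Illustrative allowed enum for an order status column
ALLOWED_STATUSES = {"pending", "shipped", "delivered"}

def is_valid_iso_date(value: str) -> bool:
    """Validity: the date string conforms to ISO 8601 (YYYY-MM-DD)."""
    try:
        date.fromisoformat(value)
        return True
    except ValueError:
        return False

def is_valid_status(value: str) -> bool:
    """Validity: the status is a member of the allowed enum."""
    return value in ALLOWED_STATUSES

print(is_valid_iso_date("2024-03-15"))  # True
print(is_valid_iso_date("15/03/2024"))  # False
print(is_valid_status("shipped"))       # True
```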

Testing vs Observability

These are complementary, not competing approaches:

| Aspect | Data Testing | Data Observability |
|---|---|---|
| Approach | Define explicit assertions | Monitor for anomalies automatically |
| When | During pipeline execution | Continuously, including at rest |
| What it catches | Known failure modes | Unknown unknowns (drift, distribution shifts) |
| Effort | Requires writing tests | Requires configuring monitors |
| Examples | dbt tests, Great Expectations | Monte Carlo, Soda, Anomalo |

A mature data quality strategy uses both: tests for known invariants, observability for unexpected changes.

Framework Comparison

| Tool | Type | Best For | Integration |
|---|---|---|---|
| Great Expectations | Open source testing | Python-native teams, batch pipelines | Airflow, Spark, pandas |
| Soda | Open source + commercial | SQL-first teams, simple checks | Warehouse-native, Airflow |
| dbt tests | Built into dbt | Teams already using dbt | dbt projects |
| Monte Carlo | Commercial observability | End-to-end monitoring, incident management | Broad warehouse/BI integration |
| Anomalo | Commercial observability | Automated anomaly detection | Warehouse-native |
| Elementary | Open source (dbt-native) | dbt shops wanting observability | dbt projects |

Building a Data Quality Strategy

Layer 1: Schema validation

  • Enforce schemas at ingestion (schema registries, table formats)
  • Catch structural breaks before they propagate
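Schema enforcement at ingestion can be as simple as checking each incoming record against an expected field/type map before it lands. A minimal sketch (the schema and field names are illustrative assumptions; in production this role is usually played by a schema registry or table format):

```python
# Illustrative expected schema for an incoming order record
EXPECTED_SCHEMA = {"order_id": int, "customer_id": int, "total": float}

def validate_schema(record: dict) -> list[str]:
    """Return a list of schema violations for one incoming record."""
    errors = []
    for field, expected_type in EXPECTED_SCHEMA.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            errors.append(f"wrong type for {field}: {type(record[field]).__name__}")
    for field in record:
        if field not in EXPECTED_SCHEMA:
            # A new, unexpected column is a structural break worth flagging
            errors.append(f"unexpected field: {field}")
    return errors

print(validate_schema({"order_id": 1, "customer_id": 7, "total": 19.99}))  # []
print(validate_schema({"order_id": "1", "total": 19.99}))
```

Rejecting (or quarantining) records that fail this gate keeps structural breaks from propagating downstream.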

Layer 2: Business rule testing

  • Not-null checks on required fields
  • Referential integrity (foreign keys exist in parent table)
  • Range checks (age > 0, price >= 0)
  • Uniqueness constraints
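The four rule types above can be sketched as batch checks in Python (column names and the sample rows are illustrative; in practice these are typically dbt tests or warehouse constraints):

```python
def check_business_rules(rows, valid_customer_ids):
    """Run Layer-2 checks on a batch of order rows; return (rule, row) failures."""
    failures = []
    seen_ids = set()
    for row in rows:
        if row.get("order_id") is None:
            failures.append(("not_null", row))            # required field populated
            continue
        if row["order_id"] in seen_ids:
            failures.append(("uniqueness", row))          # unintended duplicate
        seen_ids.add(row["order_id"])
        if row.get("customer_id") not in valid_customer_ids:
            failures.append(("referential_integrity", row))  # FK exists in parent
        if row.get("price", 0) < 0:
            failures.append(("range", row))               # price >= 0
    return failures

rows = [
    {"order_id": 1, "customer_id": 10, "price": 5.0},
    {"order_id": 1, "customer_id": 10, "price": 2.0},    # duplicate order_id
    {"order_id": 2, "customer_id": 99, "price": -1.0},   # unknown customer, negative price
]
print(check_business_rules(rows, valid_customer_ids={10, 20}))
```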

Layer 3: Statistical monitoring

  • Volume anomalies (row count deviates from expected range)
  • Distribution drift (column statistics shift beyond threshold)
  • Freshness monitoring (data arrives on schedule)
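Volume anomaly detection is often just a z-score against a rolling window. A minimal sketch using the standard library (the 2-sigma threshold mirrors the SLO example later in this post; window size and counts are illustrative):

```python
import statistics

def volume_anomaly(daily_counts, today_count, z_threshold=2.0):
    """Flag today's row count if it deviates from the window mean
    by more than z_threshold sample standard deviations."""
    mean = statistics.mean(daily_counts)
    stdev = statistics.stdev(daily_counts)
    z = abs(today_count - mean) / stdev
    return z > z_threshold

history = [1000, 1020, 980, 1010, 990, 1005, 995]  # recent daily row counts
print(volume_anomaly(history, 1008))  # within normal range -> False
print(volume_anomaly(history, 400))   # sharp drop -> True
```

Observability tools apply the same idea automatically across many tables and metrics, which is what makes them effective against unknown unknowns.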

Layer 4: Cross-system reconciliation

  • Source-to-target row count matching
  • Aggregate value reconciliation between systems
  • End-to-end lineage-aware checks
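Source-to-target reconciliation boils down to comparing counts and aggregates between two systems. A minimal sketch (key name, tolerance, and the billing/analytics rows are illustrative assumptions):

```python
def reconcile(source_rows, target_rows, amount_key="order_total", tolerance=0.01):
    """Compare row counts and an aggregate value between two systems."""
    issues = []
    if len(source_rows) != len(target_rows):
        issues.append(f"row count mismatch: {len(source_rows)} vs {len(target_rows)}")
    src_sum = sum(r[amount_key] for r in source_rows)
    tgt_sum = sum(r[amount_key] for r in target_rows)
    if abs(src_sum - tgt_sum) > tolerance:
        issues.append(f"aggregate mismatch: {src_sum} vs {tgt_sum}")
    return issues

billing = [{"order_total": 10.0}, {"order_total": 20.0}]
analytics = [{"order_total": 10.0}, {"order_total": 25.0}]
print(reconcile(billing, analytics))  # aggregate mismatch
```

This is also the check that catches the consistency failure from the dimensions table: billing and analytics disagreeing on order totals.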

Data Quality SLOs

Treat data quality like service reliability. Define SLOs:

  • Freshness: table updated within 2 hours of source
  • Completeness: critical columns have < 0.1% null rate
  • Volume: daily row count within 2 standard deviations of 30-day mean
  • Schema: zero unexpected schema changes without review
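The freshness and completeness SLOs above translate directly into code. A minimal sketch (timestamps and the sample column are illustrative; the thresholds come from the SLO definitions):

```python
from datetime import datetime, timedelta

def freshness_slo_met(last_updated, now, max_lag=timedelta(hours=2)):
    """Freshness SLO: table updated within 2 hours of source."""
    return now - last_updated <= max_lag

def completeness_slo_met(values, max_null_rate=0.001):
    """Completeness SLO: critical column has < 0.1% null rate."""
    nulls = sum(1 for v in values if v is None)
    return nulls / len(values) < max_null_rate

now = datetime(2024, 3, 15, 9, 0)
print(freshness_slo_met(datetime(2024, 3, 15, 8, 0), now))  # 1h lag -> True
print(completeness_slo_met([1] * 999 + [None]))  # exactly 0.1% -> False
```

Evaluating these on a schedule and logging the pass/fail results over time is what makes SLO compliance reportable.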

Track SLO compliance over time. Report on it. Make it visible.

Common Mistakes

  • Testing only in development, not in production
  • Writing hundreds of tests but never reviewing failures (alert fatigue)
  • Treating data quality as the data team's problem (producers must own quality)
  • Not connecting quality issues to business impact

Resources