Data Lineage & Data Observability: From Source to Consumption
Why Lineage Matters
When a dashboard shows wrong numbers, the first question is always: "Where does this data come from?" Without lineage, answering that question requires manually tracing through pipelines, scripts, and transformations. This can take hours or days.
Data lineage maps the journey of data from source to destination, including every transformation along the way.
Levels of Lineage
| Level | What It Tracks | Value |
|---|---|---|
| Table-level | Which tables feed which tables | Basic impact analysis |
| Column-level | Which columns derive from which columns | Precise root cause analysis |
| Row-level | Which specific records contributed to an output | Audit and compliance |
| Business-level | How business metrics relate to underlying data | Executive trust |
Most organizations start with table-level lineage and progress to column-level. Row-level lineage is typically reserved for regulatory requirements.
Lineage Collection Methods
| Method | How It Works | Pros | Cons |
|---|---|---|---|
| SQL parsing | Analyze SQL statements to extract dependencies | Works retroactively, no code changes | Limited to SQL workloads; dynamic or templated SQL is hard to parse |
| API/Hook integration | Orchestrators and engines emit lineage events | Real-time, accurate | Requires integration per tool |
| OpenLineage standard | Common spec for lineage events across tools | Vendor-neutral, composable | Adoption still growing |
| Manual annotation | Engineers document dependencies | Works for any system | Quickly outdated |
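The SQL-parsing method can be sketched in a few lines. This is a deliberately naive regex-based sketch: production tools use a full SQL parser (e.g., sqlglot) to handle CTEs, subqueries, quoting, and dialect differences, none of which this handles. The table names are hypothetical.

```python
import re

def extract_dependencies(sql: str) -> dict:
    """Naive table-level lineage from a single SQL statement.

    Real lineage tools use a full SQL parser; this regex sketch only
    handles simple, unquoted identifiers.
    """
    # The written-to table appears after INSERT INTO / CREATE TABLE.
    target = re.search(
        r"(?:INSERT\s+INTO|CREATE\s+TABLE)\s+([\w.]+)", sql, re.IGNORECASE
    )
    # Source tables appear after FROM / JOIN keywords.
    sources = re.findall(r"(?:FROM|JOIN)\s+([\w.]+)", sql, re.IGNORECASE)
    return {
        "output": target.group(1) if target else None,
        "inputs": sorted(set(sources)),
    }

deps = extract_dependencies(
    "INSERT INTO mart.daily_revenue "
    "SELECT o.day, SUM(o.amount) FROM raw.orders o "
    "JOIN raw.customers c ON o.customer_id = c.id "
    "GROUP BY o.day"
)
print(deps)  # {'output': 'mart.daily_revenue', 'inputs': ['raw.customers', 'raw.orders']}
```

Because this works on SQL text alone, it can be run retroactively over a query log or a dbt project without touching the pipelines themselves.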
The OpenLineage Ecosystem
OpenLineage is an open standard for lineage event collection. Key components:
- Producers: Airflow, Spark, dbt, and Flink emit OpenLineage events
- Transport: Events sent via HTTP, Kafka, or file
- Consumers: Marquez, DataHub, Atlan ingest and display lineage
This decouples lineage collection from lineage visualization. You can switch catalogs without re-instrumenting pipelines.
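A producer's output is just a JSON run event. The sketch below builds a minimal OpenLineage-style event; the top-level field names follow the OpenLineage RunEvent spec, but facets are omitted, and the namespaces, producer URI, and endpoint comment are illustrative assumptions, not fixed values.

```python
import json
import uuid
from datetime import datetime, timezone

def make_run_event(job_name: str, inputs: list, outputs: list) -> dict:
    """Minimal OpenLineage-style COMPLETE event (facets omitted)."""
    return {
        "eventType": "COMPLETE",
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": str(uuid.uuid4())},
        "job": {"namespace": "example_pipeline", "name": job_name},
        "inputs": [{"namespace": "warehouse", "name": n} for n in inputs],
        "outputs": [{"namespace": "warehouse", "name": n} for n in outputs],
        "producer": "https://example.com/lineage-demo",  # identifies the emitter
    }

event = make_run_event("daily_revenue", ["raw.orders"], ["mart.daily_revenue"])
payload = json.dumps(event)
# A consumer such as Marquez accepts events like this over HTTP,
# e.g. POST http://localhost:5000/api/v1/lineage on a default local install.
```

Because every producer emits the same shape, the consumer side (Marquez, DataHub, Atlan) can be swapped without changing the pipelines that emit events.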
Data Observability
Data observability extends monitoring beyond pipeline execution to the data itself. The five pillars:
- Freshness: Is the data up to date?
- Volume: Did the expected number of rows arrive?
- Schema: Did the structure change unexpectedly?
- Distribution: Are column values within expected ranges?
- Lineage: What upstream/downstream is affected?
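The freshness and volume pillars reduce to simple queries against the table being monitored. A minimal sketch, using an in-memory SQLite table as a stand-in for a warehouse table; the table name, `loaded_at` column, and thresholds are illustrative assumptions.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

# In-memory stand-in for a warehouse table with a load timestamp.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, loaded_at TEXT)")
now = datetime.now(timezone.utc)
rows = [(i, (now - timedelta(minutes=5)).isoformat()) for i in range(120)]
conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)

def check_freshness(conn, table: str, max_age: timedelta) -> bool:
    """Freshness pillar: the newest row must be younger than max_age."""
    (latest,) = conn.execute(f"SELECT MAX(loaded_at) FROM {table}").fetchone()
    return datetime.fromisoformat(latest) >= datetime.now(timezone.utc) - max_age

def check_volume(conn, table: str, min_rows: int) -> bool:
    """Volume pillar: at least min_rows must have arrived."""
    (count,) = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()
    return count >= min_rows

print(check_freshness(conn, "orders", timedelta(hours=1)))  # True
print(check_volume(conn, "orders", 100))                    # True
```

Schema and distribution checks follow the same pattern (compare `PRAGMA`/information-schema output and column statistics against expectations); the lineage pillar is what turns a failed check into an actionable incident, as the next section shows.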
Tool Landscape
| Tool | Focus | Type | Key Strength |
|---|---|---|---|
| OpenLineage | Lineage standard | Open standard | Vendor-neutral collection |
| Marquez | Lineage store | Open source | Reference OpenLineage backend |
| Monte Carlo | Observability | Commercial | End-to-end monitoring, incident mgmt |
| Atlan | Catalog + lineage | Commercial | Active metadata + lineage |
| DataHub | Catalog + lineage | Open source | Extensible, column-level lineage |
| Elementary | dbt observability | Open source | Deep dbt integration |
| Soda | Quality + observability | Open source + commercial | SQL-first checks |
Impact Analysis and Root Cause Analysis
Impact analysis (forward lineage): "If I change this table, what downstream dashboards, models, and reports break?"
Use cases:
- Schema migration planning
- Deprecating tables safely
- Understanding blast radius of changes
Root cause analysis (backward lineage): "This metric is wrong; which upstream pipeline or source caused it?"
Use cases:
- Incident triage
- Data quality debugging
- Audit trail for regulatory compliance
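Both questions are graph traversals over the same lineage graph: impact analysis walks downstream edges, root cause analysis walks upstream edges. A minimal sketch with a hypothetical four-table pipeline:

```python
from collections import defaultdict, deque

# Toy table-level lineage: edges point downstream (source -> derived).
edges = [
    ("raw.orders", "staging.orders"),
    ("raw.customers", "staging.orders"),
    ("staging.orders", "mart.daily_revenue"),
    ("mart.daily_revenue", "dashboard.revenue"),
]

downstream = defaultdict(set)
upstream = defaultdict(set)
for src, dst in edges:
    downstream[src].add(dst)
    upstream[dst].add(src)

def reachable(graph, start) -> set:
    """BFS over a lineage graph; the graph passed in picks the direction."""
    seen, queue = set(), deque([start])
    while queue:
        for nxt in graph[queue.popleft()]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

# Impact analysis: blast radius of changing raw.orders.
print(sorted(reachable(downstream, "raw.orders")))
# Root cause analysis: every upstream suspect for a bad dashboard.
print(sorted(reachable(upstream, "dashboard.revenue")))
```

Here changing `raw.orders` affects the staging table, the mart, and the dashboard, while a wrong `dashboard.revenue` number has four upstream suspects, including `raw.customers`.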
Implementation Roadmap
- Instrument orchestrators and transformation tools first: Airflow and dbt can emit lineage events with minimal effort
- Deploy a lineage store: Marquez or DataHub to collect and query lineage
- Add column-level lineage: Enable SQL parsing for warehouse transformations
- Connect to catalog: Lineage should be visible where people discover data
- Build observability monitors: Freshness, volume, and distribution checks on critical tables
- Integrate into incident workflow: Lineage-powered impact analysis in PagerDuty/Slack alerts
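For the first two steps, a local setup can be sketched as below. The environment variable names follow the openlineage-python client; exact configuration differs by integration and version, so treat this as an assumption to verify against the docs for your Airflow or dbt provider. The namespace value is illustrative.

```shell
# Point the OpenLineage client at a local Marquez instance
# (port 5000 is Marquez's default API port).
export OPENLINEAGE_URL=http://localhost:5000
export OPENLINEAGE_NAMESPACE=my_pipelines   # logical namespace for your jobs

# Install an emitter for your orchestrator, e.g.:
pip install openlineage-airflow   # or: apache-airflow-providers-openlineage
```

Once events flow, the later steps (column-level lineage, catalog integration, monitors, incident hooks) build on the same event stream rather than requiring new instrumentation.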