
Data Lineage & Data Observability: From Source to Consumption

#data-lineage #data-observability #data-engineering #governance

Why Lineage Matters

When a dashboard shows wrong numbers, the first question is always: "Where does this data come from?" Without lineage, answering that question requires manually tracing through pipelines, scripts, and transformations. This can take hours or days.

Data lineage maps the journey of data from source to destination, including every transformation along the way.

Levels of Lineage

| Level | What It Tracks | Value |
| --- | --- | --- |
| Table-level | Which tables feed which tables | Basic impact analysis |
| Column-level | Which columns derive from which columns | Precise root cause analysis |
| Row-level | Which specific records contributed to an output | Audit and compliance |
| Business-level | How business metrics relate to underlying data | Executive trust |

Most organizations start with table-level lineage and progress to column-level. Row-level lineage is typically reserved for regulatory requirements.

Lineage Collection Methods

| Method | How It Works | Pros | Cons |
| --- | --- | --- | --- |
| SQL parsing | Analyze SQL statements to extract dependencies | Works retroactively, no code changes | Limited to SQL workloads |
| API/hook integration | Orchestrators and engines emit lineage events | Real-time, accurate | Requires integration per tool |
| OpenLineage standard | Common spec for lineage events across tools | Vendor-neutral, composable | Adoption still growing |
| Manual annotation | Engineers document dependencies | Works for any system | Quickly outdated |
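The SQL-parsing approach can be sketched with a naive regex over FROM/JOIN clauses. This is only an illustration: production tools use a full SQL parser (such as sqlglot) to handle CTEs, subqueries, aliases, and quoting, and the table names below are made up.

```python
import re

def extract_dependencies(sql: str) -> set[str]:
    """Naively extract source tables from FROM and JOIN clauses.

    A real parser handles CTEs, subqueries, and quoted identifiers;
    this regex sketch only illustrates the idea.
    """
    pattern = re.compile(r"\b(?:FROM|JOIN)\s+([\w.]+)", re.IGNORECASE)
    return set(pattern.findall(sql))

sql = """
INSERT INTO mart.daily_revenue
SELECT o.order_date, SUM(o.amount)
FROM raw.orders o
JOIN raw.customers c ON o.customer_id = c.id
GROUP BY o.order_date
"""
print(extract_dependencies(sql))  # {'raw.orders', 'raw.customers'} (order may vary)
```

Run against the warehouse query log, this yields a table-level dependency graph retroactively, with no changes to pipeline code.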

The OpenLineage Ecosystem

OpenLineage is an open standard for lineage event collection. Key components:

  • Producers: Airflow, Spark, dbt, and Flink emit OpenLineage events
  • Transport: Events sent via HTTP, Kafka, or file
  • Consumers: Marquez, DataHub, and Atlan ingest and display lineage

This decouples lineage collection from lineage visualization. You can switch catalogs without re-instrumenting pipelines.
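The decoupling works because every producer emits the same event shape. As a sketch, here is a hand-built minimal OpenLineage RunEvent; the namespace, job name, and producer URI are placeholders, and in practice the `openlineage-python` client library constructs and sends these for you.

```python
import json
import uuid
from datetime import datetime, timezone

# Minimal OpenLineage RunEvent, hand-built for illustration.
# Namespaces, job name, and producer URI are placeholder values.
event = {
    "eventType": "COMPLETE",
    "eventTime": datetime.now(timezone.utc).isoformat(),
    "run": {"runId": str(uuid.uuid4())},
    "job": {"namespace": "analytics", "name": "daily_revenue"},
    "inputs": [{"namespace": "warehouse", "name": "raw.orders"}],
    "outputs": [{"namespace": "warehouse", "name": "mart.daily_revenue"}],
    "producer": "https://example.com/my-pipeline",
    "schemaURL": "https://openlineage.io/spec/1-0-5/OpenLineage.json",
}

payload = json.dumps(event)
# Any OpenLineage consumer can ingest this payload; Marquez, for example,
# accepts it as an HTTP POST to its /api/v1/lineage endpoint.
print(payload[:60])
```

Because the event, not the backend, is the contract, swapping Marquez for DataHub means pointing the transport at a different URL.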

Data Observability

Data observability extends monitoring beyond pipeline execution to the data itself. The five pillars:

  • Freshness: Is the data up to date?
  • Volume: Did the expected number of rows arrive?
  • Schema: Did the structure change unexpectedly?
  • Distribution: Are column values within expected ranges?
  • Lineage: What upstream/downstream is affected?
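The first two pillars can be sketched as simple threshold checks. This is an assumption-laden toy, not any particular tool's implementation: the inputs would really come from warehouse queries (e.g. `SELECT MAX(loaded_at), COUNT(*) FROM raw.orders`), and the thresholds are arbitrary examples.

```python
from datetime import datetime, timedelta, timezone

def check_freshness(last_loaded_at: datetime, max_lag: timedelta) -> bool:
    """Freshness: the newest record must be recent enough."""
    return datetime.now(timezone.utc) - last_loaded_at <= max_lag

def check_volume(row_count: int, expected: int, tolerance: float = 0.2) -> bool:
    """Volume: row count must fall within +/- tolerance of expectation."""
    return abs(row_count - expected) <= expected * tolerance

# Values below are illustrative; real monitors pull them from the warehouse.
last_loaded = datetime.now(timezone.utc) - timedelta(minutes=30)
print(check_freshness(last_loaded, max_lag=timedelta(hours=1)))  # True
print(check_volume(9_500, expected=10_000))                      # True: within 20%
```

Distribution checks follow the same pattern with column-level statistics (null rate, min/max, cardinality) instead of row counts.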

Tool Landscape

| Tool | Focus | Type | Key Strength |
| --- | --- | --- | --- |
| OpenLineage | Lineage standard | Open standard | Vendor-neutral collection |
| Marquez | Lineage store | Open source | Reference OpenLineage backend |
| Monte Carlo | Observability | Commercial | End-to-end monitoring, incident management |
| Atlan | Catalog + lineage | Commercial | Active metadata + lineage |
| DataHub | Catalog + lineage | Open source | Extensible, column-level lineage |
| Elementary | dbt observability | Open source | Deep dbt integration |
| Soda | Quality + observability | Open source + commercial | SQL-first checks |

Impact Analysis and Root Cause Analysis

Impact analysis (forward lineage): "If I change this table, what downstream dashboards, models, and reports break?"

Use cases:

  • Schema migration planning
  • Deprecating tables safely
  • Understanding blast radius of changes

Root cause analysis (backward lineage): "This metric is wrong -- which upstream pipeline or source caused it?"

Use cases:

  • Incident triage
  • Data quality debugging
  • Audit trail for regulatory compliance
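Both questions reduce to traversal of the same lineage graph: follow edges forward for impact analysis, invert them and walk backward for root cause. A minimal sketch with illustrative table names:

```python
from collections import deque

# Table-level lineage as an adjacency map: table -> tables it feeds.
# All names are illustrative.
downstream = {
    "raw.orders": ["stg.orders"],
    "stg.orders": ["mart.daily_revenue", "mart.customer_ltv"],
    "mart.daily_revenue": ["dash.revenue"],
    "mart.customer_ltv": [],
    "dash.revenue": [],
}

def reachable(graph: dict, start: str) -> set[str]:
    """BFS from `start`; returns every node transitively reachable."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

# Impact analysis: what breaks if stg.orders changes?
print(reachable(downstream, "stg.orders"))

# Root cause analysis: invert the edges, then walk upstream
# from the broken asset.
upstream: dict = {}
for src, dsts in downstream.items():
    for dst in dsts:
        upstream.setdefault(dst, []).append(src)
print(reachable(upstream, "dash.revenue"))
```

The "blast radius" of a change is exactly the forward-reachable set, which is why table-level lineage alone already enables safe deprecation planning.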

Implementation Roadmap

  1. Instrument orchestrators first: Airflow/dbt emit lineage events with minimal effort
  2. Deploy a lineage store: Marquez or DataHub to collect and query lineage
  3. Add column-level lineage: Enable SQL parsing for warehouse transformations
  4. Connect to catalog: Lineage should be visible where people discover data
  5. Build observability monitors: Freshness, volume, and distribution checks on critical tables
  6. Integrate into incident workflow: Lineage-powered impact analysis in PagerDuty/Slack alerts

Resources