
Data Contracts: Schema Agreements Between Producers and Consumers

#data-engineering#data-governance#data-mesh#api

The Problem Data Contracts Solve

In most organizations, data producers (application teams) and data consumers (analytics, ML teams) have no formal agreement. The application team changes a column name, adds a field, or modifies an enum value, and downstream pipelines silently break.

Data contracts establish an explicit agreement between producers and consumers about the shape, semantics, and quality of data.

What a Data Contract Contains

| Element | Description | Example |
|---|---|---|
| Schema | Field names, types, nullability | `user_id: string, not null` |
| Semantics | Business meaning of each field | "user_id is the globally unique identifier from auth system" |
| Quality guarantees | SLOs for freshness, completeness, validity | "Updated within 1 hour, < 0.01% nulls on required fields" |
| Ownership | Team responsible for the contract | "Team: User Platform, Slack: #user-data" |
| Versioning | How the contract evolves | "Semantic versioning: breaking = major, additive = minor" |
| Access policies | Who can consume, under what conditions | "PII fields restricted to compliance-approved consumers" |

Breaking vs Non-Breaking Changes

| Change Type | Breaking | Non-Breaking |
|---|---|---|
| Remove a field | Yes | -- |
| Rename a field | Yes | -- |
| Change field type | Yes | -- |
| Add a new optional field | -- | Yes |
| Widen a type (int32 to int64) | Depends | Usually yes |
| Add an enum value | Depends | If consumers handle unknown values |
| Tighten nullability (nullable to not-null) | -- | Yes |
| Loosen nullability (not-null to nullable) | Yes | -- |

Breaking changes require consumer coordination. Non-breaking changes can be deployed independently.
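The rules in the table can be sketched as a field-level diff check. This is an illustrative helper, not any registry's actual API; the schema shape (`{field: {"type": ..., "required": ...}}`) is an assumption made for the example, and type widening is deliberately omitted for brevity.

```python
# Sketch: classify a schema change as breaking or non-breaking,
# following the table above (consumer perspective, simplified).

def diff_fields(old: dict, new: dict) -> list[str]:
    """Compare two {field_name: {"type": ..., "required": ...}} schemas
    and return the list of breaking changes (empty list = safe to deploy)."""
    breaking = []
    for name, spec in old.items():
        if name not in new:
            # Removal (and rename, which looks like remove + add) breaks consumers.
            breaking.append(f"removed field: {name}")
            continue
        if new[name].get("type") != spec.get("type"):
            breaking.append(f"type change on {name}")
        if spec.get("required") and not new[name].get("required"):
            # Consumers may assume the field is never null.
            breaking.append(f"loosened nullability on {name}")
    for name, spec in new.items():
        if name not in old and spec.get("required"):
            # A new field is only safe if it is optional.
            breaking.append(f"new required field: {name}")
    return breaking
```

For example, dropping `required` on an existing field is flagged, while adding a new optional field produces an empty list.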

Schema Registries

Schema registries enforce contract compliance at runtime:

| Registry | Format Support | Integration |
|---|---|---|
| Confluent Schema Registry | Avro, Protobuf, JSON Schema | Kafka-native |
| AWS Glue Schema Registry | Avro, JSON Schema | AWS ecosystem |
| Apicurio Registry | Avro, Protobuf, JSON Schema, OpenAPI | Open source, vendor-neutral |
| Buf | Protobuf | gRPC/Protobuf workflows |

Schema registries enforce compatibility rules (backward, forward, full) automatically. A producer cannot publish a breaking change without explicitly bumping the version.
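The three compatibility modes can be sketched in terms of field-level checks. This is a deliberate simplification: real registries also account for default values, type promotion, aliases, and union types.

```python
# Sketch of backward / forward / full compatibility, using the same
# simplified {field: {"required": ...}} schema shape as earlier examples.

def backward_compatible(old: dict, new: dict) -> bool:
    """Consumers on the NEW schema can read data written with the OLD one:
    the new schema may drop fields, but any field it adds must be optional."""
    return all(not spec.get("required", False)
               for name, spec in new.items() if name not in old)

def forward_compatible(old: dict, new: dict) -> bool:
    """Consumers on the OLD schema can read data written with the NEW one:
    the new schema may add fields, but may only remove optional ones."""
    return all(not spec.get("required", False)
               for name, spec in old.items() if name not in new)

def full_compatible(old: dict, new: dict) -> bool:
    """Both directions hold, so producers and consumers can upgrade
    in any order."""
    return backward_compatible(old, new) and forward_compatible(old, new)
```

Under these rules, adding an optional field is fully compatible, adding a required field is only forward compatible, and removing a required field is only backward compatible.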

The Data Contract Specification

The open-source Data Contract Specification (datacontract.com) provides a YAML-based format:

```yaml
dataContractSpecification: 0.9.3
id: urn:datacontract:checkout:orders
info:
  title: Orders
  version: 1.0.0
  owner: checkout-team
servers:
  production:
    type: snowflake
    account: xyz
models:
  orders:
    fields:
      order_id:
        type: string
        required: true
        unique: true
      customer_id:
        type: string
        required: true
quality:
  type: SodaCL
  specification:
    checks for orders:
      - row_count > 0
      - freshness(created_at) < 2h
```
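At pipeline execution time, the `models` section above can be enforced with a check like the following sketch. In practice you would parse the YAML and lean on tooling (dbt tests, Great Expectations, the datacontract CLI) rather than hand-rolling this; the `orders_model` dict here simply mirrors the YAML by hand.

```python
# Sketch: enforcing the orders model from the contract above against
# a batch of rows. Hand-rolled for illustration only.

orders_model = {
    "order_id":    {"type": "string", "required": True, "unique": True},
    "customer_id": {"type": "string", "required": True},
}

def validate_rows(rows: list[dict], model: dict) -> list[str]:
    """Return a list of human-readable contract violations."""
    errors = []
    seen = {name: set() for name, spec in model.items() if spec.get("unique")}
    for i, row in enumerate(rows):
        for name, spec in model.items():
            value = row.get(name)
            if spec.get("required") and value is None:
                errors.append(f"row {i}: {name} is required")
            if value is not None and spec["type"] == "string" and not isinstance(value, str):
                errors.append(f"row {i}: {name} must be a string")
            if spec.get("unique") and value is not None:
                if value in seen[name]:
                    errors.append(f"row {i}: duplicate {name}")
                seen[name].add(value)
    return errors
```

A conforming batch yields an empty list; duplicates and missing required values each produce one entry, which a pipeline can log or use to fail the run.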

Contract Testing

Contract testing verifies that data conforms to its contract. This can happen:

  • At write time: schema registry rejects non-conforming messages
  • In CI/CD: schema changes validated against compatibility rules before merge
  • At pipeline execution: dbt tests or Great Expectations validate against contract specs
  • Continuously: observability tools monitor contract SLO compliance
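The last point, continuous SLO monitoring, can be as simple as a scheduled freshness probe. The sketch below matches the `freshness(created_at) < 2h` rule from the contract example; the 2-hour threshold comes from that example, and the function name is illustrative.

```python
# Sketch: a freshness SLO check an observability job might run on a
# schedule, comparing the newest created_at value against the contract's
# freshness threshold.
from datetime import datetime, timedelta, timezone

def freshness_ok(latest_created_at: datetime,
                 threshold: timedelta = timedelta(hours=2)) -> bool:
    """True if the newest record is within the freshness SLO."""
    return datetime.now(timezone.utc) - latest_created_at < threshold
```

A monitoring job would feed this `MAX(created_at)` from the target table and page the owning team (per the contract's ownership element) when it returns False.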

Organizational Implications

Data contracts shift power dynamics:

  • Producers can no longer make silent breaking changes
  • Consumers get predictability and can plan accordingly
  • A contract negotiation process is needed (who decides the schema?)
  • Versioning and deprecation policies must be defined

This is a governance challenge, not just a technical one.

When to Introduce Data Contracts

  • You have recurring pipeline breaks from upstream schema changes
  • Multiple teams consume the same data with different expectations
  • You are adopting Data Mesh and need interoperability guarantees
  • Regulatory requirements demand data lineage and auditability
