Data Contracts: Schema Agreements Between Producers and Consumers
The Problem Data Contracts Solve
In most organizations, data producers (application teams) and data consumers (analytics and ML teams) have no formal agreement about the data that flows between them. The application team renames a column, adds a field, or changes an enum value, and downstream pipelines break silently.
Data contracts establish an explicit agreement between producers and consumers about the shape, semantics, and quality of data.
What a Data Contract Contains
| Element | Description | Example |
|---|---|---|
| Schema | Field names, types, nullability | user_id: string, not null |
| Semantics | Business meaning of each field | "user_id is the globally unique identifier from auth system" |
| Quality guarantees | SLOs for freshness, completeness, validity | "Updated within 1 hour, < 0.01% nulls on required fields" |
| Ownership | Team responsible for the contract | "Team: User Platform, Slack: #user-data" |
| Versioning | How the contract evolves | "Semantic versioning: breaking = major, additive = minor" |
| Access policies | Who can consume, under what conditions | "PII fields restricted to compliance-approved consumers" |
Breaking vs Non-Breaking Changes
| Change Type | Breaking | Non-Breaking |
|---|---|---|
| Remove a field | Yes | -- |
| Rename a field | Yes | -- |
| Change field type | Yes | -- |
| Add a new optional field | -- | Yes |
| Widen a type (int32 to int64) | Depends | Usually yes |
| Add an enum value | Depends | If consumers handle unknown values |
| Tighten nullability (nullable to not-null) | -- | Yes |
| Loosen nullability (not-null to nullable) | Yes | -- |
Breaking changes require coordination with consumers before deployment; non-breaking changes can ship independently. Note that the table classifies changes from the consumer's perspective: tightening nullability restricts what producers may write but asks nothing new of readers, while loosening it forces every consumer to handle nulls it never expected.
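The rules in the table above can be sketched as a small classifier. A minimal sketch in Python, assuming each schema is a plain dict mapping field names to `{"type": ..., "required": ...}` specs (a hypothetical shape for illustration, not a real library's API); it is deliberately conservative and flags any type change as breaking, even widenings the table marks "Depends":

```python
def classify_change(old: dict, new: dict) -> str:
    """Classify a schema change as 'breaking' or 'non-breaking',
    from the consumer's perspective."""
    # Removed or renamed fields: consumers reading them break.
    for name in old:
        if name not in new:
            return "breaking"
    for name, spec in new.items():
        if name not in old:
            # Newly added fields are safe only if optional.
            if spec.get("required"):
                return "breaking"
            continue
        old_spec = old[name]
        # Conservative: treat any type change as breaking.
        if spec["type"] != old_spec["type"]:
            return "breaking"
        # Loosening nullability (required -> optional) breaks
        # consumers that assume the field is always present.
        if old_spec.get("required") and not spec.get("required"):
            return "breaking"
    return "non-breaking"
```

For example, adding an optional `email` field to a schema containing only a required `user_id` classifies as non-breaking, while renaming `user_id` classifies as breaking.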
Schema Registries
Schema registries enforce contract compliance at runtime:
| Registry | Format Support | Integration |
|---|---|---|
| Confluent Schema Registry | Avro, Protobuf, JSON Schema | Kafka-native |
| AWS Glue Schema Registry | Avro, JSON Schema | AWS ecosystem |
| Apicurio Registry | Avro, Protobuf, JSON Schema, OpenAPI | Open source, vendor-neutral |
| Buf | Protobuf | gRPC/Protobuf workflows |
Schema registries enforce compatibility rules (backward, forward, full) automatically. A producer cannot publish a breaking change without explicitly bumping the version.
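The three compatibility modes can be expressed in terms of who reads data written under which schema version. A hedged Python sketch, using simplified field-spec dicts (`{"type": ..., "default": ...}`) as a stand-in for real Avro schemas and their resolution rules:

```python
def can_read(reader: dict, writer: dict) -> bool:
    """Can a consumer using `reader` decode data written with `writer`?
    Simplified stand-in for Avro schema resolution."""
    for name, spec in reader.items():
        if name in writer:
            if spec["type"] != writer[name]["type"]:
                return False  # type mismatch: undecodable
        elif "default" not in spec:
            return False      # reader expects a field the writer never sent
    return True               # extra writer fields are simply ignored

def backward_compatible(old: dict, new: dict) -> bool:
    # Consumers on the new schema can read data produced under the old one.
    return can_read(new, old)

def forward_compatible(old: dict, new: dict) -> bool:
    # Consumers still on the old schema can read data produced under the new one.
    return can_read(old, new)

def full_compatible(old: dict, new: dict) -> bool:
    return backward_compatible(old, new) and forward_compatible(old, new)
```

Under this model, adding a field with a default is both backward and forward compatible, while removing a field is backward compatible only, which mirrors how registries apply these modes.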
The Data Contract Specification
The open-source Data Contract Specification (datacontract.com) provides a YAML-based format:
```yaml
dataContractSpecification: 0.9.3
id: urn:datacontract:checkout:orders
info:
  title: Orders
  version: 1.0.0
  owner: checkout-team
servers:
  production:
    type: snowflake
    account: xyz
models:
  orders:
    fields:
      order_id:
        type: string
        required: true
        unique: true
      customer_id:
        type: string
        required: true
quality:
  type: SodaCL
  specification:
    checks for orders:
      - row_count > 0
      - freshness(created_at) < 2h
```
Contract Testing
Contract testing verifies that data actually conforms to its contract. Validation can happen at several points:
- At write time: schema registry rejects non-conforming messages
- In CI/CD: schema changes validated against compatibility rules before merge
- At pipeline execution: dbt tests or Great Expectations validate against contract specs
- Continuously: observability tools monitor contract SLO compliance
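At pipeline-execution time, the field definitions from a contract can drive row-level checks directly. A minimal sketch in Python, with the `fields` section of the orders model embedded as a dict (in practice it would be parsed from the contract YAML; the helper names are illustrative, not from any particular tool):

```python
def validate_rows(rows: list, fields: dict) -> list:
    """Return a list of contract violations for a batch of rows."""
    errors = []
    seen = {name: set() for name, spec in fields.items() if spec.get("unique")}
    for i, row in enumerate(rows):
        for name, spec in fields.items():
            value = row.get(name)
            if value is None:
                if spec.get("required"):
                    errors.append(f"row {i}: {name} is required but missing")
                continue
            if spec["type"] == "string" and not isinstance(value, str):
                errors.append(f"row {i}: {name} should be a string")
            if spec.get("unique"):
                if value in seen[name]:
                    errors.append(f"row {i}: duplicate {name}={value!r}")
                seen[name].add(value)
    return errors

# Field specs mirroring the orders model of the example contract.
ORDER_FIELDS = {
    "order_id": {"type": "string", "required": True, "unique": True},
    "customer_id": {"type": "string", "required": True},
}
```

A conforming batch yields an empty error list; a row with a missing `customer_id` or a duplicated `order_id` yields one violation message per failed check, which an orchestrator can surface or use to fail the pipeline run.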
Organizational Implications
Data contracts shift power dynamics:
- Producers can no longer make silent breaking changes
- Consumers get predictability and can plan accordingly
- A contract negotiation process is needed (who decides the schema?)
- Versioning and deprecation policies must be defined
This is a governance challenge, not just a technical one.
When to Introduce Data Contracts
- You have recurring pipeline breaks from upstream schema changes
- Multiple teams consume the same data with different expectations
- You are adopting Data Mesh and need interoperability guarantees
- Regulatory requirements demand data lineage and auditability