Change Data Capture: Patterns, Tools & Architecture
Change Data Capture (CDC) is the backbone of modern data integration. Rather than repeatedly querying entire tables, CDC detects and propagates only the rows that changed. The result: lower latency, reduced database load, and, with log- or trigger-based capture, a reliable audit trail of every mutation.
CDC Method Comparison
| Method | Mechanism | Latency | DB Load | Schema Awareness | Data Loss Risk |
|---|---|---|---|---|---|
| Log-based | Reads database transaction log (WAL/binlog) | Very low (milliseconds to seconds) | Minimal (no query overhead) | High (schema changes visible in the log; DDL capture varies by database) | Low |
| Trigger-based | DB triggers write changes to a shadow table | Low (near real-time) | Medium — extra writes per DML | Medium — manual setup | Low |
| Query-based (timestamp) | Polls rows where updated_at > last_sync | Medium (minutes) | High — full scans possible | None | Medium — deletes missed |
| Query-based (diff) | Compares snapshots between runs | High (hours) | Very high | None | Medium |
| Dual-write | Application writes to DB + event bus | Very low | Low on DB | Application-level | High — consistency risk |
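Log-based CDC is the de facto standard for production systems. It reads the database's own replication stream, so it adds little load to query traffic and needs no triggers, shadow tables, or updated_at columns on the source schema. It does need the log to be configured for change capture, for example logical replication on PostgreSQL or row-based binlog on MySQL.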
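To make the "deletes missed" entry in the comparison table concrete, here is a minimal sketch of timestamp polling against an in-memory SQLite database. The `customers` table, the `poll_changes` helper, and the integer `updated_at` values (a stand-in for real timestamps) are all assumptions made for illustration.

```python
import sqlite3


def poll_changes(conn, last_sync):
    """Return rows whose updated_at is newer than the last sync watermark."""
    cur = conn.execute(
        "SELECT id, name, updated_at FROM customers "
        "WHERE updated_at > ? ORDER BY updated_at, id",
        (last_sync,),
    )
    return cur.fetchall()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, updated_at INTEGER)"
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "Ada", 100), (2, "Grace", 100)],
)

# First poll acts as the initial load and establishes the watermark.
first_batch = poll_changes(conn, last_sync=0)
watermark = max(row[2] for row in first_batch)

# One row is updated, one row is deleted.
conn.execute("UPDATE customers SET name = 'Ada L.', updated_at = 200 WHERE id = 1")
conn.execute("DELETE FROM customers WHERE id = 2")

# The second poll sees the update but has no way to see the delete.
second_batch = poll_changes(conn, last_sync=watermark)
print(second_batch)  # [(1, 'Ada L.', 200)]
```

The deleted row simply stops matching the query, so a downstream copy would keep customer 2 forever unless the source uses soft deletes or the pipeline runs a separate reconciliation pass. A log-based reader does not have this blind spot because the delete is itself an entry in the log.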
Tool Landscape
| Tool | Type | Sources | Destinations | Pricing Model | Best For |
|---|---|---|---|---|---|
| Debezium | Open source (Kafka Connect) | PostgreSQL, MySQL, MongoDB, Oracle, SQL Server | Kafka, Pulsar | Free | Teams with Kafka expertise |
| AWS DMS | Managed service | 20+ relational/NoSQL | S3, Kinesis, Redshift, RDS | Per-instance-hour | AWS-native migrations |
| Fivetran | SaaS | 300+ connectors | Warehouses, lakes | Per-row pricing | Low-ops teams |
| Airbyte | Open source / Cloud | 350+ connectors | Warehouses, lakes | Free / usage-based | Flexibility + community |
| Striim | Enterprise platform | Oracle, SAP, mainframes | Multi-cloud | License | Legacy modernization |
| Arcion | Cloud-native CDC | 20+ databases | Warehouses, lakes | Usage-based | High-throughput replication |
Use Case Decision Matrix
| Use Case | Recommended Method | Tool Suggestion | Key Consideration |
|---|---|---|---|
| Real-time analytics dashboard | Log-based CDC | Debezium + Kafka | Sub-second latency required |
| Data warehouse sync (batch-tolerant) | Query-based or log-based | Fivetran, Airbyte | Simplicity over latency |
| Microservice event sourcing | Log-based CDC (outbox pattern) | Debezium | Transactional outbox avoids dual-write |
| Legacy mainframe offload | Trigger-based or log-based | Striim, AWS DMS | Connector availability |
| Multi-region replication | Log-based CDC | Debezium, Arcion | Network partitioning, conflict resolution |
| Compliance / audit trail | Log-based CDC | Debezium + immutable store | Full history retention |
Reference Architecture: Log-Based CDC Pipeline
┌──────────────┐     ┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│  Source DB   │────▶│   Debezium   │────▶│    Kafka     │────▶│     Sink     │
│  (Postgres)  │ WAL │  Connector   │ CDC │    Topics    │     │  Connector   │
│              │     │              │ JSON│              │     │              │
└──────────────┘     └──────────────┘     └──────┬───────┘     └──────┬───────┘
                                                 │                    │
                                          ┌──────▼───────┐     ┌──────▼───────┐
                                          │    Stream    │     │     Data     │
                                          │  Processing  │     │  Warehouse   │
                                          │   (Flink)    │     │  (Iceberg)   │
                                          └──────────────┘     └──────────────┘
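The sink side of this pipeline boils down to replaying change events in order. The sketch below applies Debezium-style JSON events to an in-memory replica. It relies on the `op`, `before`, and `after` fields of the Debezium change event envelope; the `apply_change_event` helper, the `id` key field, and the sample events are hypothetical and stand in for a real sink connector.

```python
import json


def apply_change_event(replica, raw_event, key_field="id"):
    """Apply one Debezium-style change event to a dict keyed by primary key."""
    event = json.loads(raw_event)
    # Events may arrive wrapped in a schema/payload envelope or as the bare payload.
    payload = event.get("payload", event)
    op = payload["op"]
    if op in ("c", "r", "u"):  # create, snapshot read, update
        row = payload["after"]
        replica[row[key_field]] = row  # upsert, so replays are harmless
    elif op == "d":  # delete: only "before" carries the row
        replica.pop(payload["before"][key_field], None)
    else:
        raise ValueError(f"unsupported op: {op!r}")
    return replica


events = [
    json.dumps({"payload": {"op": "c", "before": None,
                            "after": {"id": 1, "status": "new"}}}),
    json.dumps({"payload": {"op": "u", "before": {"id": 1, "status": "new"},
                            "after": {"id": 1, "status": "paid"}}}),
    json.dumps({"payload": {"op": "c", "before": None,
                            "after": {"id": 2, "status": "new"}}}),
    json.dumps({"payload": {"op": "d", "before": {"id": 2, "status": "new"},
                            "after": None}}),
]

replica = {}
for raw in events:
    apply_change_event(replica, raw)

print(replica)  # {1: {'id': 1, 'status': 'paid'}}
```

Writing the apply step as an upsert plus a tolerant delete is a deliberate choice: Kafka consumers generally see at-least-once delivery, so the sink should produce the same state when an event is delivered twice.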
Outbox Pattern for Microservices
The transactional outbox pattern solves the dual-write problem. Instead of writing to both a database and a message broker, which can leave the two inconsistent if either write fails, the application writes domain events to an outbox table in the same transaction as the business data. A CDC connector then reads the outbox inserts from the transaction log and publishes them to Kafka.
┌─────────────────────────────────────────┐
│           Application Service           │
│                                         │
│  BEGIN TRANSACTION                      │
│    INSERT INTO orders (...)             │
│    INSERT INTO outbox (event_payload)   │
│  COMMIT                                 │
└──────────────────┬──────────────────────┘
                   │ WAL
          ┌────────▼────────┐
          │  CDC Connector  │
          └────────┬────────┘
                   │
          ┌────────▼────────┐
          │ Message Broker  │
          │     (Kafka)     │
          └─────────────────┘
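The application half of the diagram can be sketched with SQLite standing in for the source database. The table layouts, the `place_order` function, the `OrderPlaced` event type, and the `fail` flag are assumptions for illustration; the point is only that the order row and the outbox row commit or roll back together.

```python
import json
import sqlite3


def place_order(conn, order_id, amount, fail=False):
    """Write the order row and its outbox event in a single transaction."""
    with conn:  # commits on success, rolls back if an exception escapes
        conn.execute(
            "INSERT INTO orders (id, amount) VALUES (?, ?)", (order_id, amount)
        )
        if fail:
            raise RuntimeError("simulated failure before the outbox write")
        conn.execute(
            "INSERT INTO outbox (aggregate_id, event_type, event_payload) "
            "VALUES (?, ?, ?)",
            (order_id, "OrderPlaced",
             json.dumps({"order_id": order_id, "amount": amount})),
        )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")
conn.execute(
    "CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT, "
    "aggregate_id INTEGER, event_type TEXT, event_payload TEXT)"
)
conn.commit()

place_order(conn, 1, 42.0)
try:
    place_order(conn, 2, 10.0, fail=True)
except RuntimeError:
    pass  # the whole transaction, including the order insert, was rolled back

order_ids = [row[0] for row in conn.execute("SELECT id FROM orders ORDER BY id")]
outbox_ids = [row[0] for row in conn.execute("SELECT aggregate_id FROM outbox")]
print(order_ids, outbox_ids)  # [1] [1]
```

The failed order leaves no trace in either table, so there is never an order without an event or an event without an order. Publishing is left entirely to the CDC connector reading the log, which is what removes the second write from the application.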
Key Metrics to Monitor
| Metric | Target | Why It Matters |
|---|---|---|
| Replication lag | < 5 seconds | Stale reads in downstream systems |
| Connector uptime | > 99.9% | Gaps cause data loss or duplicates |
| Event throughput | Depends on workload | Bottlenecks cascade downstream |
| Schema change failures | 0 | Breaking changes halt pipelines |
| Snapshot duration | Minimize | Initial load blocks incremental sync |
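Replication lag can be estimated per event by comparing the consumer's clock with the time the change was made at the source. The sketch below assumes Debezium-style events, where `source.ts_ms` carries the source database's change time in milliseconds; the helper names, the sample events, and the fixed `now_ms` are hypothetical, and the 5000 ms threshold mirrors the target in the table above.

```python
import time

LAG_TARGET_MS = 5_000  # the "< 5 seconds" target from the metrics table


def replication_lag_ms(event, now_ms=None):
    """Milliseconds between the source change and the moment it is observed."""
    payload = event.get("payload", event)
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    return now_ms - payload["source"]["ts_ms"]


def lag_breaches(events, now_ms, target_ms=LAG_TARGET_MS):
    """Return the events whose lag exceeds the target."""
    return [e for e in events if replication_lag_ms(e, now_ms) > target_ms]


# A fixed "current time" keeps the example deterministic.
now_ms = 1_700_000_010_000
events = [
    {"payload": {"op": "u", "source": {"ts_ms": 1_700_000_008_000}}},
    {"payload": {"op": "u", "source": {"ts_ms": 1_700_000_002_000}}},
]

lags = [replication_lag_ms(e, now_ms) for e in events]
print(lags)                               # [2000, 8000]
print(len(lag_breaches(events, now_ms)))  # 1
```

This measures end-to-end lag as seen by one consumer, so it depends on the consumer and database clocks being reasonably in sync. In practice it is usually tracked alongside connector-level metrics and Kafka consumer group lag instead of replacing them.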