Change Data Capture: Patterns, Tools & Architecture

#data-engineering #cdc #databases #streaming #replication

Change Data Capture (CDC) is the backbone of modern data integration. Rather than repeatedly querying entire tables, CDC detects and propagates only the rows that changed. The result: lower latency, reduced database load, and, with log-based capture, a reliable audit trail of every mutation.

CDC Method Comparison

| Method | Mechanism | Latency | DB Load | Schema Awareness | Data Loss Risk |
| --- | --- | --- | --- | --- | --- |
| Log-based | Reads database transaction log (WAL/binlog) | Very low (seconds) | Minimal (no query overhead) | High (captures DDL events) | Low |
| Trigger-based | DB triggers write changes to a shadow table | Low (near real-time) | Medium (extra writes per DML) | Medium (manual setup) | Low |
| Query-based (timestamp) | Polls rows where updated_at > last_sync | Medium (minutes) | High (full scans possible) | None | Medium (deletes missed) |
| Query-based (diff) | Compares snapshots between runs | High (hours) | Very high | None | Medium |
| Dual-write | Application writes to DB + event bus | Very low | Low on DB | Application-level | High (consistency risk) |
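To make the "deletes missed" entry for timestamp polling concrete, here is a minimal sketch using an in-memory SQLite database. The `orders` table, its columns, and the `poll_changes` helper are hypothetical names chosen for illustration; integer timestamps stand in for a real `updated_at` column.

```python
import sqlite3

def poll_changes(conn, last_sync):
    """Return rows modified after last_sync, plus the new high-water mark."""
    rows = conn.execute(
        "SELECT id, status, updated_at FROM orders "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_sync,),
    ).fetchall()
    # Advance the mark only when something was returned.
    new_mark = rows[-1][2] if rows else last_sync
    return rows, new_mark

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT, updated_at INTEGER)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)", [(1, "new", 100), (2, "new", 101)]
)

# First poll picks up both inserted rows.
first_batch, mark = poll_changes(conn, 0)

# One row is updated, the other is deleted.
conn.execute("UPDATE orders SET status = 'paid', updated_at = 102 WHERE id = 1")
conn.execute("DELETE FROM orders WHERE id = 2")

# Second poll sees the update, but the deleted row simply no longer matches.
second_batch, mark = poll_changes(conn, mark)
```

The second poll returns only the updated row. Nothing in the result signals that order 2 is gone, which is why polling setups usually need soft deletes or a periodic full comparison.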

Log-based CDC is the industry standard for production systems. It reads the database's own replication stream, so it adds little load to query traffic and requires no changes to application tables.
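As an illustration of how little the source schema is involved, the sketch below builds a registration payload for a Debezium PostgreSQL connector (Debezium appears in the tool landscape below). The property names follow Debezium 2.x as I understand it. The hostnames, credentials, slot name, and table list are placeholders, and the source database is assumed to already run with `wal_level=logical`.

```python
import json

# Hypothetical connector definition; every value below is a placeholder.
connector = {
    "name": "orders-connector",
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "plugin.name": "pgoutput",            # PostgreSQL's built-in logical decoding plugin
        "database.hostname": "db.example.internal",
        "database.port": "5432",
        "database.user": "cdc_user",
        "database.password": "change-me",
        "database.dbname": "shop",
        "topic.prefix": "shop",               # prefix for the Kafka topic names
        "slot.name": "orders_slot",           # replication slot the connector reads from
        "table.include.list": "public.orders",
    },
}

# JSON body that would be sent to the Kafka Connect REST API (POST /connectors).
payload = json.dumps(connector, indent=2)
print(payload)
```

Nothing here touches the `orders` table itself. The connector only needs a user with replication privileges and a replication slot on the server.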

Tool Landscape

| Tool | Type | Sources | Destinations | Pricing Model | Best For |
| --- | --- | --- | --- | --- | --- |
| Debezium | Open source (Kafka Connect) | PostgreSQL, MySQL, MongoDB, Oracle, SQL Server | Kafka, Pulsar | Free | Teams with Kafka expertise |
| AWS DMS | Managed service | 20+ relational/NoSQL | S3, Kinesis, Redshift, RDS | Per-instance-hour | AWS-native migrations |
| Fivetran | SaaS | 300+ connectors | Warehouses, lakes | Per-row pricing | Low-ops teams |
| Airbyte | Open source / Cloud | 350+ connectors | Warehouses, lakes | Free / usage-based | Flexibility + community |
| Striim | Enterprise platform | Oracle, SAP, mainframes | Multi-cloud | License | Legacy modernization |
| Arcion | Cloud-native CDC | 20+ databases | Warehouses, lakes | Usage-based | High-throughput replication |

Use Case Decision Matrix

| Use Case | Recommended Method | Tool Suggestion | Key Consideration |
| --- | --- | --- | --- |
| Real-time analytics dashboard | Log-based CDC | Debezium + Kafka | Sub-second latency required |
| Data warehouse sync (batch-tolerant) | Query-based or log-based | Fivetran, Airbyte | Simplicity over latency |
| Microservice event sourcing | Log-based CDC (outbox pattern) | Debezium | Transactional outbox avoids dual-write |
| Legacy mainframe offload | Trigger-based or log-based | Striim, AWS DMS | Connector availability |
| Multi-region replication | Log-based CDC | Debezium, Arcion | Network partitioning, conflict resolution |
| Compliance / audit trail | Log-based CDC | Debezium + immutable store | Full history retention |

Reference Architecture: Log-Based CDC Pipeline

┌──────────────┐     ┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│  Source DB   │────▶│  Debezium    │────▶│    Kafka     │────▶│   Sink       │
│  (Postgres)  │ WAL │  Connector   │ CDC │   Topics     │     │  Connector   │
│              │     │              │ JSON│              │     │              │
└──────────────┘     └──────────────┘     └──────┬───────┘     └──────┬───────┘
                                                 │                    │
                                          ┌──────▼───────┐    ┌──────▼───────┐
                                          │  Stream      │    │  Data        │
                                          │  Processing  │    │  Warehouse   │
                                          │  (Flink)     │    │  (Iceberg)   │
                                          └──────────────┘    └──────────────┘
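On the sink side of this pipeline, each change event has to be turned into an upsert or a delete. The sketch below assumes events shaped like Debezium's change envelope (`op`, `before`, `after`) with the outer schema wrapper already removed. The in-memory `replica` dict stands in for a real sink, and `apply_event` is a hypothetical helper, not part of any connector API.

```python
def apply_event(replica, event, key="id"):
    """Apply one change event to a dict keyed by primary key."""
    op = event["op"]
    if op in ("c", "r", "u"):
        # create, snapshot read, and update all carry the new row in "after"
        row = event["after"]
        replica[row[key]] = row
    elif op == "d":
        # delete carries the old row (at least its key) in "before"
        replica.pop(event["before"][key], None)
    else:
        raise ValueError(f"unsupported op: {op!r}")
    return replica

events = [
    {"op": "c", "before": None, "after": {"id": 1, "status": "new"}},
    {"op": "u", "before": {"id": 1, "status": "new"},
     "after": {"id": 1, "status": "paid"}},
    {"op": "c", "before": None, "after": {"id": 2, "status": "new"}},
    {"op": "d", "before": {"id": 2, "status": "new"}, "after": None},
]

replica = {}
for event in events:
    apply_event(replica, event)
```

Writing the apply step as an upsert-or-delete keyed on the primary key keeps it idempotent, so replaying the same events after a connector restart leaves the replica in the same state.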

Outbox Pattern for Microservices

The transactional outbox pattern solves the dual-write problem. Instead of writing to both a database and a message broker, the application writes each domain event to an outbox table in the same transaction as the business data it describes, so the two either commit together or not at all. CDC then tails that table and publishes events to Kafka.

┌─────────────────────────────────────────┐
│          Application Service            │
│                                         │
│  BEGIN TRANSACTION                      │
│    INSERT INTO orders (...)             │
│    INSERT INTO outbox (event_payload)   │
│  COMMIT                                 │
└──────────────────┬──────────────────────┘
                   │ WAL
          ┌────────▼─────────┐
          │   CDC Connector  │
          └────────┬─────────┘
                   │
          ┌────────▼─────────┐
          │   Message Broker │
          │   (Kafka)        │
          └──────────────────┘
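The application side of the outbox pattern can be sketched with SQLite standing in for the source database. The table layout, the `OrderPlaced` event type, and the `place_order` function are hypothetical. The optional `event_id` parameter exists only so that a failure of the outbox insert can be provoked.

```python
import json
import sqlite3
import uuid

def place_order(conn, customer_id, total_cents, event_id=None):
    """Insert the order and its outbox event in a single transaction."""
    event_id = event_id or str(uuid.uuid4())
    with conn:  # commits on success, rolls back if either insert raises
        cur = conn.execute(
            "INSERT INTO orders (customer_id, total_cents) VALUES (?, ?)",
            (customer_id, total_cents),
        )
        order_id = cur.lastrowid
        conn.execute(
            "INSERT INTO outbox (event_id, aggregate_id, event_type, event_payload) "
            "VALUES (?, ?, ?, ?)",
            (
                event_id,
                str(order_id),
                "OrderPlaced",
                json.dumps({"order_id": order_id, "total_cents": total_cents}),
            ),
        )
    return order_id

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, "
    "customer_id INTEGER NOT NULL, total_cents INTEGER NOT NULL)"
)
conn.execute(
    "CREATE TABLE outbox (event_id TEXT PRIMARY KEY, aggregate_id TEXT, "
    "event_type TEXT, event_payload TEXT)"
)

order_id = place_order(conn, customer_id=7, total_cents=1999)
```

If the outbox insert fails, for example on a duplicate `event_id`, the order insert is rolled back with it, so no order can exist without its event. The application never talks to the broker; the CDC connector is the only publisher.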

Key Metrics to Monitor

| Metric | Target | Why It Matters |
| --- | --- | --- |
| Replication lag | < 5 seconds | Stale reads in downstream systems |
| Connector uptime | > 99.9% | Gaps cause data loss or duplicates |
| Event throughput | Depends on workload | Bottlenecks cascade downstream |
| Schema change failures | 0 | Breaking changes halt pipelines |
| Snapshot duration | Minimize | Initial load blocks incremental sync |
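Replication lag can be estimated per event by comparing the consumer's clock with the commit timestamp carried in the event. The sketch below assumes Debezium-style events where `source.ts_ms` holds the time of the change in the source database, in epoch milliseconds. The function names are hypothetical, and the 5-second default mirrors the target in the table above.

```python
import time

def replication_lag_seconds(event, now_ms=None):
    """Seconds between the source commit and the moment the event is observed."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    # Clamp at zero so small clock skew never reports negative lag.
    return max(0.0, (now_ms - event["source"]["ts_ms"]) / 1000.0)

def lag_breaches(events, threshold_s=5.0, now_ms=None):
    """Return the events whose lag exceeds the threshold."""
    return [e for e in events if replication_lag_seconds(e, now_ms) > threshold_s]

# Fixed timestamps keep the example deterministic.
now = 1_700_000_007_500
events = [
    {"op": "u", "source": {"ts_ms": 1_700_000_000_000}},  # 7.5 s behind
    {"op": "c", "source": {"ts_ms": 1_700_000_006_000}},  # 1.5 s behind
]
late = lag_breaches(events, now_ms=now)
```

This measure depends on the source and consumer clocks agreeing, so in practice it is worth tracking alongside the connector's own lag metrics rather than replacing them.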

:::