# Streaming vs Batch Processing: Architecture Trade-offs

#data-engineering #streaming #batch #kafka #architecture
Not every data pipeline needs to be real-time. Choosing between streaming and batch — or combining both — depends on latency requirements, cost constraints, and operational complexity tolerance.
## At a Glance
| Dimension | Batch | Streaming |
|---|---|---|
| Latency | Minutes to hours | Milliseconds to seconds |
| Processing model | Bounded datasets (files, partitions) | Unbounded event streams |
| Complexity | Lower — easier to debug, replay, test | Higher — ordering, exactly-once, backpressure |
| Cost | Pay per job (start/stop) | Always-on infrastructure |
| Error recovery | Reprocess the batch | Complex (replay from offset, dead letter queues) |
| Typical tools | Spark, dbt, Airflow | Kafka, Flink, Kafka Streams, Kinesis |
## When Batch Is Enough
- Daily/hourly reporting — dashboards refreshed on a schedule
- Data warehouse loading — nightly ELT pipelines
- ML model training — training on historical data, not live streams
- Cost-sensitive workloads — batch Spark on spot instances can be 5–10x cheaper than an always-on Flink cluster
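The batch model boils down to: read a bounded dataset, aggregate, write, exit. A minimal sketch in plain Python (the record shape and the `batch_daily_revenue` name are illustrative, not from any framework):

```python
from collections import defaultdict
from datetime import datetime

def batch_daily_revenue(records):
    """Aggregate a bounded dataset of order records into daily revenue totals.

    Because the input is bounded, the job reads everything, aggregates,
    and exits -- error recovery is simply rerunning the whole batch.
    """
    totals = defaultdict(float)
    for rec in records:
        day = datetime.fromisoformat(rec["ts"]).date().isoformat()
        totals[day] += rec["amount"]
    return dict(totals)

orders = [
    {"ts": "2024-05-01T10:00:00", "amount": 10.0},
    {"ts": "2024-05-01T18:30:00", "amount": 5.0},
    {"ts": "2024-05-02T09:15:00", "amount": 7.5},
]
print(batch_daily_revenue(orders))
# → {'2024-05-01': 15.0, '2024-05-02': 7.5}
```

In a real pipeline the loop body would be a Spark or dbt transformation, but the operational shape — start, process a closed set, stop — is the same.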
## When Streaming Is Required
- Fraud detection — decisions in milliseconds
- Real-time personalization — recommendations during a user session
- Operational monitoring — alerting on live system metrics
- Event-driven microservices — services reacting to domain events
- IoT — continuous sensor data processing
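Streaming inverts the model: the input never ends, so results must be emitted incrementally, typically per time window. A toy tumbling-window counter over an event iterator (in production the iterator would be a Kafka consumer loop; this sketch assumes in-order timestamps and ignores late data, which is where real engines like Flink earn their complexity):

```python
def tumbling_window_counts(events, window_ms=1000):
    """Count events per fixed (tumbling) time window over an unbounded stream.

    `events` is any iterable of (timestamp_ms, payload) pairs. A window's
    count is emitted as soon as an event arrives past its boundary, so
    output is incremental rather than produced at end-of-input.
    """
    current_window = None
    count = 0
    for ts, _payload in events:
        window = ts - (ts % window_ms)          # align to window start
        if current_window is None:
            current_window = window
        if window != current_window:
            yield (current_window, count)       # close the previous window
            current_window, count = window, 0
        count += 1
    if current_window is not None:              # flush when the iterator ends
        yield (current_window, count)
```

For example, `list(tumbling_window_counts([(100, "a"), (200, "b"), (1500, "c")]))` yields `[(0, 2), (1000, 1)]`: two events in the first one-second window, one in the second.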
## Architecture Patterns

### Lambda Architecture
Run both batch and streaming pipelines. Batch for accuracy (reprocessing), streaming for speed. Merge results in a serving layer. Downside: maintaining two codebases for the same logic.
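The serving-layer merge is the characteristic Lambda move: accurate-but-stale batch totals combined with fresh speed-layer deltas. A minimal sketch, assuming the speed layer only holds events newer than the last completed batch run (all names here are illustrative):

```python
def lambda_serving_view(batch_view, speed_view):
    """Merge batch-layer totals (accurate, hours stale) with
    speed-layer deltas (approximate, seconds fresh).

    Assumes the speed layer is truncated after each batch run, so the
    two views never double-count the same events.
    """
    merged = dict(batch_view)
    for key, delta in speed_view.items():
        merged[key] = merged.get(key, 0) + delta
    return merged

# Batch run finished at midnight; speed layer covers events since then.
batch = {"page_a": 1000, "page_b": 500}
speed = {"page_a": 12, "page_c": 3}
print(lambda_serving_view(batch, speed))
# → {'page_a': 1012, 'page_b': 500, 'page_c': 3}
```

The merge itself is trivial; the maintenance cost the section warns about comes from keeping the batch and speed pipelines that *produce* these two views logically equivalent.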
### Kappa Architecture
Streaming only. Reprocess by replaying the event log (Kafka retention). Simpler to maintain but requires robust stream processing infrastructure.
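In Kappa, "reprocessing" means re-running the consumer against the retained log from an earlier offset with new logic. A stand-in for a Kafka topic as an append-only, offset-addressable list (the `EventLog` class is a teaching sketch, not a Kafka API):

```python
class EventLog:
    """Minimal stand-in for a Kafka topic with retention: an append-only
    list readable from any offset. In a Kappa architecture, deploying a
    bug fix means replaying this log from offset 0 through the new code.
    """
    def __init__(self):
        self._events = []

    def append(self, event):
        self._events.append(event)
        return len(self._events) - 1          # offset of the new event

    def replay(self, from_offset=0):
        yield from self._events[from_offset:]

log = EventLog()
for amount in (10, 20, 30):
    log.append({"amount": amount})

# v1 logic: total revenue
v1_total = sum(e["amount"] for e in log.replay())
# v2 logic after a requirements change: count large orders --
# same log, replayed from offset 0, no second pipeline needed
v2_count = sum(1 for e in log.replay(from_offset=0) if e["amount"] >= 20)
print(v1_total, v2_count)
# → 60 2
```

The catch, as noted above, is that the log must retain events long enough to replay, and the stream processor must handle a full-history replay at acceptable speed.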
### Medallion with Streaming
Bronze (raw events) → Silver (cleaned, deduplicated) → Gold (aggregated). Can be implemented with streaming (Flink/Spark Structured Streaming) or batch (dbt) at each layer.
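The layer boundaries can be sketched as two plain transformations, independent of whether each runs as a Flink job or a dbt model (event shapes and function names here are illustrative):

```python
def bronze_to_silver(raw_events):
    """Silver layer: drop malformed events and deduplicate by event id."""
    seen = set()
    for e in raw_events:
        if "id" not in e or "amount" not in e:
            continue              # malformed -> skipped (or dead-lettered)
        if e["id"] in seen:
            continue              # duplicate delivery from the source
        seen.add(e["id"])
        yield e

def silver_to_gold(clean_events):
    """Gold layer: aggregate for serving -- here, total revenue."""
    return sum(e["amount"] for e in clean_events)

bronze = [
    {"id": 1, "amount": 10.0},
    {"id": 1, "amount": 10.0},    # duplicate
    {"amount": 99.0},             # malformed: no id
    {"id": 2, "amount": 5.0},
]
print(silver_to_gold(bronze_to_silver(bronze)))
# → 15.0
```

Keeping the raw bronze events means the silver and gold layers can always be rebuilt when cleaning rules change — the same replayability argument that motivates Kappa.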
## Tool Landscape
| Tool | Type | Strengths |
|---|---|---|
| Apache Kafka | Event streaming platform | Durable log, ecosystem (Connect, Streams, ksqlDB) |
| Apache Flink | Stream processor | True streaming, exactly-once, complex event processing |
| Spark Structured Streaming | Micro-batch streaming | Unified batch+stream API, wide adoption |
| Kafka Streams | Lightweight stream library | No separate cluster needed, Java/Kotlin native |
| AWS Kinesis | Managed streaming | AWS-native, serverless option (Data Firehose) |
| GCP Dataflow | Managed Beam runner | Auto-scaling, unified batch+stream |
| Apache Beam | Unified programming model | Write once, run on Flink/Spark/Dataflow |
## Decision Matrix
| Your requirement | Recommended approach |
|---|---|
| Latency > 1 hour acceptable | Batch |
| Latency 1–15 minutes | Micro-batch (Spark Structured Streaming) |
| Latency < 1 second | True streaming (Flink, Kafka Streams) |
| Both historical and live analysis | Lambda or Kappa |
| Budget-constrained | Batch first, add streaming only where needed |
## Resources
- Streaming Systems — Tyler Akidau et al.
- Kafka: The Definitive Guide — Free from Confluent
- Designing Data-Intensive Applications — Chapter 11 on stream processing
- Flink documentation — Reference for stateful stream processing