
Streaming vs Batch Processing: Architecture Trade-offs

#data-engineering #streaming #batch #kafka #architecture

Not every data pipeline needs to be real-time. Choosing between streaming and batch — or combining both — depends on latency requirements, cost constraints, and operational complexity tolerance.

At a Glance

| Dimension | Batch | Streaming |
| --- | --- | --- |
| Latency | Minutes to hours | Milliseconds to seconds |
| Processing model | Bounded datasets (files, partitions) | Unbounded event streams |
| Complexity | Lower — easier to debug, replay, test | Higher — ordering, exactly-once, backpressure |
| Cost | Pay per job (start/stop) | Always-on infrastructure |
| Error recovery | Reprocess the batch | Complex (replay from offset, dead-letter queues) |
| Typical tools | Spark, dbt, Airflow | Kafka, Flink, Kafka Streams, Kinesis |

When Batch Is Enough

  • Daily/hourly reporting — dashboards refreshed on a schedule
  • Data warehouse loading — nightly ELT pipelines
  • ML model training — training on historical data, not live streams
  • Cost-sensitive workloads — batch Spark on spot instances is 5–10x cheaper than always-on Flink
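The defining property of a batch job is that its input is bounded: read everything, aggregate, write, stop. As a minimal sketch, here is a daily-revenue aggregation in plain Python standing in for a Spark or dbt job; the record fields (`day`, `product`, `amount`) are illustrative.

```python
from collections import defaultdict

def run_batch(records):
    """One pass over a bounded dataset: total revenue per (day, product).
    The whole input is available up front, so there is no windowing,
    no ordering concern, and recovery is simply 'rerun the job'."""
    totals = defaultdict(float)
    for r in records:
        totals[(r["day"], r["product"])] += r["amount"]
    return dict(totals)

sales = [
    {"day": "2024-06-01", "product": "a", "amount": 10.0},
    {"day": "2024-06-01", "product": "a", "amount": 5.0},
    {"day": "2024-06-01", "product": "b", "amount": 3.0},
]
print(run_batch(sales))  # {('2024-06-01', 'a'): 15.0, ('2024-06-01', 'b'): 3.0}
```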

When Streaming Is Required

  • Fraud detection — decisions in milliseconds
  • Real-time personalization — recommendations during a user session
  • Operational monitoring — alerting on live system metrics
  • Event-driven microservices — services reacting to domain events
  • IoT — continuous sensor data processing
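What makes these workloads different is that the input never ends, so results must be emitted incrementally — typically per window. A tumbling-window count over an in-order event stream can be sketched in plain Python (real engines like Flink add state backends, watermarks for late data, and exactly-once sinks on top of this core idea):

```python
def tumbling_counts(events, window_s=60):
    """Count events per tumbling window over an unbounded stream.
    Emits a (window_start, count) pair each time a window closes.
    events: iterable of (epoch_seconds, payload); assumed time-ordered."""
    current, count = None, 0
    for ts, _payload in events:
        start = ts - ts % window_s
        if current is None:
            current = start
        if start != current:       # window boundary crossed: flush
            yield current, count
            current, count = start, 0
        count += 1
    if current is not None:
        yield current, count       # final flush (real streams never end)

stream = [(0, "a"), (10, "b"), (65, "c"), (130, "d")]
print(list(tumbling_counts(stream)))  # [(0, 2), (60, 1), (120, 1)]
```

Note how much batch simplicity is lost: the job must hold state between events, decide when a window is "done", and cope with out-of-order arrivals — the complexity row in the table above.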

Architecture Patterns

Lambda Architecture

Run both batch and streaming pipelines. Batch for accuracy (reprocessing), streaming for speed. Merge results in a serving layer. Downside: maintaining two codebases for the same logic.
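The serving layer's job is just to stitch the two views together: the batch view is authoritative up to the last batch run, and the speed layer covers the gap since then. A toy sketch (all names and the timestamp scheme are illustrative):

```python
def serve(key, batch_view, speed_view, batch_horizon):
    """Lambda serving layer: batch view covers everything up to
    batch_horizon; the speed layer fills in events newer than that."""
    recent = sum(
        v for (k, ts), v in speed_view.items()
        if k == key and ts > batch_horizon
    )
    return batch_view.get(key, 0) + recent

batch_view = {"clicks:page1": 100}                # recomputed nightly
speed_view = {("clicks:page1", 1700000500): 3,    # streamed since last batch
              ("clicks:page1", 1699999000): 7}    # already in batch view
print(serve("clicks:page1", batch_view, speed_view,
            batch_horizon=1700000000))  # 103
```

The "two codebases" downside is visible even here: the counting logic exists once in the batch job and again in the speed layer, and they must agree.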

Kappa Architecture

Streaming only. Reprocess by replaying the event log (Kafka retention). Simpler to maintain but requires robust stream processing infrastructure.
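The key enabler is an append-only log with offset-based replay. A minimal in-memory stand-in for Kafka retention shows the reprocessing pattern: deploy new logic, replay from offset 0, and the corrected results simply replace the old ones.

```python
class EventLog:
    """Append-only log with offset-based replay, mimicking Kafka retention.
    In Kappa, reprocessing means rerunning the stream job from offset 0."""
    def __init__(self):
        self._events = []

    def append(self, event):
        self._events.append(event)
        return len(self._events) - 1   # offset of the appended event

    def replay(self, from_offset=0):
        yield from self._events[from_offset:]

log = EventLog()
for e in ["signup", "purchase", "refund"]:
    log.append(e)

# v1 of the job counted purchases; v2 fixes a bug by also counting refunds.
v1 = sum(1 for e in log.replay() if e == "purchase")
v2 = sum(1 for e in log.replay() if e in ("purchase", "refund"))
print(v1, v2)  # 1 2
```

This only works if the log retains events long enough to replay (Kafka retention or tiered storage), which is part of the "robust infrastructure" requirement.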

Medallion with Streaming

Bronze (raw events) → Silver (cleaned, deduplicated) → Gold (aggregated). Can be implemented with streaming (Flink/Spark Structured Streaming) or batch (dbt) at each layer.
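The layer boundaries are just transformations, which is why each hop can be either a streaming job or a batch job. A plain-Python sketch of the Silver and Gold steps (field names `event_id`, `user`, `amount` are illustrative):

```python
def to_silver(bronze):
    """Silver: drop malformed rows and deduplicate by event_id."""
    seen, out = set(), []
    for e in bronze:
        if "event_id" not in e or "amount" not in e:
            continue                 # malformed: missing required field
        if e["event_id"] in seen:
            continue                 # duplicate delivery
        seen.add(e["event_id"])
        out.append(e)
    return out

def to_gold(silver):
    """Gold: aggregate for consumption (total amount per user)."""
    totals = {}
    for e in silver:
        totals[e["user"]] = totals.get(e["user"], 0) + e["amount"]
    return totals

bronze = [
    {"event_id": 1, "user": "u1", "amount": 10},
    {"event_id": 1, "user": "u1", "amount": 10},  # duplicate
    {"event_id": 2, "user": "u2"},                # malformed: no amount
    {"event_id": 3, "user": "u1", "amount": 5},
]
print(to_gold(to_silver(bronze)))  # {'u1': 15}
```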

Tool Landscape

| Tool | Type | Strengths |
| --- | --- | --- |
| Apache Kafka | Event streaming platform | Durable log, ecosystem (Connect, Streams, ksqlDB) |
| Apache Flink | Stream processor | True streaming, exactly-once, complex event processing |
| Spark Structured Streaming | Micro-batch streaming | Unified batch+stream API, wide adoption |
| Kafka Streams | Lightweight stream library | No separate cluster needed, Java/Kotlin native |
| AWS Kinesis | Managed streaming | AWS-native, serverless option (Data Firehose) |
| GCP Dataflow | Managed Beam runner | Auto-scaling, unified batch+stream |
| Apache Beam | Unified programming model | Write once, run on Flink/Spark/Dataflow |

Decision Matrix

| Your requirement | Recommended approach |
| --- | --- |
| Latency > 1 hour acceptable | Batch |
| Latency 1–15 minutes | Micro-batch (Spark Structured Streaming) |
| Latency < 1 second | True streaming (Flink, Kafka Streams) |
| Both historical and live analysis | Lambda or Kappa |
| Budget-constrained | Batch first, add streaming only where needed |
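The matrix can be read as a simple decision function. A sketch with the thresholds taken from the rows above (they are this article's rules of thumb, not universal constants):

```python
def recommend(max_latency_s, budget_tight=False, needs_historical_and_live=False):
    """Map a latency requirement (seconds) to an approach, following
    the decision matrix. Constraint flags take precedence over latency."""
    if needs_historical_and_live:
        return "lambda or kappa"
    if budget_tight:
        return "batch first, add streaming where needed"
    if max_latency_s > 3600:
        return "batch"
    if max_latency_s >= 60:          # roughly the 1-15 minute band
        return "micro-batch (Spark Structured Streaming)"
    return "true streaming (Flink, Kafka Streams)"

print(recommend(7200))   # batch
print(recommend(300))    # micro-batch (Spark Structured Streaming)
print(recommend(0.5))    # true streaming (Flink, Kafka Streams)
```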
