
Real-Time Analytics: Engines, Architectures & Trade-offs

#analytics #real-time #streaming #architecture #data-engineering

Real-time analytics means different things to different teams. For some it is sub-second dashboards; for others it is anomaly detection on a streaming pipeline. The key is matching your latency requirements to the right engine and architecture pattern, without over-engineering.

Latency Tier Taxonomy

| Tier | Latency | Example Use Case | Typical Engine | Data Freshness |
|---|---|---|---|---|
| Hard real-time | < 100ms | Fraud detection, ad bidding | Custom (C++/Rust), Flink | Event-level |
| Soft real-time | 100ms - 2s | Live dashboards, monitoring | ClickHouse, Druid, Pinot | Seconds |
| Near real-time | 2s - 60s | Operational reporting | Materialized views, Rockset | Seconds to minutes |
| Micro-batch | 1 - 15 min | KPI dashboards, alerting | Spark Structured Streaming | Minutes |
| Batch | 15 min - 24h | Daily reports, ML training | Spark, dbt, BigQuery | Hours |

Most teams overestimate their latency needs. If your dashboard refreshes every 30 seconds, you do not need a sub-100ms engine. Start by defining the actual business SLA before choosing technology.
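Working from the SLA toward the tier table, the selection logic is mechanical: pick the slowest (and usually cheapest) tier whose worst-case latency still meets the business requirement. A minimal sketch, with tier names and upper bounds taken from the table above (the helper name `cheapest_tier` is illustrative, not from any library):

```python
# Upper bound of each tier's latency range, in seconds, from the taxonomy table.
TIERS = [
    ("hard real-time", 0.1),
    ("soft real-time", 2.0),
    ("near real-time", 60.0),
    ("micro-batch", 15 * 60.0),
    ("batch", 24 * 3600.0),
]

def cheapest_tier(sla_seconds: float) -> str:
    """Return the slowest tier whose worst-case latency still meets the SLA."""
    chosen = TIERS[0][0]  # fall back to hard real-time for very tight SLAs
    for name, upper_bound in TIERS:
        if upper_bound <= sla_seconds:
            chosen = name  # a slower, cheaper tier still satisfies the SLA
    return chosen

print(cheapest_tier(30))    # 30-second dashboard refresh -> soft real-time
print(cheapest_tier(0.05))  # ad-bidding style budget -> hard real-time
```

Note that the 30-second dashboard from the paragraph above lands in the soft real-time tier, not the hard real-time one.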

OLAP Engine Comparison

| Feature | ClickHouse | Apache Druid | Apache Pinot | Rockset | DuckDB |
|---|---|---|---|---|---|
| Architecture | Shared-nothing MPP | Pre-aggregation + segments | Segments + real-time table | Converged index (cloud) | Embedded single-node |
| Ingestion model | Batch + Kafka | Batch + Kafka/Kinesis | Batch + Kafka | Auto-ingest (S3, Kafka, DynamoDB) | File scan / attach |
| Query latency (p99) | 50ms - 2s | 200ms - 5s | 100ms - 3s | 50ms - 1s | 10ms - 30s (local) |
| Concurrency | High (100s QPS) | High (100s QPS) | Very high (1000s QPS) | High | Low (single-user) |
| SQL support | Full SQL | SQL (Druid SQL) | SQL (multi-stage) | Full SQL | Full SQL |
| Join capability | Strong (recent versions) | Limited | Limited | Strong | Strong |
| Scaling model | Add nodes (self-managed or cloud) | Add historicals | Add servers | Serverless auto-scale | Vertical only |
| Cloud offering | ClickHouse Cloud | Imply, Confluent | StarTree | Rockset (OpenAI acquired 2024) | MotherDuck |
| Best for | General OLAP, high throughput | Time-series + pre-agg | User-facing analytics at scale | Low-latency on semi-structured | Local analytics, prototyping |
| License | Apache 2.0 | Apache 2.0 | Apache 2.0 | Proprietary | MIT |

Architecture Pattern: Kappa (Streaming-First)

┌──────────┐     ┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│  Events  │────▶│  Message     │────▶│  Stream      │────▶│  OLAP        │
│  Sources │     │  Broker      │     │  Processor   │     │  Engine      │
│          │     │  (Kafka)     │     │  (Flink)     │     │ (ClickHouse) │
└──────────┘     └──────────────┘     └──────────────┘     └──────┬───────┘
                                                                  │
                                                           ┌──────▼───────┐
                                                           │  Dashboard   │
                                                           │  / API       │
                                                           └──────────────┘
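The stream-processor stage in the pipeline above typically performs keyed, windowed aggregation before writing rollups to the OLAP engine. A toy pure-Python sketch of a tumbling count window (event shape and window size are illustrative; a real Flink job additionally handles out-of-order events, watermarks, and checkpointed state):

```python
from collections import defaultdict

def tumbling_counts(events, window_s=10):
    """Bucket (timestamp, key) events into tumbling windows, counting per key.

    A stand-in for the stream-processor stage of the Kappa pipeline; the
    resulting per-window rollups are what would be written to the OLAP engine.
    """
    windows = defaultdict(lambda: defaultdict(int))
    for ts, key in events:
        window_start = (ts // window_s) * window_s  # snap to window boundary
        windows[window_start][key] += 1
    return {w: dict(counts) for w, counts in sorted(windows.items())}

events = [(3, "click"), (7, "view"), (9, "click"), (14, "click")]
print(tumbling_counts(events))
# {0: {'click': 2, 'view': 1}, 10: {'click': 1}}
```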

Architecture Pattern: Lambda (Batch + Speed)

                        ┌──────────────────────────────────────┐
                        │          Batch Layer                 │
┌──────────┐            │  (Spark → Iceberg → Query Engine)    │
│  Events  │───┬───────▶│                                      │──┐
│  Sources │   │        └──────────────────────────────────────┘  │   ┌───────────┐
└──────────┘   │                                                  ├──▶│  Serving  │
               │        ┌──────────────────────────────────────┐  │   │  Layer    │
               └───────▶│          Speed Layer                 │──┘   └───────────┘
                        │  (Kafka → Flink → ClickHouse)        │
                        └──────────────────────────────────────┘
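The serving layer's job in the Lambda pattern is to merge the authoritative batch view with the fresher speed-layer deltas at read time. A minimal sketch using plain dicts of key → count (illustrative; production serving layers perform this merge per query against both stores):

```python
def merge_views(batch_view, speed_view):
    """Serving-layer merge for the Lambda pattern.

    The batch view is authoritative for everything up to the last batch run;
    the speed view contributes deltas for events streamed in since then.
    """
    merged = dict(batch_view)
    for key, delta in speed_view.items():
        merged[key] = merged.get(key, 0) + delta
    return merged

batch = {"page_a": 1000, "page_b": 500}  # computed by the last Spark run
speed = {"page_a": 12, "page_c": 3}      # streamed since that run
print(merge_views(batch, speed))
# {'page_a': 1012, 'page_b': 500, 'page_c': 3}
```

When the next batch run completes, the speed-layer state covering that period is discarded, which is exactly the operational complexity Kappa architectures avoid.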

Pre-Aggregation vs On-the-Fly: Decision Matrix

| Factor | Pre-Aggregation | On-the-Fly | Hybrid |
|---|---|---|---|
| Query latency | Very low (pre-computed) | Medium to high | Low |
| Data freshness | Depends on refresh cycle | Real-time | Mixed |
| Storage cost | Higher (materialized cubes) | Lower | Medium |
| Query flexibility | Limited to pre-defined dimensions | Unlimited ad-hoc | Ad-hoc + fast defaults |
| Complexity | Cube management, invalidation | Query optimization | Both concerns |
| Best for | Known KPIs, executive dashboards | Exploration, drill-down | Production + analysis |
| Example tools | Druid pre-agg, Cube.js, dbt metrics | ClickHouse, DuckDB | Materialized views + raw scan |
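The flexibility trade-off in the matrix is easiest to see side by side: a pre-aggregated cube answers only the dimensions it was built over, while an on-the-fly scan accepts any predicate at the cost of touching raw rows. A toy sketch (row shape and dimension names are illustrative):

```python
# Raw events, as an OLAP engine would store them.
rows = [
    {"country": "DE", "device": "mobile", "revenue": 10},
    {"country": "DE", "device": "desktop", "revenue": 20},
    {"country": "US", "device": "mobile", "revenue": 5},
]

# Pre-aggregation: materialize a cube over known dimensions at ingest time.
cube = {}
for r in rows:
    key = (r["country"],)  # only the pre-defined dimension is queryable
    cube[key] = cube.get(key, 0) + r["revenue"]

# On-the-fly: scan raw rows at query time -- any predicate works.
def query(predicate):
    return sum(r["revenue"] for r in rows if predicate(r))

print(cube[("DE",)])                             # fast lookup, fixed dimensions
print(query(lambda r: r["device"] == "mobile"))  # ad-hoc slice needs the raw scan
```

The hybrid column in the matrix corresponds to keeping both: the cube serves known KPIs, and drill-downs fall back to the raw scan.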

Choosing Your Architecture

What is your query concurrency requirement?

├── < 10 concurrent users
│   ├── Data fits on one machine? → DuckDB / MotherDuck
│   └── Distributed needed? → ClickHouse (simple deploy)
│
├── 10 - 100 concurrent users
│   ├── Mostly time-series? → Apache Druid
│   └── General OLAP? → ClickHouse Cloud
│
└── 100+ concurrent (user-facing)
    ├── Sub-second p99 required? → Apache Pinot
    └── Semi-structured / document queries? → ClickHouse or Rockset
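The decision tree above can be codified for documentation or tooling purposes. A hedged sketch (function and parameter names are hypothetical; real engine choices also weigh cost, operational experience, and ecosystem fit):

```python
def pick_engine(concurrent_users: int, fits_one_machine: bool = False,
                time_series: bool = False, sub_second_p99: bool = False) -> str:
    """Walk the concurrency-first decision tree from the article."""
    if concurrent_users < 10:
        return "DuckDB / MotherDuck" if fits_one_machine else "ClickHouse"
    if concurrent_users <= 100:
        return "Apache Druid" if time_series else "ClickHouse Cloud"
    # 100+ concurrent, user-facing analytics
    return "Apache Pinot" if sub_second_p99 else "ClickHouse or Rockset"

print(pick_engine(5, fits_one_machine=True))        # DuckDB / MotherDuck
print(pick_engine(500, sub_second_p99=True))        # Apache Pinot
```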

Key Metrics for Real-Time Systems

| Metric | Target | Monitoring Approach |
|---|---|---|
| Query p50 / p95 / p99 latency | Depends on tier | Engine metrics + APM |
| Ingestion lag | < freshness SLA | Kafka consumer lag |
| Query error rate | < 0.1% | Engine logs, alerting |
| Concurrent query count | Within capacity | Connection pool monitoring |
| Storage growth rate | Predictable | Capacity planning dashboards |
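Two of these metrics reduce to simple arithmetic worth making explicit: a percentile is a rank in the sorted latency samples, and consumer lag is the distance between the log head and the committed offset. A sketch using the nearest-rank method over raw samples (production systems read histograms from engine metrics instead; the numbers below are illustrative):

```python
def percentile(samples, p):
    """Nearest-rank percentile over raw latency samples, in seconds."""
    ordered = sorted(samples)
    rank = max(0, round(p / 100 * len(ordered)) - 1)  # 1-based rank -> 0-based index
    return ordered[rank]

latencies = [0.04, 0.05, 0.06, 0.07, 0.09, 0.12, 0.15, 0.30, 0.45, 1.20]
p99 = percentile(latencies, 99)  # dominated by the slowest sample

# Ingestion lag: how far the pipeline's committed offset trails the log head.
latest_offset, committed_offset = 10_500, 10_420
consumer_lag = latest_offset - committed_offset

print(p99, consumer_lag)
```

Note how the p99 is pulled entirely by the single 1.2s outlier, which is why the table tracks tail percentiles rather than averages.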
