Real-Time Analytics: Engines, Architectures & Trade-offs
#analytics #real-time #streaming #architecture #data-engineering
Real-time analytics means different things to different teams. For some it is sub-second dashboards; for others it is anomaly detection on a streaming pipeline. The key is matching your latency requirements to the right engine and architecture pattern, without over-engineering.
Latency Tier Taxonomy
| Tier | Latency | Example Use Case | Typical Engine | Data Freshness |
|---|---|---|---|---|
| Hard real-time | < 100ms | Fraud detection, ad bidding | Custom (C++/Rust), Flink | Event-level |
| Soft real-time | 100ms - 2s | Live dashboards, monitoring | ClickHouse, Druid, Pinot | Seconds |
| Near real-time | 2s - 60s | Operational reporting | Materialized views, Rockset | Seconds to minutes |
| Micro-batch | 1 - 15 min | KPI dashboards, alerting | Spark Structured Streaming | Minutes |
| Batch | 15 min - 24h | Daily reports, ML training | Spark, dbt, BigQuery | Hours |
Most teams overestimate their latency needs. If your dashboard refreshes every 30 seconds, you do not need a sub-100ms engine. Start by defining the actual business SLA before choosing technology.
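The tier boundaries above can be encoded as a simple lookup, which makes the "define the SLA first" advice concrete. This is an illustrative sketch; the function name and the decision to key off end-to-end milliseconds are assumptions, and the thresholds simply mirror the table.

```python
def latency_tier(sla_ms: float) -> str:
    """Map an end-to-end latency SLA (milliseconds) to a tier from the table."""
    if sla_ms < 100:
        return "hard real-time"
    if sla_ms < 2_000:
        return "soft real-time"
    if sla_ms < 60_000:
        return "near real-time"
    if sla_ms < 15 * 60_000:
        return "micro-batch"
    return "batch"

# A dashboard that refreshes every 30 seconds only needs a near real-time stack:
print(latency_tier(30_000))  # → near real-time
```

Running the SLA through a table like this before evaluating engines keeps the conversation anchored to the business requirement rather than to vendor benchmarks.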
OLAP Engine Comparison
| Feature | ClickHouse | Apache Druid | Apache Pinot | Rockset | DuckDB |
|---|---|---|---|---|---|
| Architecture | Shared-nothing MPP | Pre-aggregation + segments | Segments + real-time table | Converged index (cloud) | Embedded single-node |
| Ingestion model | Batch + Kafka | Batch + Kafka/Kinesis | Batch + Kafka | Auto-ingest (S3, Kafka, DynamoDB) | File scan / attach |
| Query latency (p99) | 50ms - 2s | 200ms - 5s | 100ms - 3s | 50ms - 1s | 10ms - 30s (local) |
| Concurrency | High (100s QPS) | High (100s QPS) | Very high (1000s QPS) | High | Low (single-user) |
| SQL support | Full SQL | SQL (Druid SQL) | SQL (multi-stage) | Full SQL | Full SQL |
| Join capability | Strong (recent versions) | Limited | Limited | Strong | Strong |
| Scaling model | Add nodes (self-managed or cloud) | Add historicals | Add servers | Serverless auto-scale | Vertical only |
| Cloud offering | ClickHouse Cloud | Imply Polaris | StarTree | None (Rockset was acquired by OpenAI in 2024 and the service was discontinued) | MotherDuck |
| Best for | General OLAP, high throughput | Time-series + pre-agg | User-facing analytics at scale | Low-latency on semi-structured | Local analytics, prototyping |
| License | Apache 2.0 | Apache 2.0 | Apache 2.0 | Proprietary | MIT |
Architecture Pattern: Kappa (Streaming-First)
┌──────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ Events │────▶│ Message │────▶│ Stream │────▶│ OLAP │
│ Sources │ │ Broker │ │ Processor │ │ Engine │
│ │ │ (Kafka) │ │ (Flink) │ │ (ClickHouse) │
└──────────┘ └──────────────┘ └──────────────┘ └──────┬───────┘
│
┌──────▼───────┐
│ Dashboard │
│ / API │
└──────────────┘
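The stream-processor stage of the Kappa flow above can be sketched in miniature: events arrive from the broker, get bucketed into tumbling windows, and the per-window aggregates are what land in the OLAP engine. This is a toy model under assumed names (`tumbling_counts`, a 1-second window); a real deployment would use Kafka consumer groups and Flink's window operators.

```python
from collections import defaultdict

def tumbling_counts(events, window_ms=1_000):
    """Count events per (window_start, key) -- the Flink-style aggregation
    step of the Kappa pipeline, before results are written to ClickHouse."""
    counts = defaultdict(int)
    for ts_ms, key in events:
        window_start = ts_ms - (ts_ms % window_ms)  # tumbling window bucket
        counts[(window_start, key)] += 1
    return dict(counts)

events = [(100, "click"), (250, "click"), (1_100, "view"), (1_900, "click")]
print(tumbling_counts(events))
# {(0, 'click'): 2, (1000, 'view'): 1, (1000, 'click'): 1}
```

The appeal of Kappa is that this one code path serves both replay (reprocess the log from offset zero) and live traffic, so there is no batch/speed divergence to reconcile.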
Architecture Pattern: Lambda (Batch + Speed)
┌──────────────────────────────────────┐
│ Batch Layer │
┌──────────┐ │ (Spark → Iceberg → Query Engine) │
│ Events │───┬──────▶│ │──┐
│ Sources │ │ └──────────────────────────────────────┘ │ ┌──────────┐
└──────────┘ │ ├─▶│ Serving │
│ ┌──────────────────────────────────────┐ │ │ Layer │
└──────▶│ Speed Layer │──┘ └──────────┘
│ (Kafka → Flink → ClickHouse) │
└──────────────────────────────────────┘
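The serving layer in the Lambda diagram has one essential job: merge the authoritative batch view with the speed layer's results for events that arrived after the last batch cutoff. A minimal sketch, assuming additive metrics (counts) and illustrative names:

```python
def merge_views(batch_view: dict, speed_view: dict) -> dict:
    """Lambda serving layer: batch results are authoritative up to the batch
    cutoff; the speed layer contributes only post-cutoff increments."""
    merged = dict(batch_view)
    for key, count in speed_view.items():
        merged[key] = merged.get(key, 0) + count
    return merged

batch = {"us": 1_000, "eu": 800}   # e.g. Spark over Iceberg, as of last run
speed = {"us": 12, "apac": 3}      # e.g. Flink over post-cutoff events
print(merge_views(batch, speed))   # {'us': 1012, 'eu': 800, 'apac': 3}
```

Note the hidden cost Lambda carries: the merge only stays correct if the batch and speed layers implement the same metric definitions, which is exactly the duplication Kappa avoids.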
Pre-Aggregation vs On-the-Fly: Decision Matrix
| Factor | Pre-Aggregation | On-the-Fly | Hybrid |
|---|---|---|---|
| Query latency | Very low (pre-computed) | Medium to high | Low |
| Data freshness | Depends on refresh cycle | Real-time | Mixed |
| Storage cost | Higher (materialized cubes) | Lower | Medium |
| Query flexibility | Limited to pre-defined dimensions | Unlimited ad-hoc | Ad-hoc + fast defaults |
| Complexity | Cube management, invalidation | Query optimization | Both concerns |
| Best for | Known KPIs, executive dashboards | Exploration, drill-down | Production + analysis |
| Example tools | Druid pre-agg, Cube.js, dbt metrics | ClickHouse, DuckDB | Materialized views + raw scan |
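The hybrid column of the matrix boils down to a routing decision at query time: serve known dimensions from the pre-aggregated cube, and fall back to an on-the-fly scan for ad-hoc ones. A hedged sketch with hypothetical names (`query_metric`, a dict standing in for the cube, a list of rows standing in for raw storage):

```python
def query_metric(dimension_key: str, preagg: dict, raw_rows: list):
    """Hybrid read path: cube hit -> pre-computed answer; cube miss ->
    on-the-fly aggregation over raw rows. Returns (value, source)."""
    if dimension_key in preagg:
        return preagg[dimension_key], "preagg"
    total = sum(r["value"] for r in raw_rows if r["dim"] == dimension_key)
    return total, "scan"

cube = {"country=US": 42_000}  # refreshed on the cube's own schedule
raw = [{"dim": "device=ios", "value": 7}, {"dim": "device=ios", "value": 5}]
print(query_metric("country=US", cube, raw))   # (42000, 'preagg')
print(query_metric("device=ios", cube, raw))   # (12, 'scan')
```

This is also where the "both concerns" complexity in the matrix shows up: the cube branch inherits invalidation and refresh problems, while the scan branch inherits query-optimization ones.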
Choosing Your Architecture
What is your query concurrency requirement?
├── < 10 concurrent users
│ ├── Data fits on one machine? → DuckDB / MotherDuck
│ └── Distributed needed? → ClickHouse (simple deploy)
│
├── 10 - 100 concurrent users
│ ├── Mostly time-series? → Apache Druid
│ └── General OLAP? → ClickHouse Cloud
│
└── 100+ concurrent (user-facing)
├── Sub-second p99 required? → Apache Pinot
└── Semi-structured / document queries? → ClickHouse or Rockset
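The decision tree above can be flattened into a function, which is handy for documenting the reasoning in an architecture decision record. This is a simplification: parameter names are illustrative, the thresholds are rough guidance rather than hard limits, and real selection would weigh cost and operational experience too.

```python
def pick_engine(concurrent_users: int, fits_one_machine: bool = True,
                time_series: bool = False, sub_second_p99: bool = False) -> str:
    """Encode the engine-selection tree; thresholds are rough guidance."""
    if concurrent_users < 10:
        return "DuckDB / MotherDuck" if fits_one_machine else "ClickHouse"
    if concurrent_users <= 100:
        return "Apache Druid" if time_series else "ClickHouse Cloud"
    # 100+ concurrent, user-facing
    return "Apache Pinot" if sub_second_p99 else "ClickHouse or Rockset"

print(pick_engine(5))                            # → DuckDB / MotherDuck
print(pick_engine(50, time_series=True))         # → Apache Druid
print(pick_engine(500, sub_second_p99=True))     # → Apache Pinot
```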
Key Metrics for Real-Time Systems
| Metric | Target | Monitoring Approach |
|---|---|---|
| Query p50 / p95 / p99 latency | Depends on tier | Engine metrics + APM |
| Ingestion lag | < freshness SLA | Kafka consumer lag |
| Query error rate | < 0.1% | Engine logs, alerting |
| Concurrent query count | Within capacity | Connection pool monitoring |
| Storage growth rate | Predictable | Capacity planning dashboards |
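Two of the metrics above are easy to compute yourself when engine-native dashboards fall short. A minimal sketch, assuming you already have raw latency samples and Kafka-style offsets available; the nearest-rank percentile here is a simplification compared to the interpolated percentiles most APM tools report:

```python
import math

def percentile(samples: list, p: float) -> float:
    """Nearest-rank percentile -- adequate for dashboard-level SLO checks."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

def consumer_lag(end_offsets: dict, committed: dict) -> dict:
    """Ingestion lag per partition: messages produced but not yet consumed."""
    return {tp: end_offsets[tp] - committed.get(tp, 0) for tp in end_offsets}

latencies_ms = [40, 55, 60, 70, 90, 120, 450, 80, 65, 75]
print(percentile(latencies_ms, 99))                      # → 450
print(consumer_lag({"t-0": 1_050}, {"t-0": 1_000}))      # → {'t-0': 50}
```

Alerting on p99 rather than the mean matters here: the mean of the sample above hides the 450ms outlier that a user actually experienced.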
Resources
- ClickHouse Documentation
- Apache Pinot — Real-Time OLAP
- Dunith Dhanushka — Kappa vs Lambda Architecture
- Druid vs ClickHouse vs Pinot — StarTree Comparison