# Feature Stores: The Missing Infrastructure for ML

#machine-learning #feature-store #data-engineering #mlops
Feature stores solve one of the most underestimated problems in ML: getting consistent, fresh, and correct features to models in both training and serving. Without one, teams rebuild the same transformations, introduce training-serving skew, and waste months on plumbing.
## The Core Problem

```
WITHOUT Feature Store:              WITH Feature Store:

Training Pipeline                   Training Pipeline
└── SQL query A                     └── Feature Store (offline)

Serving Pipeline                    Serving Pipeline
└── Python transform B              └── Feature Store (online)
    (different logic!)                  (same logic, same data)

Result: Training-Serving Skew       Result: Consistency guaranteed
```
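The skew on the left almost always comes from implementing the same feature logic twice, once in SQL and once in application code. The fix can be sketched in plain Python (the `avg_order_value` feature is illustrative, not from any particular library): define each transformation exactly once and call it from both paths.

```python
from statistics import mean

def avg_order_value(order_amounts: list[float]) -> float:
    """Single definition of the feature, shared by training and serving."""
    return mean(order_amounts) if order_amounts else 0.0

# Training path: compute the feature over historical orders.
training_feature = avg_order_value([20.0, 30.0, 40.0])

# Serving path: the SAME function runs on the live request payload,
# so there is no "SQL query A" vs "Python transform B" divergence.
serving_feature = avg_order_value([20.0, 30.0, 40.0])

assert training_feature == serving_feature  # consistency by construction
```

A feature store generalizes this idea: the registered transformation is the single source of truth, and both the offline and online stores are populated from it.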
## Online vs Offline: Two Serving Patterns
| Dimension | Offline Store | Online Store |
|---|---|---|
| Purpose | Training data, batch scoring | Real-time inference |
| Latency | Seconds to minutes | < 10 ms |
| Storage | Data lake / warehouse (Parquet, Delta) | Key-value store (Redis, DynamoDB) |
| Data Volume | Months/years of history | Latest values only |
| Access Pattern | Full scan, time-travel queries | Point lookup by entity key |
| Freshness | Batch (hourly/daily) | Near real-time (streaming) |
| Cost Profile | Storage-heavy, compute on read | Compute-heavy, storage is small |
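The two access patterns in the table can be sketched with in-memory dicts standing in for a warehouse and a key-value store (all names here are illustrative): the offline store keeps timestamped history and answers "as of" queries, while the online store keeps only the latest value per entity key.

```python
from datetime import datetime

# Offline store: full history per entity, supports time-travel reads.
offline_store = {
    "user_42": [
        (datetime(2024, 1, 1), {"purchases_7d": 3}),
        (datetime(2024, 1, 8), {"purchases_7d": 5}),
    ],
}

# Online store: latest values only, point lookup by entity key.
online_store = {"user_42": {"purchases_7d": 5}}

def get_historical(entity: str, as_of: datetime) -> dict:
    """Offline read: scan history, return the last row at or before as_of."""
    rows = [feats for ts, feats in offline_store[entity] if ts <= as_of]
    return rows[-1] if rows else {}

def get_online(entity: str) -> dict:
    """Online read: O(1) key lookup, no history available."""
    return online_store[entity]

# Training joins against the value that was current at label time...
assert get_historical("user_42", datetime(2024, 1, 5)) == {"purchases_7d": 3}
# ...while serving always sees the freshest value.
assert get_online("user_42") == {"purchases_7d": 5}
```

The point-in-time read is what prevents label leakage: training must see the feature value as it existed when the label was generated, not the current one.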
## Tool Comparison
| Feature | Feast | Tecton | Hopsworks | Vertex Feature Store |
|---|---|---|---|---|
| Type | Open source | Commercial | Open core | Managed (GCP) |
| Online Store | Redis, DynamoDB, etc. | Managed | RonDB | Bigtable |
| Offline Store | BigQuery, Redshift, etc. | Managed | Hudi on S3 | BigQuery |
| Streaming | Via Spark/Flink | Native | Native | Dataflow |
| Feature Transformation | Limited (push-based) | Native (Python SDK) | Native (PySpark/SQL) | Dataflow pipelines |
| Monitoring | Basic | Built-in drift detection | Built-in | Basic |
| Registry | File or DB-backed | Managed | Managed | Managed |
| Cost | Free + infra | $$$$ (enterprise) | Free tier + managed | GCP pricing |
| Best For | Simple needs, multi-cloud | High-scale real-time ML | Full ML platform | GCP-native teams |
## Feature Store Architecture

```
 Data Sources             Feature Store               Consumers

+------------------+    +----------------------+    +----------------+
| Event Streams    |--->| Transformation       |    | Training       |
| (Kafka, Kinesis) |    | Engine               |    | Pipelines      |
+------------------+    |    |          |      |    +----------------+
                        |    v          v      |
+------------------+    | +---------+ +------+ |    +----------------+
| Databases        |--->| | Offline | |Online| |--->| Real-time      |
| (Postgres, etc.) |    | | Store   | |Store | |    | Serving        |
+------------------+    | +---------+ +------+ |    +----------------+
                        |                      |
+------------------+    | +------------------+ |    +----------------+
| Data Warehouse   |--->| | Feature Registry | |--->| Batch          |
| (BigQuery, etc.) |    | | & Metadata       | |    | Scoring        |
+------------------+    +----------------------+    +----------------+
```
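The transformation engine's job can be reduced to a single materialization step: append every computed feature row to the offline store and upsert the latest row per entity key into the online store. A toy sketch (real systems do this with Spark or Flink jobs, and the store types here are stand-ins):

```python
from datetime import datetime

offline_store: list[dict] = []      # append-only history (warehouse role)
online_store: dict[str, dict] = {}  # latest row per entity key (KV role)

def materialize(entity_key: str, features: dict, ts: datetime) -> None:
    """One write path feeds both stores, keeping them consistent."""
    row = {"entity": entity_key, "ts": ts, **features}
    offline_store.append(row)        # full history, for training
    online_store[entity_key] = row   # upsert: latest only, for serving

materialize("user_42", {"purchases_7d": 3}, datetime(2024, 1, 1))
materialize("user_42", {"purchases_7d": 5}, datetime(2024, 1, 8))

assert len(offline_store) == 2                       # history retained
assert online_store["user_42"]["purchases_7d"] == 5  # latest wins
```

Because both stores are written from the same transformation output, the consistency guarantee from the first diagram falls out of the architecture rather than depending on developer discipline.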
## Adoption Decision Matrix
| Signal | Score (1-5) | Weight |
|---|---|---|
| Number of ML models in production | ___ | 3x |
| Teams sharing features across models | ___ | 3x |
| Training-serving skew incidents | ___ | 2x |
| Time spent on feature engineering plumbing | ___ | 2x |
| Real-time serving requirements | ___ | 2x |
| Data freshness requirements (streaming) | ___ | 1x |
Score interpretation: score each signal 1-5, multiply by its weight, and sum (maximum 65). Weighted total > 40: strong need. 25-40: worth evaluating. < 25: a feature store is premature.
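The matrix is just a weighted sum; a small helper makes it concrete. The example scores below are invented for illustration:

```python
# Weights from the adoption decision matrix above.
WEIGHTS = {
    "models_in_production": 3,
    "teams_sharing_features": 3,
    "skew_incidents": 2,
    "plumbing_time": 2,
    "realtime_serving": 2,
    "freshness_needs": 1,
}

def adoption_score(scores: dict[str, int]) -> tuple[int, str]:
    """Weighted total (max 65 with all 5s) plus its interpretation band."""
    total = sum(WEIGHTS[k] * v for k, v in scores.items())
    if total > 40:
        verdict = "strong need"
    elif total >= 25:
        verdict = "evaluate"
    else:
        verdict = "premature"
    return total, verdict

# Hypothetical team: many shared models, moderate everything else.
example = {
    "models_in_production": 5, "teams_sharing_features": 4,
    "skew_incidents": 3, "plumbing_time": 3,
    "realtime_serving": 2, "freshness_needs": 2,
}
print(adoption_score(example))  # → (45, 'strong need')
```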
## When NOT to Build a Feature Store
- You have fewer than 3 models in production
- All your models are batch-only (no real-time serving)
- A single team owns all ML and does not share features
- Your features are simple lookups with no transformation