A model that was 95% accurate at deployment can silently degrade to 70% without anyone noticing. Traditional software monitoring (uptime, latency, error rates) does not catch ML-specific failures such as drift or data quality regressions. Model monitoring is the practice of detecting when your model stops working before your users do.
Drift Type Taxonomy
```
Model Degradation
├── Data Drift (input distribution changes)
│   ├── Covariate shift (feature distributions change)
│   ├── Prior probability shift (target distribution changes)
│   └── Feature schema drift (new categories, missing features)
├── Concept Drift (relationship between X and Y changes)
│   ├── Gradual drift (slow trend over months)
│   ├── Sudden drift (abrupt change, e.g., policy change)
│   ├── Recurring drift (seasonal patterns)
│   └── Incremental drift (small accumulating changes)
├── Prediction Drift (output distribution changes)
│   └── Often the first visible symptom of data or concept drift
└── Upstream Data Quality Issues
    ├── Missing values increase
    ├── Schema changes from data providers
    └── ETL pipeline failures
```
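Covariate shift, the most common flavor of data drift, can be detected by comparing a feature's current distribution against a reference window. As a minimal, dependency-free sketch, here is the two-sample Kolmogorov-Smirnov statistic computed by hand (in practice you would reach for `scipy.stats.ks_2samp`, which also returns a p-value):

```python
from bisect import bisect_right

def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum gap between
    the empirical CDFs of the two samples (0 = identical, 1 = fully
    separated). The gap is maximized at one of the observed values."""
    a, b = sorted(sample_a), sorted(sample_b)
    d = 0.0
    for v in a + b:
        cdf_a = bisect_right(a, v) / len(a)
        cdf_b = bisect_right(b, v) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d
```

Running this per feature over a sliding window of production inputs against the training sample gives a cheap first-line covariate shift signal.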
Monitoring Metric Matrix
| Metric Category | Metric | What It Detects | When to Use |
|---|---|---|---|
| Data Quality | Null rate, type mismatches | Upstream pipeline issues | Always |
| Data Drift | PSI, KS test, Jensen-Shannon | Feature distribution changes | Always |
| Prediction Drift | Output distribution divergence | Model behavior changes | When ground truth is delayed |
| Model Performance | Accuracy, F1, AUC, RMSE | Actual degradation | When ground truth is available |
| Bias/Fairness | Demographic parity, equalized odds | Fairness degradation by cohort | Regulated/high-stakes models |
| Operational | Latency p50/p95/p99, throughput, error rate | Infrastructure/serving issues | Always |
| Business | Revenue impact, conversion change | Business outcome degradation | High-value models |
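The Population Stability Index (PSI) in the data-drift row bins a reference sample into quantile buckets, then measures how much the current sample's bucket mass has moved. A pure-Python sketch (bucket count and the small floor value are illustrative choices, not fixed by the metric):

```python
import math

def psi(expected, actual, buckets=10):
    """Population Stability Index between a reference sample ("expected")
    and a current sample ("actual"), using quantile buckets derived
    from the reference distribution."""
    ref = sorted(expected)
    edges = [ref[int(i * (len(ref) - 1) / buckets)] for i in range(1, buckets)]

    def bucket_fractions(sample):
        counts = [0] * buckets
        for x in sample:
            counts[sum(1 for e in edges if x > e)] += 1
        # floor at a tiny value so the log stays defined for empty buckets
        return [max(c / len(sample), 1e-6) for c in counts]

    p, q = bucket_fractions(expected), bucket_fractions(actual)
    return sum((pi - qi) * math.log(pi / qi) for pi, qi in zip(p, q))
```

Identical distributions score near 0; a strongly shifted sample lands well above the 0.25 action threshold used later in this article.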
Tool Comparison
| Feature | Evidently AI | WhyLabs | Arize | Fiddler | NannyML |
|---|---|---|---|---|---|
| Type | Open source + cloud | Cloud (open source agent) | Cloud | Cloud | Open source + cloud |
| Data Drift | Yes (statistical tests) | Yes (approximate profiles) | Yes | Yes | Yes |
| Concept Drift | Yes | Yes | Yes | Yes | Via performance estimation |
| Performance Estimation | Basic | Basic | Basic | Basic | Advanced (CBPE, no ground truth) |
| Explainability | SHAP integration | SHAP | SHAP, feature importance | Native XAI | Limited |
| LLM Monitoring | Yes (text metrics) | Yes | Yes | Yes | No |
| Alerting | Custom thresholds | Anomaly-based | Anomaly-based | Custom | Custom |
| Pricing | Free (OSS) / paid cloud | Free tier + paid | Paid | Paid | Free (OSS) / paid |
| Best For | Teams wanting OSS control | High-volume streaming | Full-stack observability | Regulated industries | Performance without labels |
Alert Threshold Guidelines
| Metric | Green | Yellow (Investigate) | Red (Action Required) |
|---|---|---|---|
| Data Drift (PSI) | < 0.1 | 0.1 - 0.25 | > 0.25 |
| Null Rate | < baseline + 1% | baseline + 1-5% | > baseline + 5% |
| Prediction Distribution Shift | < 0.05 (KS) | 0.05 - 0.1 | > 0.1 |
| Accuracy Drop | < 2% from baseline | 2-5% from baseline | > 5% from baseline |
| Latency (p99) | < 2x SLA | 2-5x SLA | > 5x SLA |
| Error Rate | < 0.1% | 0.1-1% | > 1% |
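The bands above translate directly into a lookup table of (green, yellow) upper bounds per metric; anything beyond the yellow bound is red. A minimal sketch (the `THRESHOLDS` values below mirror the table, but your own bounds should come from business impact, as discussed next):

```python
# (green upper bound, yellow upper bound); values above yellow are red
THRESHOLDS = {
    "psi": (0.10, 0.25),
    "ks": (0.05, 0.10),
    "error_rate": (0.001, 0.01),
}

def alert_level(metric, value):
    """Map a monitored metric value to a traffic-light alert band."""
    green_max, yellow_max = THRESHOLDS[metric]
    if value < green_max:
        return "green"
    if value <= yellow_max:
        return "yellow"   # investigate
    return "red"          # action required
```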
Monitoring Architecture
```
           Model Serving Layer
                   |
                   v
+------------------+     +------------------+
|  Request Logger  |---->|  Feature Store   |
| (inputs/outputs) |     | (expected ranges)|
+------------------+     +------------------+
         |                        |
         v                        v
+------------------------------------------+
|            Monitoring Platform           |
|  +------------+   +------------------+   |
|  |   Drift    |   |   Performance    |   |
|  |  Detection |   |     Tracking     |   |
|  +------------+   +------------------+   |
|  +------------+   +------------------+   |
|  |    Data    |   | Business Metric  |   |
|  |   Quality  |   |   Correlation    |   |
|  +------------+   +------------------+   |
+------------------------------------------+
                   |
                   v
+------------------+     +------------------+
|   Alert System   |---->|    Retraining    |
| (PagerDuty, etc) | --->| Pipeline Trigger |
+------------------+     +------------------+
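The Request Logger box is the foundation of everything downstream: if inputs and outputs are not captured, nothing else in the diagram has data to work with. A minimal sketch writing JSON Lines, which most monitoring platforms can ingest (the function name and record fields here are illustrative, not any particular platform's schema):

```python
import json
import time
import uuid

def log_prediction(path, features, prediction, latency_ms, model_version):
    """Append one serving request as a JSON Lines record for the
    monitoring platform to consume."""
    record = {
        "request_id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "model_version": model_version,
        "features": features,
        "prediction": prediction,
        "latency_ms": latency_ms,
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
```

In high-throughput services you would batch these writes or ship them asynchronously to a queue rather than touching disk on the request path.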
Strategic Recommendations
- Start with data quality monitoring. Most production ML failures are data pipeline issues, not model issues.
- Monitor predictions when you lack ground truth. Prediction drift is a leading indicator that something changed.
- Set thresholds based on business impact, not statistical purity. A 3% accuracy drop on a spam filter is fine; on a medical diagnosis model it is not.
- Automate retraining triggers, but keep human approval. Fully autonomous retraining loops are risky until you trust your pipeline deeply.
- Log everything. Inputs, outputs, latencies, feature values. You cannot debug what you did not record.
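The "automate triggers, keep human approval" recommendation can be encoded as a gate that never launches retraining directly: severe drift produces an approval request, and only an explicit sign-off produces a trigger. A sketch (the function name, return values, and 0.25 default are illustrative choices, echoing the PSI red threshold above):

```python
def maybe_trigger_retraining(psi_value, auto_threshold=0.25, approved=False):
    """Gate retraining on drift severity plus explicit human sign-off.
    Returns the action to take rather than launching anything itself."""
    if psi_value <= auto_threshold:
        return "no_action"
    if not approved:
        return "request_approval"   # page an owner instead of retraining blindly
    return "trigger_retraining"
```

Keeping the decision as a returned action (rather than a side effect) also makes the gate trivial to unit-test and audit.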
Resources