
ML Model Monitoring: Catching Silent Failures

#machine-learning #monitoring #mlops #observability

A model that was 95% accurate at deployment can silently degrade to 70% without anyone noticing. Traditional software monitoring (uptime, latency, errors) does not catch ML-specific failures. Model monitoring is the practice of detecting when your model stops working, before your users do.

Drift Type Taxonomy

Model Degradation
├── Data Drift (input distribution changes)
│   ├── Covariate shift (feature distributions change)
│   ├── Prior probability shift (target distribution changes)
│   └── Feature schema drift (new categories, missing features)
├── Concept Drift (relationship between X and Y changes)
│   ├── Gradual drift (slow trend over months)
│   ├── Sudden drift (abrupt change, e.g., policy change)
│   ├── Recurring drift (seasonal patterns)
│   └── Incremental drift (small accumulating changes)
├── Prediction Drift (output distribution changes)
│   └── Often the first visible symptom of data or concept drift
└── Upstream Data Quality Issues
    ├── Missing values increase
    ├── Schema changes from data providers
    └── ETL pipeline failures
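Covariate shift, the first branch above, is commonly checked with a two-sample Kolmogorov-Smirnov test comparing a reference window against a live window of one feature. A minimal pure-Python sketch (the function name and window values are illustrative):

```python
import bisect

def ks_statistic(reference, current):
    """Two-sample KS statistic: the maximum gap between the empirical
    CDFs of a reference window and a live window of a single feature.
    A large value suggests covariate shift in that feature."""
    ref = sorted(reference)
    cur = sorted(current)

    def ecdf(sample, x):
        # Fraction of sample values <= x
        return bisect.bisect_right(sample, x) / len(sample)

    # The ECDF gap can only change at observed values, so checking
    # every observed point is sufficient to find the maximum.
    points = sorted(set(ref) | set(cur))
    return max(abs(ecdf(ref, x) - ecdf(cur, x)) for x in points)

# Identical windows -> 0.0; completely disjoint windows -> 1.0
print(ks_statistic([1, 2, 3, 4], [1, 2, 3, 4]))  # 0.0
print(ks_statistic([1, 2], [10, 20]))            # 1.0
```

In production you would use a tested implementation such as `scipy.stats.ks_2samp`, which also returns a p-value; the sketch only shows what the statistic measures.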

Monitoring Metric Matrix

| Metric Category | Metric | What It Detects | When to Use |
|---|---|---|---|
| Data Quality | Null rate, type mismatches | Upstream pipeline issues | Always |
| Data Drift | PSI, KS test, Jensen-Shannon | Feature distribution changes | Always |
| Prediction Drift | Output distribution divergence | Model behavior changes | When ground truth is delayed |
| Model Performance | Accuracy, F1, AUC, RMSE | Actual degradation | When ground truth is available |
| Bias/Fairness | Demographic parity, equalized odds | Fairness degradation by cohort | Regulated/high-stakes models |
| Operational | Latency p50/p95/p99, throughput, error rate | Infrastructure/serving issues | Always |
| Business | Revenue impact, conversion change | Business outcome degradation | High-value models |
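The Population Stability Index (PSI) in the data-drift row compares binned proportions of a reference sample against a live sample. A self-contained sketch, assuming equal-width bins over the combined range (bin count and the epsilon guard are implementation choices):

```python
import math

def psi(expected, actual, bins=10):
    """Population Stability Index between a reference sample (expected)
    and a live sample (actual): sum of (p_i - q_i) * ln(p_i / q_i)
    over equal-width bins. An epsilon floor avoids log(0) on empty bins."""
    lo = min(min(expected), min(actual))
    hi = max(max(expected), max(actual))
    width = (hi - lo) / bins or 1.0
    eps = 1e-6

    def proportions(sample):
        counts = [0] * bins
        for x in sample:
            i = min(int((x - lo) / width), bins - 1)
            counts[i] += 1
        return [max(c / len(sample), eps) for c in counts]

    p = proportions(expected)
    q = proportions(actual)
    return sum((pi - qi) * math.log(pi / qi) for pi, qi in zip(p, q))
```

Identical distributions score 0; a heavily shifted live window scores well above the 0.25 "action required" threshold used later in this post. Binning strategy matters: quantile bins on the reference sample are a common alternative to equal-width bins.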

Tool Comparison

| Feature | Evidently AI | WhyLabs | Arize | Fiddler | NannyML |
|---|---|---|---|---|---|
| Type | Open source + cloud | Cloud (open source agent) | Cloud | Cloud | Open source + cloud |
| Data Drift | Yes (statistical tests) | Yes (approximate profiles) | Yes | Yes | Yes |
| Concept Drift | Yes | Yes | Yes | Yes | Yes (CBPE method) |
| Performance Estimation | Basic | Basic | Basic | Basic | Advanced (no ground truth) |
| Explainability | SHAP integration | SHAP | SHAP, feature importance | Native XAI | Limited |
| LLM Monitoring | Yes (text metrics) | Yes | Yes | Yes | No |
| Alerting | Custom thresholds | Anomaly-based | Anomaly-based | Custom | Custom |
| Pricing | Free (OSS) / paid cloud | Free tier + paid | Paid | Paid | Free (OSS) / paid |
| Best For | Teams wanting OSS control | High-volume streaming | Full-stack observability | Regulated industries | Performance without labels |

Alert Threshold Guidelines

| Metric | Green | Yellow (Investigate) | Red (Action Required) |
|---|---|---|---|
| Data Drift (PSI) | < 0.1 | 0.1 - 0.25 | > 0.25 |
| Null Rate | < baseline + 1% | baseline + 1-5% | > baseline + 5% |
| Prediction Distribution Shift (KS) | < 0.05 | 0.05 - 0.1 | > 0.1 |
| Accuracy Drop | < 2% from baseline | 2-5% from baseline | > 5% from baseline |
| Latency (p99) | < 2x SLA | 2-5x SLA | > 5x SLA |
| Error Rate | < 0.1% | 0.1-1% | > 1% |
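Any of these rows reduces to a three-band classifier over a metric value. A minimal sketch (band edges below follow the PSI row; the function name is illustrative):

```python
def alert_level(value, yellow, red):
    """Map a metric value to a three-band alert level.
    Below `yellow` is green, from `yellow` up to and including
    `red` is yellow, and above `red` is red."""
    if value < yellow:
        return "green"
    if value <= red:
        return "yellow"
    return "red"

# PSI bands from the table above
print(alert_level(0.07, yellow=0.1, red=0.25))  # green
print(alert_level(0.18, yellow=0.1, red=0.25))  # yellow
print(alert_level(0.31, yellow=0.1, red=0.25))  # red
```

Baseline-relative rows (null rate, accuracy drop) fit the same shape once you pass the delta from baseline rather than the raw value.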

Monitoring Architecture

Model Serving Layer
        |
        v
+------------------+     +------------------+
| Request Logger   |---->| Feature Store    |
| (inputs/outputs) |     | (expected ranges)|
+------------------+     +------------------+
        |                         |
        v                         v
+------------------------------------------+
|         Monitoring Platform               |
|  +------------+  +------------------+    |
|  | Drift      |  | Performance      |    |
|  | Detection  |  | Tracking         |    |
|  +------------+  +------------------+    |
|  +------------+  +------------------+    |
|  | Data       |  | Business Metric  |    |
|  | Quality    |  | Correlation      |    |
|  +------------+  +------------------+    |
+------------------------------------------+
        |
        v
+------------------+     +------------------+
| Alert System     |---->| Retraining       |
| (PagerDuty, etc) |     | Pipeline Trigger |
+------------------+     +------------------+

Strategic Recommendations

  1. Start with data quality monitoring. Most production ML failures are data pipeline issues, not model issues.
  2. Monitor predictions when you lack ground truth. Prediction drift is a leading indicator that something changed.
  3. Set thresholds based on business impact, not statistical purity. A 3% accuracy drop on a spam filter is fine; on a medical diagnosis model it is not.
  4. Automate retraining triggers, but keep human approval. Fully autonomous retraining loops are risky until you trust your pipeline deeply.
  5. Log everything. Inputs, outputs, latencies, feature values. You cannot debug what you did not record.

Resources