Data downtime is the period when data is missing, inaccurate, or otherwise unusable. Unlike application downtime, which triggers immediate alerts, data downtime often goes undetected for hours or days. By the time someone notices a wrong number on a dashboard, decisions have already been made on faulty information. The cost is real and measurable.
## Downtime Cost Calculation Framework
| Cost Category | Formula | Example (mid-size company) |
|---|---|---|
| Revenue impact | (wrong decisions × avg decision value) per incident | $50K–$500K per major incident |
| Labor cost | engineers × hours × hourly rate, per incident | 3 engineers × 8h × $100/hr = $2,400 |
| Opportunity cost | projects delayed while firefighting | $10K–$50K per week of delay |
| Trust erosion | qualitative: stakeholders stop using data | unmeasurable but compounding |
| Compliance risk | regulatory fines for incorrect reporting | $10K–$10M depending on sector |
| SLA penalties | contractual penalties to data consumers | per contract terms |
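The first two rows of the table lend themselves to a quick per-incident calculator. A minimal sketch (the function name, defaults, and return layout are assumptions for illustration, not a standard formula):

```python
def incident_cost(engineers: int, hours: float, hourly_rate: float,
                  decision_value: float = 0.0, wrong_decisions: int = 0) -> dict:
    """Estimate the direct cost of a single data downtime incident."""
    labor = engineers * hours * hourly_rate     # labor cost row
    revenue = wrong_decisions * decision_value  # revenue impact row
    return {"labor": labor, "revenue": revenue, "total": labor + revenue}

# Worked example from the table: 3 engineers x 8h x $100/hr
print(incident_cost(engineers=3, hours=8, hourly_rate=100))
```

Opportunity cost and trust erosion are deliberately left out: they are harder to attribute to a single incident and are better tracked at the quarterly level.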
Industry benchmarks: Gartner estimates poor data quality costs organizations an average of $12.9M annually. Data downtime specifically accounts for 25-40% of a data team's time.
## Incident Category Taxonomy
```
Data Downtime Incidents
+-- Freshness
|   +-- Pipeline delayed or stuck
|   +-- Source system late delivery
|   +-- Orchestrator failure (DAG stuck)
+-- Volume
|   +-- Unexpected row count drop/spike
|   +-- Partial load (missing partitions)
|   +-- Duplicate records
+-- Schema
|   +-- Column added/removed/renamed upstream
|   +-- Data type change breaking transforms
|   +-- Encoding change (UTF-8 issues)
+-- Distribution
|   +-- Metric drift (gradual or sudden)
|   +-- NULL rate spike
|   +-- Categorical value shift
+-- Lineage
    +-- Broken dependency chain
    +-- Orphaned tables still consumed
    +-- Circular dependencies
```
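Several branches of this taxonomy reduce to simple rule-based checks. A sketch of detectors for the Freshness and Volume branches; the function names, thresholds, and return strings are assumptions chosen for illustration:

```python
from datetime import datetime, timedelta
from typing import Optional

def check_freshness(last_updated: datetime, max_age: timedelta,
                    now: datetime) -> Optional[str]:
    """Flag a Freshness incident if the table has not updated in time."""
    if now - last_updated > max_age:
        return "Freshness: pipeline delayed or stuck"
    return None

def check_volume(row_count: int, expected: int,
                 tolerance: float = 0.2) -> Optional[str]:
    """Flag a Volume incident on an unexpected row-count drop or spike."""
    if abs(row_count - expected) > tolerance * expected:
        return "Volume: unexpected row count drop/spike"
    return None
```

Schema and distribution checks need more state (a stored schema snapshot, a rolling histogram), which is exactly the gap the observability tools below fill.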
## Data SLA Template
| SLA Dimension | Tier 1 (Critical) | Tier 2 (Important) | Tier 3 (Standard) |
|---|---|---|---|
| Freshness | Updated within 15 min | Updated within 1 hour | Updated within 24 hours |
| Completeness | 99.9% of expected rows | 99% of expected rows | 95% of expected rows |
| Accuracy | < 0.01% error rate | < 0.1% error rate | < 1% error rate |
| Availability | 99.9% uptime | 99.5% uptime | 99% uptime |
| Response time | Incident response < 15 min | Response < 1 hour | Response < 4 hours |
| Resolution time | Resolved < 2 hours | Resolved < 8 hours | Resolved < 48 hours |
| Examples | Financial reporting, ML features | Executive dashboards, KPIs | Exploratory datasets, archives |
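SLA tiers earn their keep when they are machine-checkable rather than wiki text. A sketch encoding the freshness column above as thresholds; the dict layout and tier keys are assumptions:

```python
from datetime import timedelta

# Freshness windows from the SLA table, keyed by tier.
FRESHNESS_SLA = {
    "tier1": timedelta(minutes=15),
    "tier2": timedelta(hours=1),
    "tier3": timedelta(hours=24),
}

def meets_freshness_sla(tier: str, data_age: timedelta) -> bool:
    """True if the dataset's age is within its tier's freshness window."""
    return data_age <= FRESHNESS_SLA[tier]

# e.g. a Tier 1 financial-reporting table last refreshed 10 minutes ago
meets_freshness_sla("tier1", timedelta(minutes=10))
```

The same pattern extends to completeness and accuracy by swapping the threshold type (a row-count ratio or error rate instead of a timedelta).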
## Impact Matrix by Data Consumer Type
| Consumer Type | Impact of Wrong Data | Impact of Late Data | Impact of Missing Data | Sensitivity |
|---|---|---|---|---|
| C-suite / Board | Strategic missteps | Delayed decisions | Loss of confidence | Very High |
| Finance / Accounting | Regulatory violations | Late filings | Audit failures | Very High |
| ML Models in Production | Wrong predictions, bad UX | Stale features, drift | Model failures | High |
| Product Managers | Wrong prioritization | Missed windows | Guesswork | High |
| Marketing | Wasted spend, wrong targeting | Missed campaign timing | No attribution | Medium-High |
| Operations | Process inefficiency | Delayed response | Manual workarounds | Medium |
| Data Analysts | Incorrect reports | Delayed insights | Blocked work | Medium |
| External Clients (data products) | SLA breach, churn | Trust erosion | Contract violation | Very High |
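The sensitivity column can drive alert routing directly: page on-call immediately when any affected consumer is very-high sensitivity, and queue everything else. A hypothetical rule (the consumer keys and sensitivity labels are invented for illustration):

```python
# Sensitivity labels mirroring the matrix above.
SENSITIVITY = {
    "c_suite": "very_high",
    "finance": "very_high",
    "ml_production": "high",
    "product": "high",
    "marketing": "medium_high",
    "operations": "medium",
    "analysts": "medium",
    "external_clients": "very_high",
}

def page_on_call(affected_consumers: list) -> bool:
    """Page immediately if any affected consumer is very-high sensitivity."""
    return any(SENSITIVITY.get(c) == "very_high" for c in affected_consumers)
```

In practice the consumer list for a given table comes from lineage metadata, which is another reason lineage coverage matters in the tool comparison below.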
## Tool Comparison for Detection
| Tool | Type | Freshness | Volume | Schema | Distribution | Lineage | Pricing Model |
|---|---|---|---|---|---|---|---|
| Monte Carlo | Commercial | Yes | Yes | Yes | Yes | Yes | Per table monitored |
| Anomalo | Commercial | Yes | Yes | Limited | Yes | No | Per table |
| Elementary | OSS (dbt) | Yes | Yes | Yes | Yes | Via dbt | Free |
| Great Expectations | OSS | No | Yes | Yes | Yes | No | Free |
| Soda Core | OSS | Yes | Yes | Yes | Yes | No | Free |
| dbt tests | OSS | Limited | Yes | Yes | Limited | Via dbt | Free |
| Datafold | Commercial | Yes | Yes | Yes | Yes | Yes | Per user |
| Bigeye | Commercial | Yes | Yes | Yes | Yes | Limited | Per table |
## Building a Data Reliability Practice
The most effective approach is layered: dbt tests catch known issues at transform time, Great Expectations or Soda validate data contracts at ingestion boundaries, and an observability tool like Elementary or Monte Carlo provides anomaly detection across the full pipeline. Pair this with clear SLA tiers, on-call rotations for Tier 1 data, and incident post-mortems that feed back into test coverage.
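The anomaly-detection layer can be illustrated with a toy z-score check on daily row counts: flag any day that lands far outside the trailing history. This is a simplified sketch of the idea, not any vendor's actual algorithm; the window and threshold are assumptions to tune per table:

```python
import statistics

def is_anomalous(history: list, today: int, z_threshold: float = 3.0) -> bool:
    """Flag today's row count if it sits more than z_threshold standard
    deviations from the trailing history (needs at least two data points)."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return today != mean  # flat history: any change is an anomaly
    return abs(today - mean) / stdev > z_threshold
```

Real observability tools add seasonality handling (weekday vs. weekend volumes) and adaptive thresholds, but the core signal is the same: learn what normal looks like, then alert on deviation.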
## Resources