# SLAs, SLOs & SLIs: Building Reliability Targets That Actually Work
Service Level Agreements (SLAs), Service Level Objectives (SLOs), and Service Level Indicators (SLIs) form the backbone of modern reliability engineering. Getting them right is the difference between an SRE team drowning in alerts and one that ships confidently.
## At a Glance
| Concept | What it is | Who owns it | Example |
|---|---|---|---|
| SLI | A measured metric of service behavior | Engineering | 99.2 % of requests < 300 ms |
| SLO | An internal reliability target on an SLI | Engineering + Product | 99.5 % success rate over 30 days |
| SLA | A contractual commitment with consequences | Business / Legal | 99.9 % uptime or credits issued |
| Error budget | Allowed unreliability = 1 − SLO | Engineering + Product | 0.5 % failure budget per window |
## Why This Matters
Without explicit reliability targets, teams default to "make everything 100 % available" — which is both impossible and incredibly expensive. SLOs give you a principled way to balance reliability with velocity: if you have error budget left, ship faster; if you've burned it, slow down and stabilize.
"Hope is not a strategy." — Google SRE Book
## Step 1 — Choose the Right SLIs
SLIs must reflect what users actually experience. Common categories:
### Availability
The proportion of valid requests that succeed.
SLI = (successful requests) / (total valid requests)
"Successful" typically means HTTP 2xx/3xx responses, excluding 4xx (client errors). Be explicit about what counts.
### Latency
The proportion of requests faster than a threshold.
SLI = (requests < 300ms) / (total requests)
Use percentile-based thresholds (p50, p95, p99) rather than averages. A mean latency of 200 ms can hide a p99 of 5 seconds.
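To make the mean-versus-tail point concrete, here is a small sketch with made-up latencies:

```python
import statistics

# Hypothetical sample: 95 fast requests, 5 slow outliers (values in ms)
latencies = [120] * 95 + [4800] * 5

mean = statistics.mean(latencies)
p99 = sorted(latencies)[int(len(latencies) * 0.99) - 1]  # nearest-rank p99

print(f"mean = {mean:.0f} ms")  # 354 ms: looks acceptable
print(f"p99  = {p99} ms")       # 4800 ms: 1 in 100 users waits ~5 s
```

The average looks healthy while one user in a hundred waits almost five seconds, which is exactly what a mean-based SLI would hide.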
### Correctness / Quality
The proportion of responses that return the right data.
SLI = (correct responses) / (total responses)
Relevant for data pipelines, ML inference, search results, and financial systems.
### Freshness
The proportion of data that is updated within an acceptable delay.
SLI = (records updated within threshold) / (total records)
Critical for dashboards, caches, and event-driven systems.
### Throughput
The proportion of time the system processes above a minimum rate.
SLI = (minutes above X req/s) / (total minutes)
Useful for batch processing, streaming pipelines, and ETL jobs.
### Tips for choosing SLIs
- Measure at the edge, not internally. A healthy server behind a broken load balancer still fails users.
- Fewer is better. 2–4 SLIs per service. If you can't explain them in one sentence each, simplify.
- Use ratios. "Good events / total events" is the universal SLI format. It normalizes across traffic volume.
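As a sketch, the availability and latency SLIs above can be computed from a batch of request records (field names and values are hypothetical):

```python
# Hypothetical request records; 4xx responses are excluded from "valid"
# for the availability SLI, per the definition above
requests = [
    {"status": 200, "duration_ms": 120},
    {"status": 301, "duration_ms": 45},
    {"status": 503, "duration_ms": 1200},
    {"status": 404, "duration_ms": 80},  # client error: not a valid request
]

valid = [r for r in requests if not 400 <= r["status"] < 500]
good = [r for r in valid if r["status"] < 400]
availability_sli = len(good) / len(valid)   # 2/3 ≈ 0.667

fast = [r for r in requests if r["duration_ms"] < 300]
latency_sli = len(fast) / len(requests)     # 3/4 = 0.75
```

Both SLIs come out as "good events / total events" ratios, so they stay comparable regardless of traffic volume.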
## Step 2 — Set SLO Targets
An SLO is a target value for an SLI over a time window.
SLO: 99.5 % of requests succeed over a 30-day rolling window
### Choosing the right number
| Target | Downtime / 30 days | Error budget | Typical use case |
|---|---|---|---|
| 99 % | ~7.2 hours | 1 % | Internal tools, dev environments |
| 99.5 % | ~3.6 hours | 0.5 % | B2B SaaS, non-critical APIs |
| 99.9 % | ~43 minutes | 0.1 % | Core product APIs, checkout flows |
| 99.95 % | ~22 minutes | 0.05 % | Payment processing, auth services |
| 99.99 % | ~4.3 minutes | 0.01 % | Infrastructure (DNS, load balancers) |
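The downtime figures in the table fall out of a one-line calculation; a quick sketch:

```python
def allowed_downtime_minutes(slo: float, window_days: int = 30) -> float:
    """Error budget expressed as minutes of full downtime over the window."""
    return (1 - slo) * window_days * 24 * 60

# Reproduces the table above: 99 % -> 432 min (~7.2 h), 99.9 % -> 43.2 min, etc.
for target in (0.99, 0.995, 0.999, 0.9995, 0.9999):
    print(f"{target:.2%} -> {allowed_downtime_minutes(target):.1f} min")
```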
### Rules of thumb
- Start lower than you think. It's easy to tighten an SLO, painful to loosen one — especially if it's in an SLA.
- Match dependencies. Your SLO can't exceed the SLO of your least reliable hard dependency, and serial dependencies compound: two 99.9 % dependencies bound you at roughly 99.8 %.
- Separate by user journey. Login might need 99.9 %; a reporting dashboard might only need 99 %.
- Use rolling windows (e.g., 30 days) rather than calendar months. Calendar windows create end-of-month panic and reset incentives.
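A rolling window can be kept as a fixed-length buffer of daily counts. This sketch (hypothetical numbers) shows how one bad day dilutes into the 30-day ratio instead of resetting at month end:

```python
from collections import deque

window = deque(maxlen=30)  # (good, total) request counts, one entry per day

def record_day(good: int, total: int) -> float:
    """Append a day's counts and return the rolling 30-day availability."""
    window.append((good, total))
    return sum(g for g, _ in window) / sum(t for _, t in window)

for _ in range(30):
    sli = record_day(995, 1000)            # a month of steady 99.5 % days
sli_after_bad_day = record_day(500, 1000)  # one bad day dents the window: ~97.85 %
```

The oldest day simply falls off the buffer each day, so there is no cliff at month boundaries.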
## Step 3 — Compute Error Budgets
Error budget = 1 − SLO target
For a 99.5 % SLO over 30 days:
- Budget = 0.5 % of total requests
- If you serve 10 M requests/month → 50,000 failures allowed
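In request terms, the same arithmetic (a sketch using the numbers above; the month-to-date failure count is hypothetical):

```python
slo = 0.995
monthly_requests = 10_000_000

budget_requests = (1 - slo) * monthly_requests            # 50,000 allowed failures
failures_so_far = 12_500                                  # hypothetical month-to-date
budget_remaining = 1 - failures_so_far / budget_requests  # 0.75 -> 75 % of budget left
```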
### Error budget policies
Define in advance what happens when budget is consumed:
| Budget remaining | Action |
|---|---|
| > 50 % | Ship freely, run experiments |
| 25–50 % | Proceed with caution, review risky deploys |
| 5–25 % | Feature freeze for the service, focus on reliability |
| 0 % | All hands on reliability, rollback recent changes |
Write this down as a team agreement, signed by engineering and product leadership. Without a policy, error budgets are just dashboards nobody acts on.
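One way to make the policy executable, for example as a deploy-gate check, with the thresholds taken from the table above:

```python
def budget_action(remaining: float) -> str:
    """Map remaining error budget (0.0 to 1.0) to the agreed policy action."""
    if remaining > 0.50:
        return "ship freely"
    if remaining > 0.25:
        return "proceed with caution"
    if remaining > 0.05:
        return "feature freeze"
    return "all hands on reliability"

# A CI gate could block deploys outside the first two tiers:
deploy_allowed = budget_action(0.40) in ("ship freely", "proceed with caution")
```

Encoding the policy this way keeps it enforceable rather than aspirational, which is the point of writing it down in the first place.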
## Step 4 — Instrument & Measure
### OpenTelemetry (recommended)
Use OTel to emit metrics at the application level:
```python
from opentelemetry import metrics

meter = metrics.get_meter("my-service")

request_counter = meter.create_counter(
    "http.server.request.count",
    description="Total HTTP requests",
)
error_counter = meter.create_counter(
    "http.server.error.count",
    description="Failed HTTP requests",
)

# In your request handler:
request_counter.add(1, {"method": "GET", "route": "/api/orders"})
if response.status >= 500:
    error_counter.add(1, {"method": "GET", "route": "/api/orders"})
```
### Prometheus
If you're already on Prometheus, use histograms for latency SLIs:
```yaml
# prometheus recording rules
groups:
  - name: slo_rules
    rules:
      - record: sli:availability:ratio_30d
        expr: |
          1 - (
            sum(rate(http_requests_total{status=~"5.."}[30d]))
            /
            sum(rate(http_requests_total[30d]))
          )
      - record: sli:latency:ratio_30d
        expr: |
          sum(rate(http_request_duration_seconds_bucket{le="0.3"}[30d]))
          /
          sum(rate(http_request_duration_seconds_count[30d]))
```
### Cloud-native options
| Provider | SLO tooling |
|---|---|
| GCP | Service Monitoring — native SLO definition on Cloud Run, GKE, etc. |
| AWS | CloudWatch Application Signals, or custom metrics + composite alarms |
| Azure | Azure Monitor SLO (preview), Application Insights availability |
| Datadog | SLO widgets with burn rate alerts |
| Grafana | SLO plugin (Grafana Cloud) or manual dashboards |
## Step 5 — Alert on Burn Rate, Not Thresholds
Traditional threshold alerts ("error rate > 1 %") are noisy. Instead, alert on burn rate — how fast you're consuming your error budget.
### Multi-window, multi-burn-rate alerts
This is the approach from the Google SRE Workbook:
| Window | Burn rate | Budget consumed | Severity |
|---|---|---|---|
| 1 hour | 14.4× | 2 % | Page (wake someone up) |
| 6 hours | 6× | 5 % | Page |
| 3 days | 1× | 10 % | Ticket |
```yaml
# Prometheus alert example — fast burn
- alert: HighErrorBudgetBurn
  expr: |
    (
      sli:error_ratio:rate1h > (14.4 * 0.005)
      and
      sli:error_ratio:rate5m > (14.4 * 0.005)
    )
  labels:
    severity: page
  annotations:
    summary: "Burning error budget 14.4x faster than allowed"
```
The short confirmation window (5 min) prevents alerting on single spikes.
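The burn-rate multiples translate directly into time-to-empty; a sketch for the 99.5 % SLO used earlier:

```python
def burn_rate(error_ratio: float, slo: float) -> float:
    """How many times faster than sustainable the error budget is burning."""
    return error_ratio / (1 - slo)

slo = 0.995
rate = burn_rate(0.072, slo)  # observed 7.2 % errors -> 14.4x burn
days_to_empty = 30 / rate     # the whole 30-day budget gone in ~2.1 days
```

A burn rate of 1x means the budget lasts exactly the window; 14.4x means you exhaust it in roughly two days, which is why that threshold pages immediately.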
## Step 6 — From SLO to SLA
An SLA is a business contract. Rules:
- SLA target < SLO target. Always have a buffer. If your SLO is 99.9 %, your SLA should be 99.5 % or lower.
- Define measurement precisely. Which endpoint, which HTTP methods, measured where, excluding scheduled maintenance.
- Define consequences. Service credits (e.g., 10 % credit per 0.1 % below SLA), extended subscriptions, or termination rights.
- Exclude force majeure and planned maintenance windows (with advance notice requirements).
### Example SLA clause
"Provider guarantees 99.5 % monthly availability for the Production API, measured as the ratio of successful API responses (HTTP 2xx/3xx) to total API requests, excluding scheduled maintenance windows announced 72 hours in advance. For each 0.1 % below the guaranteed level, Customer receives a 10 % service credit on the affected month's invoice, up to a maximum of 30 %."
## Step 7 — Review & Iterate
SLOs are living documents. Review quarterly:
- Are we meeting our SLOs? If always at 100 %, the SLO is too loose — tighten it or ship faster.
- Are we burning budget too fast? Identify top error contributors and fix them.
- Do our SLIs still reflect user experience? User complaints that don't show in SLIs mean you're measuring the wrong thing.
- Has the business context changed? New compliance requirements, new customer tiers, new dependencies.
## Common Anti-Patterns
| Anti-pattern | Why it's bad | Fix |
|---|---|---|
| SLO = SLA | No safety margin | SLO should be stricter than SLA |
| Too many SLIs | Alert fatigue, unclear priorities | 2–4 per service max |
| Average latency SLI | Hides tail latency problems | Use percentiles (p99) |
| 100 % target | Infinite cost, zero velocity | Accept a realistic error budget |
| No error budget policy | SLOs become vanity metrics | Write and enforce a policy |
| Measuring server-side only | Misses network / CDN / client issues | Measure at the edge |
## Resources
- Google SRE Book — Service Level Objectives
- Google SRE Workbook — Implementing SLOs
- OpenSLO specification — Vendor-neutral SLO-as-code format
- Sloth — Generate Prometheus SLO rules from a simple YAML spec
- Art of SLOs (Google course) — Free Coursera course