
SLAs, SLOs & SLIs: Building Reliability Targets That Actually Work

#sre #devops #observability #cloud #monitoring #reliability

Service Level Agreements (SLAs), Service Level Objectives (SLOs), and Service Level Indicators (SLIs) form the backbone of modern reliability engineering. Getting them right is the difference between an SRE team drowning in alerts and one that ships confidently.

At a Glance

| Concept | What it is | Who owns it | Example |
| --- | --- | --- | --- |
| SLI | A measured metric of service behavior | Engineering | 99.2 % of requests < 300 ms |
| SLO | An internal reliability target on an SLI | Engineering + Product | 99.5 % success rate over 30 days |
| SLA | A contractual commitment with consequences | Business / Legal | 99.9 % uptime or credits issued |
| Error budget | Allowed unreliability = 1 − SLO | Engineering + Product | 0.5 % failure budget per window |

Why This Matters

Without explicit reliability targets, teams default to "make everything 100 % available" — which is both impossible and incredibly expensive. SLOs give you a principled way to balance reliability with velocity: if you have error budget left, ship faster; if you've burned it, slow down and stabilize.

"Hope is not a strategy." — Google SRE Book

Step 1 — Choose the Right SLIs

SLIs must reflect what users actually experience. Common categories:

Availability

The proportion of valid requests that succeed.

SLI = (successful requests) / (total valid requests)

"Successful" typically means HTTP 2xx/3xx responses, excluding 4xx (client errors). Be explicit about what counts.

Latency

The proportion of requests faster than a threshold.

SLI = (requests < 300ms) / (total requests)

Use percentile-based thresholds (p50, p95, p99) rather than averages. A mean latency of 200 ms can hide a p99 of 5 seconds.
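To see why averages mislead, here is a small sketch. The sample values are invented, and `percentile` is a hypothetical nearest-rank helper, not a library function:

```python
import statistics

# Invented latency sample (ms): mostly fast, a few slow outliers.
latencies = [120] * 97 + [4800, 5200, 5600]

def percentile(values, p):
    """Nearest-rank percentile: smallest value covering p% of samples."""
    ordered = sorted(values)
    rank = max(1, round(p / 100 * len(ordered)))
    return ordered[rank - 1]

mean_ms = statistics.mean(latencies)  # about 272 ms: looks fine on a dashboard
p99_ms = percentile(latencies, 99)    # 5200 ms: the tail your users actually feel
```

Three slow requests out of a hundred barely move the mean, but they dominate the p99, which is exactly what a latency SLI should surface.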

Correctness / Quality

The proportion of responses that return the right data.

SLI = (correct responses) / (total responses)

Relevant for data pipelines, ML inference, search results, and financial systems.

Freshness

The proportion of data that is updated within an acceptable delay.

SLI = (records updated within threshold) / (total records)

Critical for dashboards, caches, and event-driven systems.

Throughput

The proportion of time the system processes above a minimum rate.

SLI = (minutes above X req/s) / (total minutes)

Useful for batch processing, streaming pipelines, and ETL jobs.

Tips for choosing SLIs

  • Measure at the edge, not internally. A healthy server behind a broken load balancer still fails users.
  • Fewer is better. 2–4 SLIs per service. If you can't explain them in one sentence each, simplify.
  • Use ratios. "Good events / total events" is the universal SLI format. It normalizes across traffic volume.
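The "good events / total events" format from the last tip can be captured in a few lines. This is a minimal sketch; the function name and traffic numbers are illustrative:

```python
def sli_ratio(good_events, total_events):
    """Universal SLI shape: good events over total valid events."""
    if total_events == 0:
        return 1.0  # no traffic means no failures; treat as meeting the SLI
    return good_events / total_events

# Illustrative availability over one window.
availability = sli_ratio(good_events=994_821, total_events=1_000_000)  # 0.994821
```

Because it is a ratio, the same number is comparable across a quiet weekend and a launch-day traffic spike.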

Step 2 — Set SLO Targets

An SLO is a target value for an SLI over a time window.

SLO: 99.5 % of requests succeed over a 30-day rolling window

Choosing the right number

| Target | Downtime / 30 days | Error budget | Typical use case |
| --- | --- | --- | --- |
| 99 % | ~7.2 hours | 1 % | Internal tools, dev environments |
| 99.5 % | ~3.6 hours | 0.5 % | B2B SaaS, non-critical APIs |
| 99.9 % | ~43 minutes | 0.1 % | Core product APIs, checkout flows |
| 99.95 % | ~22 minutes | 0.05 % | Payment processing, auth services |
| 99.99 % | ~4.3 minutes | 0.01 % | Infrastructure (DNS, load balancers) |
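The downtime figures in the table fall out of simple arithmetic; a sketch, assuming a 30-day window and a helper name of our choosing:

```python
def allowed_downtime_minutes(slo, window_days=30):
    """Minutes of full outage an SLO permits over the window."""
    return (1 - slo) * window_days * 24 * 60

allowed_downtime_minutes(0.999)   # about 43.2 minutes per 30 days
allowed_downtime_minutes(0.9999)  # about 4.3 minutes
```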

Rules of thumb

  1. Start lower than you think. It's easy to tighten an SLO, painful to loosen one — especially if it's in an SLA.
  2. Match dependencies. Your SLO can't exceed the SLO of your least reliable dependency.
  3. Separate by user journey. Login might need 99.9 %; a reporting dashboard might only need 99 %.
  4. Use rolling windows (e.g., 30 days) rather than calendar months. Calendar windows create end-of-month panic and reset incentives.

Step 3 — Compute Error Budgets

Error budget = 1 − SLO target

For a 99.5 % SLO over 30 days:

  • Budget = 0.5 % of total requests
  • If you serve 10 M requests/month → 50,000 failures allowed
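The same arithmetic as a small helper (the function name is ours, not from any library):

```python
def error_budget_requests(slo, monthly_requests):
    """Failed requests the error budget allows per window."""
    return int((1 - slo) * monthly_requests)

error_budget_requests(0.995, 10_000_000)  # 50,000 allowed failures
```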

Error budget policies

Define in advance what happens when budget is consumed:

| Budget remaining | Action |
| --- | --- |
| > 50 % | Ship freely, run experiments |
| 25–50 % | Proceed with caution, review risky deploys |
| 5–25 % | Feature freeze for the service, focus on reliability |
| 0 % | All hands on reliability, roll back recent changes |

Write this down as a team agreement, signed by engineering and product leadership. Without a policy, error budgets are just dashboards nobody acts on.
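One way to make such a policy executable is to encode the thresholds directly. This sketch follows the table above; how to treat the gap between 0 % and 5 % is our assumption, since the table only lists those two rows:

```python
def budget_policy(remaining):
    """Map remaining error budget (0.0 to 1.0) to the agreed team action."""
    if remaining > 0.50:
        return "ship freely"
    if remaining > 0.25:
        return "review risky deploys"
    if remaining > 0.05:
        return "feature freeze"
    return "all hands on reliability"  # assumption: treat anything below 5 % like 0 %
```

Wiring a function like this into a deploy pipeline turns the written agreement into an automatic gate rather than a judgment call made mid-incident.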

Step 4 — Instrument & Measure

OpenTelemetry (recommended)

Use OTel to emit metrics at the application level:

```python
from opentelemetry import metrics

meter = metrics.get_meter("my-service")
request_counter = meter.create_counter(
    "http.server.request.count",
    description="Total HTTP requests",
)
error_counter = meter.create_counter(
    "http.server.error.count",
    description="Failed HTTP requests",
)

# In your request handler:
request_counter.add(1, {"method": "GET", "route": "/api/orders"})
if response.status >= 500:
    error_counter.add(1, {"method": "GET", "route": "/api/orders"})
```

Prometheus

If you're already on Prometheus, use histograms for latency SLIs:

```yaml
# Prometheus recording rules
groups:
  - name: slo_rules
    rules:
      - record: sli:availability:ratio_30d
        expr: |
          1 - (
            sum(rate(http_requests_total{status=~"5.."}[30d]))
            /
            sum(rate(http_requests_total[30d]))
          )
      - record: sli:latency:ratio_30d
        expr: |
          sum(rate(http_request_duration_seconds_bucket{le="0.3"}[30d]))
          /
          sum(rate(http_request_duration_seconds_count[30d]))
```
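The latency recording rule is just a ratio of cumulative histogram buckets. The same arithmetic in Python, with invented bucket counts:

```python
# Invented cumulative histogram buckets (Prometheus-style `le` semantics):
# each bucket counts requests at or below its upper bound, in seconds.
buckets = {0.1: 81_000, 0.3: 97_500, 1.0: 99_800, float("inf"): 100_000}

# Requests under 300 ms divided by all requests.
latency_sli = buckets[0.3] / buckets[float("inf")]  # 0.975, i.e. 97.5 % under 300 ms
```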

Cloud-native options

| Provider | SLO tooling |
| --- | --- |
| GCP | Service Monitoring: native SLO definitions on Cloud Run, GKE, etc. |
| AWS | CloudWatch Application Signals, or custom metrics + composite alarms |
| Azure | Azure Monitor SLO (preview), Application Insights availability |
| Datadog | SLO widgets with burn rate alerts |
| Grafana | SLO plugin (Grafana Cloud) or manual dashboards |

Step 5 — Alert on Burn Rate, Not Thresholds

Traditional threshold alerts ("error rate > 1 %") are noisy. Instead, alert on burn rate — how fast you're consuming your error budget.

Multi-window, multi-burn-rate alerts

This is the approach from the Google SRE Workbook:

| Window | Burn rate | Budget consumed | Severity |
| --- | --- | --- | --- |
| 1 hour | 14.4× | 2 % | Page (wake someone up) |
| 6 hours | 6× | 5 % | Page |
| 3 days | 1× | 10 % | Ticket |
```yaml
# Prometheus alert example: fast burn
- alert: HighErrorBudgetBurn
  expr: |
    (
      sli:error_ratio:rate1h > (14.4 * 0.005)
      and
      sli:error_ratio:rate5m > (14.4 * 0.005)
    )
  labels:
    severity: page
  annotations:
    summary: "Burning error budget 14.4x faster than allowed"
```

The short confirmation window (5 min) prevents alerting on single spikes.
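The burn-rate numbers above follow from one formula: the budget fraction you want consumed, scaled by the ratio of the SLO window to the alert window. A sketch, assuming a 30-day SLO window:

```python
def burn_rate(budget_fraction_consumed, alert_window_hours, slo_window_days=30):
    """Burn rate that consumes the given budget fraction within the alert window."""
    return budget_fraction_consumed * slo_window_days * 24 / alert_window_hours

burn_rate(0.02, 1)   # 14.4: 2 % of a 30-day budget gone in one hour
burn_rate(0.05, 6)   # 6.0
burn_rate(0.10, 72)  # 1.0
```

A burn rate of exactly 1× means you would land precisely on the SLO at the end of the window; anything sustained above 1× eventually blows the budget.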

Step 6 — From SLO to SLA

An SLA is a business contract. Rules:

  1. SLA target < SLO target. Always have a buffer. If your SLO is 99.9 %, your SLA should be 99.5 % or lower.
  2. Define measurement precisely. Which endpoint, which HTTP methods, measured where, excluding scheduled maintenance.
  3. Define consequences. Service credits (e.g., 10 % credit per 0.1 % below SLA), extended subscriptions, or termination rights.
  4. Exclude force majeure and planned maintenance windows (with advance notice requirements).

Example SLA clause

"Provider guarantees 99.5 % monthly availability for the Production API, measured as the ratio of successful API responses (HTTP 2xx/3xx) to total API requests, excluding scheduled maintenance windows announced 72 hours in advance. For each 0.1 % below the guaranteed level, Customer receives a 10 % service credit on the affected month's invoice, up to a maximum of 30 %."
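A sketch of the credit arithmetic in that clause. We assume credits scale proportionally between 0.1 % steps; a real contract may round to whole steps:

```python
def service_credit(guaranteed, actual, credit_per_step=0.10, step=0.001, cap=0.30):
    """Credit fraction: 10 % per 0.1 % below the guaranteed level, capped at 30 %."""
    if actual >= guaranteed:
        return 0.0
    steps_below = (guaranteed - actual) / step
    return min(cap, steps_below * credit_per_step)

service_credit(0.995, 0.993)  # 0.2, i.e. a 20 % credit for 0.2 % below target
```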

Step 7 — Review & Iterate

SLOs are living documents. Review quarterly:

  • Are we meeting our SLOs? If always at 100 %, the SLO is too loose — tighten it or ship faster.
  • Are we burning budget too fast? Identify top error contributors and fix them.
  • Do our SLIs still reflect user experience? User complaints that don't show in SLIs mean you're measuring the wrong thing.
  • Has the business context changed? New compliance requirements, new customer tiers, new dependencies.

Common Anti-Patterns

| Anti-pattern | Why it's bad | Fix |
| --- | --- | --- |
| SLO = SLA | No safety margin | SLO should be stricter than SLA |
| Too many SLIs | Alert fatigue, unclear priorities | 2–4 per service max |
| Average latency SLI | Hides tail latency problems | Use percentiles (p99) |
| 100 % target | Infinite cost, zero velocity | Accept a realistic error budget |
| No error budget policy | SLOs become vanity metrics | Write and enforce a policy |
| Measuring server-side only | Misses network / CDN / client issues | Measure at the edge |
