
Incident Management: Building a Culture of Reliability

#sre#devops#incident-management#reliability

Incident management is not about tools; it is about people, process, and learning. Organizations that handle incidents well do not have fewer failures; they recover faster and learn more from each one.

Incident Lifecycle

┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐
│  Detect  │───>│  Triage  │───>│ Mitigate │───>│ Resolve  │───>│  Learn   │
└──────────┘    └──────────┘    └──────────┘    └──────────┘    └──────────┘
     │               │               │               │               │
  Alerting       Severity        War room       Root cause       Postmortem
  Anomaly        Assign IC      Communicate     Deploy fix       Action items
  User report    Notify         Rollback        Verify           Share widely
                 stakeholders   Feature flag    Close            Update runbooks

Key roles during an incident:

  • Incident Commander (IC): Owns coordination, decisions, communication cadence
  • Technical Lead: Drives investigation and mitigation
  • Communications Lead: Updates stakeholders, status page, customers
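The lifecycle above can be sketched as a simple forward-only state machine. This is an illustrative sketch, not a standard API; the class and method names are assumptions.

```python
# Minimal sketch of the Detect -> Triage -> Mitigate -> Resolve -> Learn
# lifecycle as a forward-only state machine. Names are illustrative.
from enum import Enum


class Stage(Enum):
    DETECT = 1
    TRIAGE = 2
    MITIGATE = 3
    RESOLVE = 4
    LEARN = 5


class Incident:
    def __init__(self, title: str):
        self.title = title
        self.stage = Stage.DETECT
        self.history = [Stage.DETECT]  # audit trail for the timeline section

    def advance(self) -> Stage:
        """Move to the next lifecycle stage; LEARN is terminal."""
        if self.stage is Stage.LEARN:
            raise ValueError("incident already closed out")
        self.stage = Stage(self.stage.value + 1)
        self.history.append(self.stage)
        return self.stage


inc = Incident("checkout latency spike")
inc.advance()  # Triage: assign IC, notify stakeholders
inc.advance()  # Mitigate: rollback, feature flag
print(inc.stage.name)  # MITIGATE
```

Modeling the lifecycle explicitly makes it easy to record a timestamped history for the postmortem timeline later.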

Severity Level Matrix

| Severity | Impact | Examples | Response Time | Update Cadence | Who Is Paged |
|---|---|---|---|---|---|
| SEV1 / P1 | Total service outage, data loss risk | Production down, payment processing failed | < 5 min | Every 15 min | On-call + Eng Manager + VP |
| SEV2 / P2 | Major feature degraded, significant user impact | Search broken, latency > 10x normal | < 15 min | Every 30 min | On-call + Eng Manager |
| SEV3 / P3 | Minor feature degraded, workaround exists | Dashboard slow, non-critical integration down | < 1 hour | Every 2 hours | On-call team |
| SEV4 / P4 | Cosmetic issue, no user impact | Logging noise, minor UI glitch | Next business day | Daily | Team backlog |
| SEV5 / P5 | Informational, potential risk | Capacity trending, certificate expiring soon | Planned sprint | Weekly | Team backlog |
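A severity matrix like this is most useful when it is machine-readable, so paging tools can act on it. Here is a hypothetical lookup table encoding the paged severities above; the dictionary shape and team names are assumptions, with timings expressed in minutes.

```python
# Hypothetical policy table derived from the severity matrix.
# SEV4/SEV5 are intentionally absent: they go to the team backlog, not a pager.
SEVERITY_POLICY = {
    "SEV1": {"response_min": 5, "update_min": 15,
             "page": ["on-call", "eng-manager", "vp"]},
    "SEV2": {"response_min": 15, "update_min": 30,
             "page": ["on-call", "eng-manager"]},
    "SEV3": {"response_min": 60, "update_min": 120,
             "page": ["on-call"]},
}


def escalation_targets(severity: str) -> list:
    """Who gets paged for a given severity; unpaged severities return []."""
    return SEVERITY_POLICY.get(severity, {}).get("page", [])


print(escalation_targets("SEV1"))  # ['on-call', 'eng-manager', 'vp']
print(escalation_targets("SEV4"))  # []
```

Keeping the policy in one structure means the paging system and the status-page update cadence are driven by the same source of truth.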

Metrics Framework

| Metric | Definition | Target (Mature Org) | How to Measure |
|---|---|---|---|
| MTTD (Mean Time to Detect) | Time from failure start to first alert | < 5 min | Alert timestamp - failure start |
| MTTA (Mean Time to Acknowledge) | Time from alert to human response | < 5 min (P1) | Ack timestamp - alert timestamp |
| MTTM (Mean Time to Mitigate) | Time from detection to user impact reduced | < 30 min (P1) | Mitigation timestamp - detection |
| MTTR (Mean Time to Resolve) | Time from detection to full resolution | < 4 hours (P1) | Resolution timestamp - detection |
| MTBF (Mean Time Between Failures) | Time between incidents for a service | Trending up | Incident frequency analysis |
| Postmortem completion rate | % of P1/P2 incidents with postmortem | 100% | Postmortem count / incident count |
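The timestamp arithmetic in the "How to Measure" column is straightforward to compute from incident records. A minimal sketch, assuming each incident is a dict with `failure_start`, `alerted`, `acknowledged`, and `resolved` timestamps (field names are my assumption, not a standard schema):

```python
# Sketch: compute MTTD/MTTA/MTTR in minutes from incident timestamps,
# following the definitions in the metrics table. Field names are assumed.
from datetime import datetime
from statistics import mean

incidents = [
    {  # illustrative data
        "failure_start": datetime(2024, 5, 1, 10, 0),
        "alerted":       datetime(2024, 5, 1, 10, 4),
        "acknowledged":  datetime(2024, 5, 1, 10, 7),
        "resolved":      datetime(2024, 5, 1, 12, 4),
    },
    {
        "failure_start": datetime(2024, 5, 2, 14, 0),
        "alerted":       datetime(2024, 5, 2, 14, 6),
        "acknowledged":  datetime(2024, 5, 2, 14, 8),
        "resolved":      datetime(2024, 5, 2, 15, 6),
    },
]


def mean_minutes(records, start_field, end_field):
    """Average (end - start) across incidents, in minutes."""
    deltas = [(r[end_field] - r[start_field]).total_seconds() / 60
              for r in records]
    return mean(deltas)


mttd = mean_minutes(incidents, "failure_start", "alerted")   # detect
mtta = mean_minutes(incidents, "alerted", "acknowledged")    # acknowledge
mttr = mean_minutes(incidents, "alerted", "resolved")        # detection -> resolution
print(mttd, mtta, mttr)  # 5.0 2.5 90.0
```

Note that MTTR here is measured from detection (the alert), per the table's definition, not from the underlying failure start.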

Postmortem Template Structure

Postmortem: [Incident Title]
├── Metadata
│   ├── Date, Duration, Severity
│   ├── Incident Commander
│   └── Authors / Reviewers
├── Summary (3-5 sentences)
├── Impact
│   ├── Users affected (count, %)
│   ├── Revenue impact
│   └── SLO budget consumed
├── Timeline (timestamped)
│   ├── Detection
│   ├── Key decisions
│   ├── Mitigation steps
│   └── Resolution
├── Root Cause Analysis
│   ├── Contributing factors
│   └── 5 Whys or Fishbone diagram
├── What Went Well
├── What Went Poorly
├── Action Items
│   ├── [P1] Immediate fixes
│   ├── [P2] Systemic improvements
│   └── [P3] Long-term hardening
└── Lessons Learned

Critical rule: Postmortems must be blameless. Focus on systems and processes, never individuals.
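A lightweight way to enforce the postmortem completion metric is to generate the skeleton automatically when an incident closes. This is a hypothetical generator that renders the template tree above as a markdown stub; the function name and section list mirror the template but are otherwise my own sketch.

```python
# Hypothetical generator: render the postmortem template as a markdown
# skeleton so every closed P1/P2 incident starts with the same structure.
SECTIONS = [
    "Summary", "Impact", "Timeline", "Root Cause Analysis",
    "What Went Well", "What Went Poorly", "Action Items", "Lessons Learned",
]


def postmortem_skeleton(title: str, severity: str, ic: str) -> str:
    lines = [
        f"# Postmortem: {title}",
        f"- Severity: {severity}",
        f"- Incident Commander: {ic}",
        "",
    ]
    for section in SECTIONS:
        lines += [f"## {section}", "_TODO_", ""]
    return "\n".join(lines)


doc = postmortem_skeleton("Checkout outage", "SEV1", "A. On-Call")
print(doc.splitlines()[0])  # Postmortem: Checkout outage (with leading '# ')
```

Generating the stub removes one excuse for skipping the writeup and keeps postmortems consistent enough to compare across teams.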

Tool Comparison

| Capability | PagerDuty | OpsGenie (Atlassian) | incident.io | Rootly | FireHydrant |
|---|---|---|---|---|---|
| On-call scheduling | Excellent | Excellent | Via integration | Via integration | Good |
| Escalation policies | Excellent | Excellent | Good | Good | Good |
| Slack-native workflow | Plugin | Plugin | Native | Native | Native |
| Status page | Add-on | Statuspage | Built-in | Built-in | Built-in |
| Postmortem workflow | Basic | Basic | Excellent | Excellent | Excellent |
| Automation/runbooks | Event Orchestration | Good | Workflows | Workflows | Runbooks |
| Pricing | Per-user, tiered | Per-user | Per-user | Per-user | Per-user |
| Best for | Enterprise, mature orgs | Atlassian shops | Slack-first teams | Slack-first teams | Process-heavy orgs |

Building a Healthy Incident Culture

On-call should not be painful. If on-call is dreaded, it signals systemic problems: too many alerts, poor runbooks, or insufficient automation. Track on-call burden (pages per shift, off-hours pages) and set improvement targets.
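Tracking on-call burden can be as simple as counting pages per shift and the share that land off-hours. A minimal sketch, assuming a list of page timestamps and treating "off-hours" as before 09:00 or after 18:00 local time (that cutoff is an illustrative assumption, not a standard):

```python
# Sketch: on-call burden from a list of page timestamps.
# "Off-hours" boundary (09:00-18:00) is an illustrative assumption.
from datetime import datetime


def burden(pages: list, shifts: int) -> dict:
    """Pages per shift and fraction of pages arriving off-hours."""
    off_hours = [p for p in pages if p.hour < 9 or p.hour >= 18]
    return {
        "pages_per_shift": len(pages) / shifts,
        "off_hours_ratio": len(off_hours) / len(pages) if pages else 0.0,
    }


pages = [
    datetime(2024, 5, 1, 3, 12),   # off-hours
    datetime(2024, 5, 1, 11, 40),
    datetime(2024, 5, 2, 22, 5),   # off-hours
    datetime(2024, 5, 3, 14, 30),
]
stats = burden(pages, shifts=2)
print(stats)  # {'pages_per_shift': 2.0, 'off_hours_ratio': 0.5}
```

Reviewing these two numbers per rotation makes the "is on-call painful?" question measurable instead of anecdotal.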

Practice incidents. Run regular game days and chaos engineering exercises. Teams that practice incident response perform dramatically better during real incidents.

Share widely. Publish postmortems organization-wide. The learning value of an incident is proportional to how many people read the postmortem.
