# Incident Management: Building a Culture of Reliability
Incident management is not about tools; it is about people, process, and learning. Organizations that handle incidents well do not have fewer failures; they recover faster and learn more from each one.
## Incident Lifecycle
```
┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐
│  Detect  │───>│  Triage  │───>│ Mitigate │───>│ Resolve  │───>│  Learn   │
└──────────┘    └──────────┘    └──────────┘    └──────────┘    └──────────┘
     │               │               │               │               │
  Alerting        Severity       War room       Root cause      Postmortem
  Anomaly        Assign IC     Communicate      Deploy fix     Action items
 User report      Notify        Rollback         Verify        Share widely
               stakeholders   Feature flag        Close       Update runbooks
```
Key roles during an incident:
- Incident Commander (IC): Owns coordination, decisions, communication cadence
- Technical Lead: Drives investigation and mitigation
- Communications Lead: Updates stakeholders, status page, customers
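The lifecycle stages and the IC role can be sketched as a small state machine. This is a minimal illustration, not a prescribed implementation: the stage names follow the diagram above, while the `Incident` class and its fields are hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum

class Stage(Enum):
    DETECT = 1
    TRIAGE = 2
    MITIGATE = 3
    RESOLVE = 4
    LEARN = 5

@dataclass
class Incident:
    title: str
    stage: Stage = Stage.DETECT
    roles: dict = field(default_factory=dict)  # e.g. {"IC": "alice"}

    def advance(self) -> Stage:
        # Move to the next lifecycle stage; stages may not be skipped,
        # and mitigation requires an Incident Commander to be assigned.
        if self.stage is Stage.LEARN:
            raise ValueError("incident already closed out")
        if self.stage is Stage.TRIAGE and "IC" not in self.roles:
            raise ValueError("assign an Incident Commander before mitigating")
        self.stage = Stage(self.stage.value + 1)
        return self.stage
```

Encoding the transitions makes it impossible to, say, jump from detection straight to resolution without a triage decision on record.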
## Severity Level Matrix
| Severity | Impact | Examples | Response Time | Update Cadence | Who is Paged |
|---|---|---|---|---|---|
| SEV1 / P1 | Total service outage, data loss risk | Production down, payment processing failed | < 5 min | Every 15 min | On-call + Eng Manager + VP |
| SEV2 / P2 | Major feature degraded, significant user impact | Search broken, latency > 10x normal | < 15 min | Every 30 min | On-call + Eng Manager |
| SEV3 / P3 | Minor feature degraded, workaround exists | Dashboard slow, non-critical integration down | < 1 hour | Every 2 hours | On-call team |
| SEV4 / P4 | Cosmetic issue, no user impact | Logging noise, minor UI glitch | Next business day | Daily | Team backlog |
| SEV5 / P5 | Informational, potential risk | Capacity trending, certificate expiring soon | Planned sprint | Weekly | Team backlog |
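A matrix like the one above is most useful when it is machine-readable, so paging and update reminders can be driven from it. Here is one possible encoding; the `SEVERITY_POLICY` structure and the team names are illustrative assumptions, not a standard.

```python
from datetime import timedelta

# Hypothetical encoding of the top three rows of the severity matrix;
# thresholds and paging targets should be tuned per organization.
SEVERITY_POLICY = {
    "SEV1": {"respond": timedelta(minutes=5),  "update": timedelta(minutes=15),
             "page": ["on-call", "eng-manager", "vp"]},
    "SEV2": {"respond": timedelta(minutes=15), "update": timedelta(minutes=30),
             "page": ["on-call", "eng-manager"]},
    "SEV3": {"respond": timedelta(hours=1),    "update": timedelta(hours=2),
             "page": ["on-call"]},
}

def paging_targets(severity: str) -> list[str]:
    """Who gets paged for a given severity (unknown severities page no one)."""
    return SEVERITY_POLICY.get(severity, {}).get("page", [])
```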
## Metrics Framework
| Metric | Definition | Target (Mature Org) | How to Measure |
|---|---|---|---|
| MTTD (Mean Time to Detect) | Time from failure start to first alert | < 5 min | Alert timestamp - failure start |
| MTTA (Mean Time to Acknowledge) | Time from alert to human response | < 5 min (P1) | Ack timestamp - alert timestamp |
| MTTM (Mean Time to Mitigate) | Time from detection to user impact reduced | < 30 min (P1) | Mitigation timestamp - detection |
| MTTR (Mean Time to Resolve) | Time from detection to full resolution | < 4 hours (P1) | Resolution timestamp - detection |
| MTBF (Mean Time Between Failures) | Time between incidents for a service | Trending up | Incident frequency analysis |
| Postmortem completion rate | % of P1/P2 incidents with postmortem | 100% | Postmortem count / incident count |
## Postmortem Template Structure
```
Postmortem: [Incident Title]
├── Metadata
│   ├── Date, Duration, Severity
│   ├── Incident Commander
│   └── Authors / Reviewers
├── Summary (3-5 sentences)
├── Impact
│   ├── Users affected (count, %)
│   ├── Revenue impact
│   └── SLO budget consumed
├── Timeline (timestamped)
│   ├── Detection
│   ├── Key decisions
│   ├── Mitigation steps
│   └── Resolution
├── Root Cause Analysis
│   ├── Contributing factors
│   └── 5 Whys or Fishbone diagram
├── What Went Well
├── What Went Poorly
├── Action Items
│   ├── [P1] Immediate fixes
│   ├── [P2] Systemic improvements
│   └── [P3] Long-term hardening
└── Lessons Learned
```
Critical rule: Postmortems must be blameless. Focus on systems and processes, never individuals.
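Teams get more consistent postmortems when the skeleton is generated rather than copied by hand. A possible sketch, using the section names from the template above (the helper function itself is hypothetical):

```python
# Section order follows the postmortem template; metadata fields are
# passed in, everything else is left as a TODO for the authors.
SECTIONS = [
    "Summary", "Impact", "Timeline", "Root Cause Analysis",
    "What Went Well", "What Went Poorly", "Action Items", "Lessons Learned",
]

def postmortem_skeleton(title: str, severity: str, ic: str) -> str:
    lines = [
        f"# Postmortem: {title}", "",
        f"- Severity: {severity}",
        f"- Incident Commander: {ic}", "",
    ]
    for section in SECTIONS:
        lines += [f"## {section}", "", "_TODO_", ""]
    return "\n".join(lines)
```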
## Tool Comparison
| Capability | PagerDuty | OpsGenie (Atlassian) | incident.io | Rootly | FireHydrant |
|---|---|---|---|---|---|
| On-call scheduling | Excellent | Excellent | Via integration | Via integration | Good |
| Escalation policies | Excellent | Excellent | Good | Good | Good |
| Slack-native workflow | Plugin | Plugin | Native | Native | Native |
| Status page | Add-on | Statuspage | Built-in | Built-in | Built-in |
| Postmortem workflow | Basic | Basic | Excellent | Excellent | Excellent |
| Automation/runbooks | Event Orchestration | Good | Workflows | Workflows | Runbooks |
| Pricing | Per-user, tiered | Per-user | Per-user | Per-user | Per-user |
| Best for | Enterprise, mature orgs | Atlassian shops | Slack-first teams | Slack-first teams | Process-heavy orgs |
## Building a Healthy Incident Culture
On-call should not be painful. If on-call is dreaded, it signals systemic problems: too many alerts, poor runbooks, or insufficient automation. Track on-call burden (pages per shift, off-hours pages) and set improvement targets.
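Tracking on-call burden can be as simple as counting pages per shift and flagging off-hours pages. A minimal sketch; the 9:00-18:00 weekday definition of "business hours" is an illustrative assumption to adjust for your team:

```python
from datetime import datetime

def burden(pages: list[datetime], shifts: int) -> dict:
    """Pages per shift plus a count of pages outside business hours
    (assumed here to be 9:00-18:00, Monday-Friday)."""
    off_hours = [p for p in pages
                 if p.hour < 9 or p.hour >= 18 or p.weekday() >= 5]
    return {
        "pages_per_shift": len(pages) / max(shifts, 1),
        "off_hours_pages": len(off_hours),
    }
```

Plotting these two numbers per rotation over time makes it easy to see whether alert-quality work is actually reducing the load.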
Practice incidents. Run regular game days and chaos engineering exercises. Teams that practice incident response perform dramatically better during real incidents.
Share widely. Publish postmortems organization-wide. The learning value of an incident is proportional to how many people read the postmortem.
## Resources
- Google SRE Book - Managing Incidents
- Jeli.io - Howie Guide to Post-Incident Learning
- PagerDuty Incident Response Guide
- Learning from Incidents in Software