Disaster Recovery Architecture: Tiers, Trade-offs & Testing
#architecture #disaster-recovery #reliability #cloud #business-continuity
Disaster recovery (DR) is not a feature you add later. It is an architectural decision that shapes infrastructure cost, operational complexity, and business risk tolerance from day one. The gap between "we have backups" and "we can recover in minutes" is measured in budget and discipline.
DR Tier Comparison
| Tier | Strategy | RPO | RTO | Relative Cost | Data Loss Risk |
|---|---|---|---|---|---|
| Tier 0 | No DR plan | Undefined | Days-Never | 1x (baseline) | Total loss possible |
| Tier 1 | Backup & Restore | Hours-Days | Hours-Days | 1.1x | Last backup window |
| Tier 2 | Pilot Light | Minutes-Hours | Hours | 1.3x | Low |
| Tier 3 | Warm Standby | Minutes | 15-60 min | 1.5-2x | Very low |
| Tier 4 | Hot Standby | Seconds | Minutes | 2-3x | Near zero |
| Tier 5 | Active-Active | Near zero | Near zero | 3-4x | Effectively zero |
RPO = Recovery Point Objective (how much data you can afford to lose)
RTO = Recovery Time Objective (how fast you must be back online)
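To make the table actionable, the sketch below picks the cheapest tier whose worst case still meets a given RPO and RTO. The numeric bounds are assumptions: they are one plausible reading of the table's qualitative ranges (for example, "Hours-Days" taken as 24 to 48 hours), and the names `Tier` and `cheapest_tier` are hypothetical. Replace the numbers with figures measured in your own environment.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Tier:
    name: str
    worst_rpo_min: float    # assumed upper bound on data loss, in minutes
    worst_rto_min: float    # assumed upper bound on downtime, in minutes
    cost_multiplier: float  # upper end of the table's relative cost


# Illustrative numeric readings of the table's ranges, not measurements.
TIERS = [
    Tier("Tier 1: Backup & Restore", 24 * 60, 48 * 60, 1.1),
    Tier("Tier 2: Pilot Light", 60, 4 * 60, 1.3),
    Tier("Tier 3: Warm Standby", 15, 60, 2.0),
    Tier("Tier 4: Hot Standby", 1, 10, 3.0),
    Tier("Tier 5: Active-Active", 0.1, 1, 4.0),
]


def cheapest_tier(rpo_min: float, rto_min: float) -> Optional[Tier]:
    """Return the lowest-cost tier whose worst case meets both objectives."""
    candidates = [
        t for t in TIERS
        if t.worst_rpo_min <= rpo_min and t.worst_rto_min <= rto_min
    ]
    return min(candidates, key=lambda t: t.cost_multiplier, default=None)


# A service that tolerates 30 minutes of data loss and 2 hours of downtime.
choice = cheapest_tier(rpo_min=30, rto_min=120)
print(choice.name if choice else "No tier meets these objectives")
# prints "Tier 3: Warm Standby"
```

Filtering on both objectives before minimizing cost mirrors how the table is meant to be read: the tighter of the two requirements decides the tier.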
Multi-Region Architecture Diagram
                   ┌──────────────┐
                   │   Route 53   │
                   │ (DNS / GTM)  │
                   └──────┬───────┘
                          │ health-check routing
             ┌────────────┴────────────┐
             │                         │
    ┌────────▼────────┐       ┌────────▼────────┐
    │    Region A     │       │    Region B     │
    │    (Primary)    │       │   (Secondary)   │
    │                 │       │                 │
    │  ┌───────────┐  │       │  ┌───────────┐  │
    │  │    ALB    │  │       │  │    ALB    │  │
    │  └─────┬─────┘  │       │  └─────┬─────┘  │
    │        │        │       │        │        │
    │  ┌─────▼─────┐  │       │  ┌─────▼─────┐  │
    │  │ App Tier  │  │       │  │ App Tier  │  │
    │  │ (EKS/ECS) │  │       │  │ (EKS/ECS) │  │
    │  └─────┬─────┘  │       │  └─────┬─────┘  │
    │        │        │       │        │        │
    │  ┌─────▼─────┐  │       │  ┌─────▼─────┐  │
    │  │    RDS    │──┼───────┼─►│    RDS    │  │
    │  │  Primary  │  │ async │  │   Read    │  │
    │  │           │  │ repl  │  │  Replica  │  │
    │  └───────────┘  │       │  └───────────┘  │
    │                 │       │                 │
    │  ┌───────────┐  │       │  ┌───────────┐  │
    │  │ S3 Bucket │──┼───────┼─►│ S3 Bucket │  │
    │  │ (primary) │  │  CRR  │  │ (replica) │  │
    │  └───────────┘  │       │  └───────────┘  │
    └─────────────────┘       └─────────────────┘
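The health-check routing at the top of the diagram boils down to a small decision rule. The following is a minimal sketch of one such rule in plain Python, not Route 53's actual algorithm: fail over after several consecutive failed checks, but only if the asynchronous replica is close enough to the primary. The class name, the threshold of 3 checks, and the 30-second lag limit are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class FailoverController:
    """Tracks primary health and decides which region should serve traffic."""
    failure_threshold: int = 3       # consecutive failed checks before failover (assumed)
    max_replica_lag_s: float = 30.0  # refuse automatic promotion beyond this lag (assumed)
    consecutive_failures: int = 0
    active_region: str = "region-a"

    def observe(self, primary_healthy: bool, replica_lag_s: float) -> str:
        """Record one health-check result and return the region to serve from."""
        if self.active_region != "region-a":
            # Already failed over; failback is treated as a manual step here.
            return self.active_region
        if primary_healthy:
            self.consecutive_failures = 0
            return self.active_region
        self.consecutive_failures += 1
        if (self.consecutive_failures >= self.failure_threshold
                and replica_lag_s <= self.max_replica_lag_s):
            self.active_region = "region-b"
        # If the lag is too high, stay put: promoting the replica now would
        # lose more data than the RPO allows, so a human should decide.
        return self.active_region


ctl = FailoverController()
for healthy in (True, False, False, False):
    region = ctl.observe(healthy, replica_lag_s=4.0)
print(region)  # prints "region-b"
```

Requiring consecutive failures avoids flapping on a single dropped check, and the lag guard ties the failover decision back to the RPO, since asynchronous replication means the replica can be behind at the moment the primary fails.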
DR Testing Checklist
| Test Type | Frequency | Scope | Disruption | Confidence |
|---|---|---|---|---|
| Tabletop exercise | Quarterly | Walk through runbooks | None | Low-Medium |
| Component failover | Monthly | Single service/DB failover | Minimal | Medium |
| Regional failover | Semi-annual | Full region switchover | Moderate | High |
| Chaos engineering | Continuous | Random failure injection | Varies | High |
| Full DR drill | Annual | Simulate complete outage | Significant | Very High |
| Backup restore test | Monthly | Restore from backup and verify data integrity | None | Medium |
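A schedule like this only helps if someone notices when a test slips. As a rough sketch, the frequencies above can be turned into maximum intervals and checked against the date each test last ran. The day counts are an assumed translation of "Monthly", "Quarterly", and so on, the function name `overdue_tests` is hypothetical, and the dates in the example are made up.

```python
from datetime import date

# Maximum days between runs, derived from the Frequency column above.
# Chaos engineering is continuous, so it is not tracked here.
MAX_INTERVAL_DAYS = {
    "Tabletop exercise": 92,
    "Component failover": 31,
    "Regional failover": 183,
    "Full DR drill": 366,
    "Backup restore test": 31,
}


def overdue_tests(last_run: dict[str, date], today: date) -> list[str]:
    """Return test types that never ran or last ran too long ago."""
    overdue = []
    for test, max_days in MAX_INTERVAL_DAYS.items():
        ran = last_run.get(test)
        if ran is None or (today - ran).days > max_days:
            overdue.append(test)
    return overdue


# Hypothetical history: no backup restore test has ever been recorded.
history = {
    "Tabletop exercise": date(2024, 5, 1),
    "Component failover": date(2024, 4, 1),
    "Regional failover": date(2024, 1, 15),
    "Full DR drill": date(2023, 9, 1),
}
print(overdue_tests(history, today=date(2024, 6, 30)))
# prints ['Component failover', 'Backup restore test']
```

Treating a missing record as overdue is deliberate: a test with no evidence of having run should be counted as not run.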
Cost vs Recovery Matrix
    Cost ▲
      4x │                                     ● Active-Active
         │
      3x │                            ● Hot Standby
         │
      2x │                   ● Warm Standby
         │
    1.5x │          ● Pilot Light
         │
      1x │ ● Backup/Restore
         │
         └────────────────────────────────────────────► Recovery Speed
           Days     Hours    30min    Minutes  Seconds
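The chart shows what each tier costs to run, but not what an outage costs under it. One rough way to combine the two is to add the expected downtime loss to the infrastructure spend and compare totals. The sketch below does that using the chart's cost multipliers; the recovery times per tier, the baseline spend, the outage rate, and the loss-per-minute figures are all assumptions chosen for illustration, and the model ignores data-loss (RPO) and reputational costs.

```python
# (strategy, relative cost from the chart, assumed recovery time in minutes)
TIERS = [
    ("Backup/Restore", 1.0, 24 * 60),
    ("Pilot Light", 1.5, 4 * 60),
    ("Warm Standby", 2.0, 30),
    ("Hot Standby", 3.0, 5),
    ("Active-Active", 4.0, 0.5),
]


def annual_cost(multiplier, rto_min, baseline, loss_per_min, outages_per_year):
    """Infrastructure spend plus expected downtime loss for one year."""
    infra = baseline * multiplier
    downtime = outages_per_year * rto_min * loss_per_min
    return infra + downtime


def best_tier(baseline, loss_per_min, outages_per_year):
    """Return the strategy with the lowest expected annual cost."""
    return min(
        TIERS,
        key=lambda t: annual_cost(t[1], t[2], baseline, loss_per_min, outages_per_year),
    )[0]


# Hypothetical inputs: 500k/year baseline infrastructure, one regional outage every two years.
for loss in (100, 1_000, 20_000):
    print(loss, best_tier(baseline=500_000, loss_per_min=loss, outages_per_year=0.5))
# prints:
# 100 Backup/Restore
# 1000 Pilot Light
# 20000 Warm Standby
```

Even with generous assumptions, the cheapest total shifts toward faster tiers only as the cost of a minute of downtime grows, which is why the tier should follow from the business impact rather than from the technology available.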
Architecture Decisions by Business Criticality
| System Type | Recommended Tier | Justification |
|---|---|---|
| Internal tools | Tier 1-2 | Downtime tolerable, cost-sensitive |
| B2B SaaS (standard) | Tier 3 | SLA typically 99.9%, hourly RPO acceptable |
| B2B SaaS (enterprise) | Tier 4 | SLA 99.95%+, minutes RPO |
| E-commerce | Tier 4 | Revenue loss per minute is quantifiable |
| Financial services | Tier 5 | Regulatory requirements, zero data loss |
| Healthcare (critical) | Tier 5 | Patient safety, compliance mandates |
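The SLA figures in the table translate directly into a yearly downtime budget, which is what makes the tier recommendations follow. A small sketch of the arithmetic is below; the function names are hypothetical, and the RTO values passed in the example (60 and 10 minutes) are assumed readings of the Tier 3 and Tier 4 rows rather than guarantees.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, ignoring leap years


def downtime_budget_min(sla_percent: float) -> float:
    """Minutes of downtime per year that an availability SLA permits."""
    return MINUTES_PER_YEAR * (1 - sla_percent / 100)


def failovers_within_budget(sla_percent: float, rto_min: float) -> int:
    """How many full-length recoveries fit in a year's downtime budget."""
    return int(downtime_budget_min(sla_percent) // rto_min)


print(round(downtime_budget_min(99.9), 1))   # prints 525.6 (about 8.8 hours)
print(round(downtime_budget_min(99.95), 1))  # prints 262.8 (about 4.4 hours)

print(failovers_within_budget(99.9, rto_min=60))   # prints 8
print(failovers_within_budget(99.95, rto_min=10))  # prints 26
```

Under these assumptions, a 99.9% SLA leaves room for several hour-long recoveries per year, while a backup-and-restore recovery measured in many hours could exhaust the entire budget in a single incident.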
Key Insight
The most common DR failure is not technical; it is that the plan was never tested. A DR plan that has not been exercised in the last 6 months is a hypothesis, not a plan.