Disaster Recovery Architecture: Tiers, Trade-offs & Testing

#architecture #disaster-recovery #reliability #cloud #business-continuity

Disaster recovery (DR) is not a feature you add later. It is an architectural decision that shapes infrastructure cost, operational complexity, and business risk tolerance from day one. The gap between "we have backups" and "we can recover in minutes" is measured in budget and discipline.

DR Tier Comparison

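| Tier   | Strategy         | RPO           | RTO        | Relative Cost | Data Loss Risk      |
|--------|------------------|---------------|------------|---------------|---------------------|
| Tier 0 | No DR plan       | Undefined     | Days-Never | 1x (baseline) | Total loss possible |
| Tier 1 | Backup & Restore | Hours-Days    | Hours-Days | 1.1x          | Last backup window  |
| Tier 2 | Pilot Light      | Minutes-Hours | Hours      | 1.3x          | Minimal             |
| Tier 3 | Warm Standby     | Minutes       | 15-60 min  | 1.5-2x        | Very low            |
| Tier 4 | Hot Standby      | Seconds       | Minutes    | 2-3x          | Near zero           |
| Tier 5 | Active-Active    | Near zero     | Near zero  | 3-4x          | Effectively zero    |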

RPO = Recovery Point Objective (how much data can you afford to lose)
RTO = Recovery Time Objective (how fast must you be back online)
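
One way to use the tier table is to start from the RPO and RTO the business can tolerate and pick the cheapest tier that satisfies both. The sketch below is a minimal illustration in Python. The concrete time bounds (for example, treating "Seconds" as 30 seconds) and the names `Tier`, `TIERS`, and `cheapest_tier` are assumptions made for the example, not values from a standard.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass(frozen=True)
class Tier:
    name: str
    worst_rpo: timedelta    # assumed upper bound on data loss
    worst_rto: timedelta    # assumed upper bound on downtime
    cost_multiplier: float  # upper end of the relative cost range


# Upper bounds loosely read off the tier table; the exact numbers are assumptions.
TIERS = [
    Tier("Tier 1: Backup & Restore", timedelta(days=1), timedelta(days=1), 1.1),
    Tier("Tier 2: Pilot Light", timedelta(hours=1), timedelta(hours=4), 1.3),
    Tier("Tier 3: Warm Standby", timedelta(minutes=15), timedelta(minutes=60), 2.0),
    Tier("Tier 4: Hot Standby", timedelta(seconds=30), timedelta(minutes=5), 3.0),
    Tier("Tier 5: Active-Active", timedelta(seconds=1), timedelta(seconds=1), 4.0),
]


def cheapest_tier(max_rpo: timedelta, max_rto: timedelta) -> Optional[Tier]:
    """Return the lowest-cost tier whose worst-case RPO and RTO both fit the targets."""
    candidates = [
        t for t in TIERS
        if t.worst_rpo <= max_rpo and t.worst_rto <= max_rto
    ]
    # None means no tier in this table can meet the stated targets.
    return min(candidates, key=lambda t: t.cost_multiplier, default=None)
```

With these assumed bounds, a requirement of one hour for both RPO and RTO rules out Pilot Light (its recovery time is too slow) and lands on Warm Standby, which shows how the stricter of the two objectives usually drives the tier.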

Multi-Region Architecture Diagram

                    ┌──────────────┐
                    │   Route 53   │
                    │  (DNS / GTM) │
                    └──────┬───────┘
                           │ health-check routing
              ┌────────────┴────────────┐
              │                         │
     ┌────────▼────────┐      ┌────────▼────────┐
     │  Region A       │      │  Region B       │
     │  (Primary)      │      │  (Secondary)    │
     │                 │      │                 │
     │  ┌───────────┐  │      │  ┌───────────┐  │
     │  │    ALB    │  │      │  │    ALB    │  │
     │  └─────┬─────┘  │      │  └─────┬─────┘  │
     │        │        │      │        │        │
     │  ┌─────▼─────┐  │      │  ┌─────▼─────┐  │
     │  │ App Tier  │  │      │  │ App Tier  │  │
     │  │ (EKS/ECS) │  │      │  │ (EKS/ECS) │  │
     │  └─────┬─────┘  │      │  └─────┬─────┘  │
     │        │        │      │        │        │
     │  ┌─────▼─────┐  │      │  ┌─────▼─────┐  │
     │  │  RDS      │──┼──────┼─►│  RDS      │  │
     │  │  Primary  │  │ async│  │  Read     │  │
     │  │           │  │ repl  │  │  Replica  │  │
     │  └───────────┘  │      │  └───────────┘  │
     │                 │      │                 │
     │  ┌───────────┐  │      │  ┌───────────┐  │
     │  │ S3 Bucket │──┼──────┼─►│ S3 Bucket │  │
     │  │ (primary) │  │ CRR  │  │ (replica) │  │
     │  └───────────┘  │      │  └───────────┘  │
     └─────────────────┘      └─────────────────┘
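
The diagram implies two decisions at failover time: whether the primary is really down (health-check routing) and how much data the asynchronous replica is missing (replication lag). The following Python sketch models that logic without calling any cloud API. The function name, the threshold of three consecutive failures, and the 60-second RPO target are all hypothetical values chosen for illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class FailoverDecision:
    fail_over: bool
    rpo_breached: bool
    reason: str


def decide_failover(
    recent_checks: Sequence[bool],    # newest last; True means the primary answered
    replica_lag_seconds: float,       # how far the async replica trails the primary
    failure_threshold: int = 3,       # assumed: consecutive failures before switching
    max_rpo_seconds: float = 60.0,    # assumed RPO target
) -> FailoverDecision:
    """Fail over only after N consecutive failed checks, and flag RPO breaches."""
    tail = list(recent_checks)[-failure_threshold:]
    primary_down = len(tail) == failure_threshold and not any(tail)
    if not primary_down:
        return FailoverDecision(False, False, "fewer than N consecutive failures")

    # With asynchronous replication, writes not yet shipped are lost on promotion,
    # so the replica lag at failover time is the effective data loss.
    breached = replica_lag_seconds > max_rpo_seconds
    reason = (
        "failing over; replica lag exceeds RPO target"
        if breached
        else "failing over; replica lag within RPO target"
    )
    return FailoverDecision(True, breached, reason)
```

Requiring several consecutive failures is a deliberate trade-off: it adds a little to RTO but avoids promoting the replica because of a single dropped health check.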

DR Testing Checklist

| Test Type           | Frequency   | Scope                         | Disruption  | Confidence |
|---------------------|-------------|-------------------------------|-------------|------------|
| Tabletop exercise   | Quarterly   | Walk through runbooks         | None        | Low-Medium |
| Component failover  | Monthly     | Single service/DB failover    | Minimal     | Medium     |
| Regional failover   | Semi-annual | Full region switchover        | Moderate    | High       |
| Chaos engineering   | Continuous  | Random failure injection      | Varies      | High       |
| Full DR drill       | Annual      | Simulate complete outage      | Significant | Very High  |
| Backup restore test | Monthly     | Restore from backup to verify | None        | Medium     |
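
A checklist like this only helps if someone notices when a test slips. As a rough sketch, the Python below flags test types whose last run is older than the cadence in the table. The day counts assigned to each frequency and the `overdue_tests` helper are assumptions for illustration. Chaos engineering is left out because "continuous" has no meaningful due date.

```python
from datetime import date
from typing import Dict, List

# Maximum allowed gap per test type, in days (assumed mapping of the
# checklist frequencies: monthly=31, quarterly=92, semi-annual=183, annual=366).
MAX_GAP_DAYS: Dict[str, int] = {
    "Tabletop exercise": 92,
    "Component failover": 31,
    "Regional failover": 183,
    "Full DR drill": 366,
    "Backup restore test": 31,
}


def overdue_tests(last_run: Dict[str, date], today: date) -> List[str]:
    """Return test types that were never run or whose last run is too old."""
    overdue = []
    for test, max_gap in MAX_GAP_DAYS.items():
        ran = last_run.get(test)
        if ran is None or (today - ran).days > max_gap:
            overdue.append(test)
    return overdue
```

A check like this could run in a scheduled job and open a ticket for each overdue item, though how the last-run dates are recorded is left open here.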

Cost vs Recovery Matrix

Cost ▲
  4x │                                    ● Active-Active
     │
  3x │                          ● Hot Standby
     │
  2x │                ● Warm Standby
     │
1.5x │        ● Pilot Light
     │
  1x │ ● Backup/Restore
     │
     └────────────────────────────────────────────► Recovery Speed
       Days      Hours      30min    Minutes   Seconds
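
The chart shows what each strategy costs but not when the extra spend pays off. A simple way to reason about that is to add the expected loss from outages to the infrastructure cost and compare totals. The Python sketch below does this under deliberately crude assumptions: a fixed number of outages per year, a constant loss per hour of downtime, and hypothetical recovery times per strategy. The function names are illustrative.

```python
from typing import Dict, Tuple


def expected_annual_cost(
    baseline_cost: float,     # yearly infrastructure cost with no DR (the 1x point)
    cost_multiplier: float,   # relative cost from the chart, e.g. 2.0 for Warm Standby
    outages_per_year: float,  # assumed outage frequency
    recovery_hours: float,    # assumed downtime per outage for this strategy
    loss_per_hour: float,     # assumed business loss per hour of downtime
) -> float:
    """Infrastructure spend plus the expected loss from downtime."""
    infrastructure = baseline_cost * cost_multiplier
    outage_loss = outages_per_year * recovery_hours * loss_per_hour
    return infrastructure + outage_loss


def cheapest_option(
    options: Dict[str, Tuple[float, float]],  # name -> (cost_multiplier, recovery_hours)
    baseline_cost: float,
    outages_per_year: float,
    loss_per_hour: float,
) -> str:
    """Return the strategy name with the lowest expected annual cost."""
    return min(
        options,
        key=lambda name: expected_annual_cost(
            baseline_cost, options[name][0], outages_per_year,
            options[name][1], loss_per_hour,
        ),
    )


# Multipliers follow the chart; the recovery hours are assumptions.
OPTIONS = {
    "Backup/Restore": (1.0, 24.0),
    "Warm Standby": (2.0, 0.5),
    "Active-Active": (4.0, 0.0),
}
```

With a baseline of 100,000 per year and one outage annually, these assumed numbers favour Backup/Restore at a loss of 1,000 per hour, Warm Standby at 10,000 per hour, and Active-Active at 1,000,000 per hour. The model ignores reputational damage and regulatory penalties, so it gives a lower bound on the case for a higher tier.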

Architecture Decisions by Business Criticality

| System Type           | Recommended Tier | Justification                              |
|-----------------------|------------------|--------------------------------------------|
| Internal tools        | Tier 1-2         | Downtime tolerable, cost-sensitive         |
| B2B SaaS (standard)   | Tier 3           | SLA typically 99.9%, hourly RPO acceptable |
| B2B SaaS (enterprise) | Tier 4           | SLA 99.95%+, minutes RPO                   |
| E-commerce            | Tier 4           | Revenue loss per minute is quantifiable    |
| Financial services    | Tier 5           | Regulatory requirements, zero data loss    |
| Healthcare (critical) | Tier 5           | Patient safety, compliance mandates        |
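
The SLA percentages in the table translate directly into a downtime budget, which is useful when checking whether a tier's RTO is realistic. A 99.9% SLA allows roughly 8.8 hours of downtime per year, and 99.95% allows about 4.4 hours. The small Python helper below performs that arithmetic. The function name and the 365-day default period are choices made for this example.

```python
def allowed_downtime_minutes(sla_percent: float, period_days: float = 365.0) -> float:
    """Downtime budget, in minutes, implied by an availability SLA over a period."""
    if not 0.0 <= sla_percent <= 100.0:
        raise ValueError("sla_percent must be between 0 and 100")
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - sla_percent / 100)


def outages_within_budget(sla_percent: float, rto_minutes: float) -> int:
    """How many full-length outages per year fit in the budget at a given RTO."""
    return int(allowed_downtime_minutes(sla_percent) // rto_minutes)
```

For example, with a 60-minute RTO a 99.9% SLA leaves room for eight worst-case outages a year, while 99.95% leaves room for four. That is one reason the enterprise row moves up a tier.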

Key Insight

The most common DR failure is not technical: the plan was simply never tested. A DR plan that has not been exercised in the last six months is a hypothesis, not a plan.

Resources