Designing for Recovery

Recovery Principles, Breakglass & Emergency Access

Infrastructure Reliability | Technical Operations Excellence

3-2-1    Backup Rule
<15m     Tier-0 RTO
0        Tier-0 RPO
MPA      Multi-Party Auth

Recovery Design Principles

Principle                     Application
Go fast, guarded              Speed with policy guardrails
Minimize time dependencies    Don't depend on wall-clock time
Know intended state           Encode complete configuration
Test restores                 Untested backups = no backups
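
As one way to picture "Know intended state", the sketch below compares an encoded intended configuration against what a system actually reports and lists every difference. The function name find_drift, the ABSENT marker, and the example keys are hypothetical and chosen only for illustration.

```python
# Minimal drift check: assumes both states are flat dictionaries.
ABSENT = "<absent>"  # hypothetical marker for a key missing on one side


def find_drift(intended, actual):
    """Return {key: (intended_value, actual_value)} for every mismatch."""
    drift = {}
    for key in sorted(set(intended) | set(actual)):
        want = intended.get(key, ABSENT)
        have = actual.get(key, ABSENT)
        if want != have:
            drift[key] = (want, have)
    return drift


# Example values are invented for illustration.
intended = {"replicas": 3, "tls": "required", "backup_schedule": "hourly"}
actual = {"replicas": 3, "tls": "optional", "debug": True}
drift = find_drift(intended, actual)
for key, (want, have) in drift.items():
    print(f"{key}: intended={want!r} actual={have!r}")
```

An empty result means the running system matches the encoded state; anything else is a candidate for correction before or during recovery.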

3-2-1 Backup Strategy

3 Copies

Original + 2 backups minimum

2 Media Types

Different storage technologies

1 Offsite

Geographic separation
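
The three conditions above can be checked mechanically against a backup inventory. This is a minimal sketch: the Copy record, its fields, and the site and media labels are assumptions made for the example, and the original dataset is counted as one of the three copies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Copy:
    name: str
    media: str  # storage technology label, e.g. "disk", "tape"
    site: str   # physical location label


def check_321(copies, primary_site):
    """Return a list of 3-2-1 violations; an empty list means compliant."""
    problems = []
    if len(copies) < 3:
        problems.append("fewer than 3 copies")
    if len({c.media for c in copies}) < 2:
        problems.append("fewer than 2 media types")
    if not any(c.site != primary_site for c in copies):
        problems.append("no offsite copy")
    return problems


# Hypothetical inventory: the original plus two backups.
inventory = [
    Copy("original", "disk", "dc-east"),
    Copy("nightly", "disk", "dc-east"),
    Copy("archive", "tape", "vault-west"),
]
violations = check_321(inventory, primary_site="dc-east")
print(violations or "3-2-1 satisfied")
```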

RTO & RPO Targets

Tier   Systems          RTO     RPO
0      Critical APIs    <15m    0
1      Core services    <4h     <1h
2      Internal tools   <24h    <4h
3      Dev/test         <72h    <24h
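
The targets in the table can be encoded and compared against what a drill or incident actually measured. The sketch below assumes downtime and data loss are both expressed as durations. The name meets_targets is hypothetical, and treating an RPO of 0 as "no data loss at all" is an interpretation of the table.

```python
from datetime import timedelta

# (RTO, RPO) per tier, copied from the table above.
TARGETS = {
    0: (timedelta(minutes=15), timedelta(0)),
    1: (timedelta(hours=4), timedelta(hours=1)),
    2: (timedelta(hours=24), timedelta(hours=4)),
    3: (timedelta(hours=72), timedelta(hours=24)),
}


def meets_targets(tier, downtime, data_loss):
    """Return (rto_met, rpo_met) for a measured recovery."""
    rto, rpo = TARGETS[tier]
    rto_met = downtime < rto
    if rpo == timedelta(0):
        # Assumption: an RPO of 0 means zero lost data.
        rpo_met = data_loss == timedelta(0)
    else:
        rpo_met = data_loss < rpo
    return rto_met, rpo_met


# Hypothetical drill results.
print(meets_targets(0, timedelta(minutes=10), timedelta(0)))
print(meets_targets(1, timedelta(hours=5), timedelta(minutes=30)))
```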

Recovery Testing

  • Quarterly: Full restore drill for Tier-0
  • Monthly: Point-in-time recovery test
  • Weekly: Backup integrity verification
  • Daily: Automated backup monitoring

Breakglass Procedures

Mechanism       Purpose
Breakglass      Override normal access controls
MPA             Multi-party authorization
Offline creds   Independent of primary systems
Temp access     Time-bounded elevation

Document business justification for all elevated access
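
A rough sketch of how multi-party authorization, time-bounded elevation, and a recorded justification might fit together. Every name here is hypothetical, and the threshold of two approvers, the one-hour limit, and the rule that requesters cannot approve their own request are assumptions for illustration, not values from this document.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import Optional

REQUIRED_APPROVERS = 2            # assumed MPA threshold
MAX_DURATION = timedelta(hours=1)  # assumed elevation limit


@dataclass
class BreakglassRequest:
    requester: str
    justification: str
    approvers: set = field(default_factory=set)
    granted_at: Optional[datetime] = None


def approve(req, approver):
    if approver == req.requester:
        raise ValueError("requester cannot approve their own request")
    req.approvers.add(approver)


def grant(req, now, audit_log):
    """Grant access if justified and approved; return the expiry time."""
    if not req.justification.strip():
        raise ValueError("business justification is required")
    if len(req.approvers) < REQUIRED_APPROVERS:
        raise PermissionError("not enough independent approvers")
    req.granted_at = now
    audit_log.append((now, req.requester, sorted(req.approvers), req.justification))
    return now + MAX_DURATION


def is_active(req, now):
    return req.granted_at is not None and now < req.granted_at + MAX_DURATION


# Hypothetical walk-through.
audit_log = []
t0 = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
req = BreakglassRequest("alice", "Tier-0 API outage, primary SSO unavailable")
approve(req, "bob")
approve(req, "carol")
expires = grant(req, t0, audit_log)
print(expires, is_active(req, t0 + timedelta(minutes=30)))
```

Passing the current time in explicitly keeps the logic testable and avoids hiding a clock dependency inside the access check.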

Emergency Access Must-Haves

  • Work when systems fail: Independent channel
  • Pre-staged credentials: Not just-in-time during crisis
  • Tested regularly: Part of disaster drills
  • Audit trail: All access logged

Disaster Validation

Exercise            Frequency
Tabletop            Monthly
Failover drill      Quarterly
Full DR test        Annually
Chaos experiments   Continuous
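
One way to keep this cadence honest is to flag any exercise whose last run is older than its interval. In this sketch the day counts are approximations of "monthly", "quarterly", and "annually", the dates are invented, and chaos experiments are left out because a continuous activity has no interval to miss.

```python
from datetime import date, timedelta

# Approximate maximum gap between runs for each scheduled exercise.
MAX_INTERVAL = {
    "Tabletop": timedelta(days=31),
    "Failover drill": timedelta(days=92),
    "Full DR test": timedelta(days=366),
}


def overdue(last_run, today):
    """Return exercises that were never run or ran too long ago."""
    late = []
    for name, interval in MAX_INTERVAL.items():
        last = last_run.get(name)
        if last is None or today - last > interval:
            late.append(name)
    return late


# Hypothetical history: the full DR test has never been run.
last_run = {
    "Tabletop": date(2024, 5, 15),
    "Failover drill": date(2024, 1, 10),
}
late = overdue(last_run, today=date(2024, 6, 1))
print(late)
```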

Recovery Checklist

  • Runbooks documented and tested
  • Contact list current
  • Backup restore verified
  • Failover procedure practiced

Plan to Fail

The best recovery is the one you've practiced.