Chaos Engineering

Principles, Experiments & GameDay Practices

Resilience Patterns | Technical Operations Excellence

5 Maturity Levels
4 Experiment Phases
2011 Chaos Monkey Born
0 Incidents During

Chaos Engineering Principles

Principle | Description
Hypothesis | Define steady state and expected behavior
Vary Real-World Events | Simulate production conditions
Run in Prod | Staging doesn't catch all issues
Automate | Continuous experimentation
Minimize Blast Radius | Start small, abort on harm
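
A steady-state hypothesis can be written down as a metric plus a tolerance band. The sketch below is only an illustration: the class name `SteadyState`, the metric name, the baseline of 0.999, and the 1% tolerance are all assumptions, not values from this sheet.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class SteadyState:
    """Hypothetical steady-state hypothesis: a metric stays inside a band."""
    metric: str
    baseline: float   # value observed before injection (must be non-zero)
    tolerance: float  # allowed relative deviation, e.g. 0.01 = 1%

    def holds(self, samples: list[float]) -> bool:
        """Return True if the mean of the samples stays within the band."""
        if not samples:
            return False  # no data: the hypothesis cannot be confirmed
        deviation = abs(mean(samples) - self.baseline) / self.baseline
        return deviation <= self.tolerance


# Assumed example values: success rate should stay within 1% of 0.999.
hypothesis = SteadyState(metric="success_rate", baseline=0.999, tolerance=0.01)
print(hypothesis.holds([0.998, 0.997, 0.999]))  # samples close to baseline
print(hypothesis.holds([0.90, 0.92, 0.91]))     # samples far from baseline
```

Treating missing data as a failed hypothesis is a deliberate choice here: an experiment that cannot be observed should not count as a pass.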

Experiment Design

Phase | Actions
1. Hypothesis | Define steady state metrics
2. Design | Choose failure injection type
3. Execute | Run with monitoring active
4. Analyze | Compare results to hypothesis
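
The four phases above can be sketched as a single function. This is a minimal illustration, assuming the caller supplies three hypothetical callables (`measure`, `inject`, `restore`) and an absolute `tolerance`; a real experiment would sample metrics over time rather than once.

```python
from typing import Callable


def run_experiment(
    measure: Callable[[], float],
    inject: Callable[[], None],
    restore: Callable[[], None],
    tolerance: float,
) -> dict:
    """Run one experiment and report whether the hypothesis held."""
    # 1. Hypothesis: record the steady-state value before any injection.
    baseline = measure()
    # 2. Design: the failure type is whatever the caller passed as `inject`.
    # 3. Execute: inject, observe, and always restore, even on error.
    inject()
    try:
        observed = measure()
    finally:
        restore()
    # 4. Analyze: compare the observation to the hypothesis.
    held = abs(observed - baseline) <= tolerance
    return {"baseline": baseline, "observed": observed, "hypothesis_held": held}


# Toy system (assumed): a latency value that the injection raises to 180 ms.
state = {"latency_ms": 100.0}
result = run_experiment(
    measure=lambda: state["latency_ms"],
    inject=lambda: state.update(latency_ms=180.0),
    restore=lambda: state.update(latency_ms=100.0),
    tolerance=50.0,
)
print(result)
```

The `try`/`finally` ensures the injection is undone even if observation raises, which matches the "abort on harm" principle.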

Maturity Model

Level | Capability
1. Ad-hoc | Manual, sporadic testing
2. Basic | Simple failure injection
3. Repeatable | Documented experiments
4. Automated | CI/CD integrated chaos
5. Optimized | Continuous chaos in prod

10 Core Experiments

Experiment | Tests
Instance Kill | Auto-recovery, failover
Zone Failure | Multi-AZ resilience
Network Latency | Timeout handling
Packet Loss | Retry logic
Dependency Down | Circuit breakers

Also: CPU stress, memory pressure, disk fill, DNS failure, clock skew
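
At the application level, the Network Latency and Dependency Down experiments can be approximated by wrapping a dependency call. The wrapper below is a hedged sketch, not a real tool's API: `with_faults`, `InjectedFault`, and the parameter names are hypothetical, and production tooling usually injects faults at the network or infrastructure layer instead.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class InjectedFault(Exception):
    """Raised in place of a real dependency error during an experiment."""


def with_faults(
    call: Callable[[], T],
    *,
    latency_s: float = 0.0,
    failure_rate: float = 0.0,
    rng: Optional[random.Random] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> Callable[[], T]:
    """Wrap `call` so it is delayed and/or fails with a given probability."""
    rng = rng or random.Random()

    def wrapped() -> T:
        if latency_s > 0:
            sleep(latency_s)  # simulated network latency
        if rng.random() < failure_rate:
            raise InjectedFault("simulated dependency failure")
        return call()

    return wrapped


# Record delays instead of really sleeping, so the example runs instantly.
delays: list = []
slow = with_faults(lambda: "ok", latency_s=0.2, sleep=delays.append)
print(slow(), delays)

down = with_faults(lambda: "ok", failure_rate=1.0)
try:
    down()
except InjectedFault as exc:
    print("caught:", exc)
```

The `sleep` and `rng` parameters are injectable so that tests of timeout and retry logic stay fast and repeatable.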

Safety Requirements

Abort Conditions

Define clear stop criteria before starting

Blast Radius

Limit scope; start with 1% of traffic

Rollback Plan

A tested rollback must be ready to trigger before injection starts
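
Two of these requirements lend themselves to a short sketch: selecting a small, stable slice of traffic, and checking an abort condition. The function names, the use of a request ID as the hashing key, and the 5% error-rate threshold are assumptions made for illustration.

```python
import hashlib


def in_blast_radius(request_id: str, percent: float) -> bool:
    """Deterministically select roughly `percent` % of requests for injection."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 10_000  # 0..9999
    return bucket < percent * 100


def should_abort(errors: int, total: int, max_error_rate: float) -> bool:
    """Abort condition: the observed error rate exceeds the agreed limit."""
    if total == 0:
        return False  # nothing observed yet
    return errors / total > max_error_rate


# Start with 1% of traffic, as recommended above.
selected = sum(in_blast_radius(f"req-{i}", 1.0) for i in range(10_000))
print(f"{selected} of 10000 requests selected")

# Assumed stop criterion: abort once more than 5% of requests fail.
print(should_abort(errors=6, total=100, max_error_rate=0.05))
```

Hashing the request ID rather than drawing a random number keeps the same requests inside the blast radius for the whole experiment, which makes the observed behavior easier to trace.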

GameDay Format

Time | Activity
0:00 | Brief team, review hypothesis
0:15 | Start observability baseline
0:30 | Inject failure
1:00 | Observe, document behaviors
1:30 | Stop injection, verify recovery
2:00 | Debrief, document findings
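
To turn this format into a concrete agenda, the offsets can be added to a chosen start time. The sketch assumes the times above are hours:minutes elapsed since the start of the GameDay; the `schedule` helper and the 14:00 start are hypothetical.

```python
from datetime import datetime, timedelta

# Offsets in minutes from the start, taken from the GameDay format above.
AGENDA = [
    (0, "Brief team, review hypothesis"),
    (15, "Start observability baseline"),
    (30, "Inject failure"),
    (60, "Observe, document behaviors"),
    (90, "Stop injection, verify recovery"),
    (120, "Debrief, document findings"),
]


def schedule(start: datetime) -> list:
    """Return (wall-clock time, activity) pairs for a given start time."""
    return [
        ((start + timedelta(minutes=offset)).strftime("%H:%M"), activity)
        for offset, activity in AGENDA
    ]


# Assumed start time for illustration.
for clock, activity in schedule(datetime(2024, 1, 1, 14, 0)):
    print(clock, activity)
```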

GameDay Roles

Role | Responsibility
Facilitator | Run experiment, track time
Observer | Monitor dashboards
Scribe | Document findings
Safety Officer | Call abort if needed

Safety Checklist

  • ☐ Abort conditions defined
  • ☐ Rollback plan documented
  • ☐ Blast radius limited (start at 1%, stay under 10% of traffic)
  • ☐ Monitoring dashboards open
  • ☐ Stakeholders notified
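
The checklist can also act as a gate in tooling, so that injection is refused until every item is confirmed. This is a minimal sketch; the `preflight` function and the way items are confirmed are assumptions.

```python
CHECKLIST = [
    "Abort conditions defined",
    "Rollback plan documented",
    "Blast radius limited",
    "Monitoring dashboards open",
    "Stakeholders notified",
]


def preflight(confirmed: set) -> list:
    """Return the checklist items that are still unconfirmed, in order."""
    return [item for item in CHECKLIST if item not in confirmed]


# Example: two items have been confirmed so far.
missing = preflight({"Abort conditions defined", "Monitoring dashboards open"})
if missing:
    print("Do not inject yet. Missing:", ", ".join(missing))
else:
    print("All checks confirmed.")
```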

Fail Safely

Better to find weaknesses before your customers do.