Chaos Engineering | Bot Army SRE

2011: Chaos Monkey born

Principle	Description
Build a Hypothesis	Define steady state and expected behavior
Vary Real-World Events	Simulate production conditions
Run in Production	Staging doesn't catch all issues
Automate Experiments	Run experiments continuously
Minimize Blast Radius	Start small, abort on harm

Phase	Actions
1. Hypothesis	Define steady state metrics
2. Design	Choose failure injection type
3. Execute	Run with monitoring active
4. Analyze	Compare results to hypothesis
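
The four phases above can be sketched as a small experiment runner. This is a minimal illustration, not a real chaos tool: `SteadyState`, `run_experiment`, and the `read_metric`, `inject`, and `rollback` callables are hypothetical names introduced here, and the metric source is assumed to be whatever monitoring is already in place.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class SteadyState:
    """Hypothesis: the metric stays within [low, high] during the experiment."""
    metric_name: str
    low: float
    high: float

    def holds(self, value: float) -> bool:
        return self.low <= value <= self.high


def run_experiment(
    hypothesis: SteadyState,
    read_metric: Callable[[], float],   # hypothetical hook into monitoring
    inject: Callable[[], None],         # hypothetical failure injection
    rollback: Callable[[], None],       # hypothetical recovery action
    samples: int = 5,
) -> dict:
    # Phase 1: confirm the steady state before touching anything.
    baseline = read_metric()
    if not hypothesis.holds(baseline):
        return {"status": "skipped", "baseline": baseline, "observed": []}

    # Phases 2-3: apply the chosen failure and observe while it is active.
    observed = []
    inject()
    try:
        for _ in range(samples):
            observed.append(read_metric())
    finally:
        rollback()  # always restore, even if reading the metric raises

    # Phase 4: compare the observations to the hypothesis.
    violations = [v for v in observed if not hypothesis.holds(v)]
    status = "weakness_found" if violations else "hypothesis_held"
    return {"status": status, "baseline": baseline, "observed": observed}
```

Skipping the run when the baseline is already outside the steady state keeps the experiment from compounding an existing incident.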

Other failure injection types: CPU stress, memory pressure, disk fill, DNS failure, clock skew
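
As one illustration of the failure types listed above, DNS failure can be simulated inside a single Python process by patching the resolver. This is a sketch under stated assumptions: `dns_failure` is a hypothetical helper, `payments.internal` is a made-up hostname, and the patch affects only code in the current process that resolves names through `socket.getaddrinfo`. Real experiments usually inject at the network or infrastructure layer instead.

```python
import socket
from contextlib import contextmanager
from unittest import mock


@contextmanager
def dns_failure(blocked_hosts):
    """Make lookups for blocked_hosts fail, in this process only."""
    real_getaddrinfo = socket.getaddrinfo

    def fake_getaddrinfo(host, *args, **kwargs):
        if host in blocked_hosts:
            # Same exception type a failed lookup raises.
            raise socket.gaierror(socket.EAI_NONAME, "injected DNS failure")
        return real_getaddrinfo(host, *args, **kwargs)

    # mock.patch restores the original resolver when the block exits,
    # including when the code under test raises.
    with mock.patch("socket.getaddrinfo", fake_getaddrinfo):
        yield


# Usage: the hostname below is hypothetical.
with dns_failure({"payments.internal"}):
    try:
        socket.getaddrinfo("payments.internal", 443)
    except socket.gaierror as exc:
        print("lookup failed as intended:", exc)
```

Because the context manager undoes the patch on exit, the injection stops as soon as the experiment block ends.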

Define clear stop criteria before starting

Limit scope; start with 1% of traffic

Keep an instant recovery path ready
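
The safety rules above (stop criteria, limited scope, ready recovery) could be encoded roughly as follows. The function names and the threshold values (2% error rate, 800 ms p99 latency) are assumptions chosen for illustration; only the 1% starting scope comes from the text.

```python
import hashlib


def in_blast_radius(user_id: str, percent: float = 1.0) -> bool:
    """Deterministically place about `percent` percent of users in scope."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 10_000  # 0..9999
    return bucket < percent * 100  # 1.0 percent -> buckets 0..99


def should_abort(
    error_rate: float,
    p99_latency_ms: float,
    max_error_rate: float = 0.02,        # assumed threshold
    max_p99_latency_ms: float = 800.0,   # assumed threshold
) -> bool:
    """Stop criteria, fixed before the experiment starts."""
    return error_rate > max_error_rate or p99_latency_ms > max_p99_latency_ms


# Usage: inject only for users in scope, and check the criteria on every
# monitoring tick so the recovery action can fire immediately.
if in_blast_radius("user-42"):
    pass  # apply the failure for this request only

if should_abort(error_rate=0.05, p99_latency_ms=300.0):
    print("abort: stop injection and run recovery")
```

Hashing the user ID, rather than sampling randomly per request, keeps the same users in scope for the whole experiment, which makes the affected population easy to identify afterwards.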

Time	Activity
0:00	Brief team, review hypothesis
0:15	Start observability baseline
0:30	Inject failure
1:00	Observe, document behaviors
1:30	Stop injection, verify recovery
2:00	Debrief, document findings
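
To turn the timeline above into a concrete agenda, the offsets can be added to a chosen start time. This sketch assumes the Time column is hours:minutes elapsed since the start of the session; `AGENDA` and `schedule` are names introduced here for illustration.

```python
from datetime import datetime, timedelta

# (minutes after start, activity), taken from the timeline above.
AGENDA = [
    (0, "Brief team, review hypothesis"),
    (15, "Start observability baseline"),
    (30, "Inject failure"),
    (60, "Observe, document behaviors"),
    (90, "Stop injection, verify recovery"),
    (120, "Debrief, document findings"),
]


def schedule(start: datetime) -> list[str]:
    """Return wall-clock agenda lines for a session beginning at `start`."""
    return [
        f"{start + timedelta(minutes=offset):%H:%M}  {activity}"
        for offset, activity in AGENDA
    ]


for line in schedule(datetime(2024, 1, 15, 10, 0)):
    print(line)
```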

Fail Safely

Better to find weaknesses before your customers do.