Domain 11: Chaos Engineering

Game Days, Blast Radius Control, Failure Injection

SRE Bot | Resilience | Max 30 Points

0-6      Ad-hoc
7-12     Foundational
13-18    Standardized
19-24    Advanced
25-30    Optimized

Scoring Criteria by Level

Level  Criteria
1      No chaos practice; only learn from real outages
2      Occasional game days; manual failure injection
3      Regular chaos experiments; blast radius controlled
4      Continuous chaos in staging; production game days
5      Chaos in production daily; antifragile systems

Assessment Questions

#  Question                                 Max
1  How often do you run chaos experiments?  6
2  How do you control blast radius?         6
3  Do you run game days?                    6
4  How do you apply learnings from chaos?   6
5  What chaos tooling do you use?           6
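
The five questions are worth up to 6 points each, so the total falls into one of the score bands listed at the top of this scorecard. The following is a minimal sketch of that arithmetic, assuming each question is scored as a whole number from 0 to 6 (the scorecard states only the maximum). The names `BANDS` and `maturity_level` are illustrative and do not come from any tool.

```python
from typing import Sequence, Tuple

# Score bands copied from the scorecard: (lowest total, highest total, level name).
BANDS = [
    (0, 6, "Ad-hoc"),
    (7, 12, "Foundational"),
    (13, 18, "Standardized"),
    (19, 24, "Advanced"),
    (25, 30, "Optimized"),
]

NUM_QUESTIONS = 5
MAX_PER_QUESTION = 6  # assumption: integer scores from 0 to 6 per question


def maturity_level(question_scores: Sequence[int]) -> Tuple[int, str]:
    """Return (total score, level name) for the five assessment questions."""
    if len(question_scores) != NUM_QUESTIONS:
        raise ValueError(f"expected {NUM_QUESTIONS} scores, got {len(question_scores)}")
    for score in question_scores:
        if not 0 <= score <= MAX_PER_QUESTION:
            raise ValueError(f"score {score} is outside 0-{MAX_PER_QUESTION}")

    total = sum(question_scores)
    for low, high, name in BANDS:
        if low <= total <= high:
            return total, name
    # Not reachable after validation: the bands cover every total from 0 to 30.
    raise AssertionError("total outside all bands")


if __name__ == "__main__":
    print(maturity_level([3, 4, 2, 3, 3]))
```

A team scoring 3, 4, 2, 3, and 3 totals 15 points, which falls in the 13-18 band.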

Focus Areas

  • Experiments: Hypothesis-driven failure injection
  • Blast Radius: Start small, expand gradually
  • Game Days: Scheduled team resilience exercises
  • Tooling: Chaos Monkey, Gremlin, Litmus
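
The first two focus areas can be combined in one loop: state a hypothesis with a measurable threshold, inject failure into a small fraction of hosts, and widen the blast radius only while the hypothesis holds. The sketch below is tool-agnostic and does not use the API of Chaos Monkey, Gremlin, or Litmus. `Experiment`, `StageResult`, `run_experiment`, and the `inject_and_observe` callback are hypothetical names, and the error-rate function in the usage example is a stand-in for real failure injection plus a metrics query.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Experiment:
    hypothesis: str          # what should stay true while the failure is active
    max_error_rate: float    # steady-state threshold; exceeding it stops the run
    stages: Sequence[float]  # fraction of hosts targeted per stage, smallest first


@dataclass
class StageResult:
    fraction: float
    targets: int
    error_rate: float
    holds: bool


def run_experiment(
    experiment: Experiment,
    hosts: Sequence[str],
    inject_and_observe: Callable[[Sequence[str]], float],
) -> List[StageResult]:
    """Widen the blast radius stage by stage; stop once the hypothesis fails."""
    if not hosts:
        raise ValueError("need at least one host")

    results: List[StageResult] = []
    for fraction in experiment.stages:
        # Always target at least one host and never more than exist.
        count = min(len(hosts), max(1, round(len(hosts) * fraction)))
        targets = hosts[:count]

        # Hypothetical callback: injects the failure into `targets`
        # and returns the error rate observed while it is active.
        error_rate = inject_and_observe(targets)

        holds = error_rate <= experiment.max_error_rate
        results.append(StageResult(fraction, count, error_rate, holds))
        if not holds:
            break  # blast radius control: do not expand past a failed stage
    return results


if __name__ == "__main__":
    hosts = [f"host-{i}" for i in range(100)]
    experiment = Experiment(
        hypothesis="Error rate stays at or below 1% while hosts are down",
        max_error_rate=0.01,
        stages=[0.01, 0.05, 0.25, 1.0],
    )
    # Stand-in observation: 0.1% error rate per failed host.
    for result in run_experiment(experiment, hosts, lambda t: 0.001 * len(t)):
        print(result)
```

With the stand-in observation, the run passes at 1 and 5 hosts, fails at 25, and never reaches the 100% stage. A failed stage is a finding to track and fix before the experiment is repeated.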

Anti-Patterns (Red Flags)

  • Chaos without hypothesis
  • No blast radius controls
  • Chaos findings ignored
  • Chaos only in staging
  • Chaos as one-time event

Evidence Checklist

  • Chaos experiment runbooks exist
  • Game day schedule published
  • Blast radius controls documented
  • Chaos findings tracked and fixed
  • Production chaos (with controls)

Related Domains

Domain       Relationship
Reliability  Validate patterns via chaos
DR           Test DR via chaos experiments
Incidents    Build muscle memory for response

Break Things on Purpose

Find failures before they find you.