Principles, Experiments & GameDay Practices
Resilience Patterns | Technical Operations Excellence

| Principle | Description |
|---|---|
| Hypothesis | Define steady state & expected behavior |
| Vary Real-World Events | Simulate production conditions |
| Run in Prod | Staging doesn't catch all issues |
| Automate | Continuous experimentation |
| Minimize Blast Radius | Start small, abort on harm |

| Phase | Actions |
|---|---|
| 1. Hypothesis | Define steady state metrics |
| 2. Design | Choose failure injection type |
| 3. Execute | Run with monitoring active |
| 4. Analyze | Compare results to hypothesis |
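The four phases above can be sketched as a small harness. This is a minimal illustration, not a real chaos tool: `SteadyState`, `run_experiment`, and the `read_metric`, `inject`, and `rollback` callables are hypothetical names, and the metric values in the demo are made up.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class SteadyState:
    """Phase 1: hypothesis that a metric stays within [low, high]."""
    metric: str
    low: float
    high: float

    def holds(self, value: float) -> bool:
        return self.low <= value <= self.high


def run_experiment(
    hypothesis: SteadyState,
    read_metric: Callable[[], float],
    inject: Callable[[], None],      # Phase 2: the chosen failure injection
    rollback: Callable[[], None],
) -> Dict[str, object]:
    baseline = read_metric()
    if not hypothesis.holds(baseline):
        # Do not inject into a system that is already outside steady state.
        return {"status": "skipped", "baseline": baseline}

    inject()                         # Phase 3: execute while monitoring
    try:
        during = read_metric()
    finally:
        rollback()                   # always undo the fault
    after = read_metric()

    # Phase 4: compare the observation to the hypothesis.
    return {
        "status": "passed" if hypothesis.holds(during) else "failed",
        "baseline": baseline,
        "during": during,
        "recovered": hypothesis.holds(after),
    }


# Demo with a fake system whose success rate drops while the fault is active.
state = {"fault": False}
result = run_experiment(
    SteadyState("success_rate", low=0.99, high=1.0),
    read_metric=lambda: 0.95 if state["fault"] else 0.999,
    inject=lambda: state.update(fault=True),
    rollback=lambda: state.update(fault=False),
)
print(result["status"], result["recovered"])  # failed True
```

A "failed" status here is a useful finding: the hypothesis did not hold under the fault, which is the weakness the experiment was designed to expose.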

| Maturity Level | Capability |
|---|---|
| 1. Ad-hoc | Manual, sporadic testing |
| 2. Basic | Simple failure injection |
| 3. Repeatable | Documented experiments |
| 4. Automated | CI/CD-integrated chaos |
| 5. Optimized | Continuous chaos in prod |

| Experiment | Tests |
|---|---|
| Instance Kill | Auto-recovery, failover |
| Zone Failure | Multi-AZ resilience |
| Network Latency | Timeout handling |
| Packet Loss | Retry logic |
| Dependency Down | Circuit breakers |

Other failure types to inject: CPU stress, memory pressure, disk fill, DNS failure, clock skew.
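As one example, the "Dependency Down" experiment can be sketched against an in-memory fake. Everything here is an assumption for illustration: `FakeDependency`, `CircuitBreaker`, and `dependency_down_experiment` are hypothetical names, and the breaker is deliberately simplified (consecutive-failure counter only, no half-open state or reset timer).

```python
class CircuitOpenError(RuntimeError):
    """Raised when the breaker refuses to call the dependency."""


class CircuitBreaker:
    """Simplified breaker: opens after `max_failures` consecutive failures."""

    def __init__(self, max_failures: int = 3):
        self.max_failures = max_failures
        self.failures = 0

    def call(self, func, *args, **kwargs):
        if self.failures >= self.max_failures:
            raise CircuitOpenError("circuit open; failing fast")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            raise
        self.failures = 0            # a success resets the counter
        return result


class FakeDependency:
    """Stand-in for a downstream service; `down` is the injected fault."""

    def __init__(self):
        self.down = False
        self.calls = 0

    def fetch(self) -> str:
        self.calls += 1
        if self.down:
            raise ConnectionError("injected failure: dependency down")
        return "ok"


def dependency_down_experiment(attempts: int = 5, max_failures: int = 3):
    """Return (calls that reached the dependency, calls rejected fast)."""
    dep = FakeDependency()
    breaker = CircuitBreaker(max_failures)
    dep.down = True                  # inject the failure
    fast_failures = 0
    for _ in range(attempts):
        try:
            breaker.call(dep.fetch)
        except CircuitOpenError:
            fast_failures += 1       # breaker protected the dependency
        except ConnectionError:
            pass                     # failure reached the caller
    return dep.calls, fast_failures


print(dependency_down_experiment())  # (3, 2)
```

The hypothesis under test is that once the breaker opens, further requests stop hitting the failed dependency: with five attempts and a threshold of three, only three calls reach it and two fail fast.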

- Define clear stop criteria before starting.
- Limit scope; start with about 1% of traffic.
- Have an instant rollback ready before injecting any failure.
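The first two safety rules can be expressed in code. This is a sketch under stated assumptions: `in_blast_radius` and `StopCriteria` are hypothetical helpers, the hash-bucket approach is one common way to pick a stable slice of traffic, and the thresholds in the demo are placeholders rather than recommendations.

```python
import hashlib
from dataclasses import dataclass


def in_blast_radius(request_id: str, percent: float = 1.0) -> bool:
    """Deterministically select roughly `percent`% of requests for the experiment.

    Hashing the request ID keeps the decision stable for a given ID
    across retries, processes, and hosts.
    """
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 10_000   # 0..9999
    return bucket < percent * 100


@dataclass(frozen=True)
class StopCriteria:
    """Abort thresholds agreed on before the experiment starts."""
    max_error_rate: float
    max_p99_latency_ms: float

    def should_abort(self, error_rate: float, p99_latency_ms: float) -> bool:
        return (
            error_rate > self.max_error_rate
            or p99_latency_ms > self.max_p99_latency_ms
        )


# Placeholder thresholds for illustration only.
criteria = StopCriteria(max_error_rate=0.05, max_p99_latency_ms=500.0)
print(criteria.should_abort(error_rate=0.10, p99_latency_ms=120.0))  # True
print(criteria.should_abort(error_rate=0.01, p99_latency_ms=120.0))  # False
```

In practice the check would run on a short interval during injection, and a `True` result would trigger the prepared rollback immediately.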

| Elapsed (h:mm) | Activity |
|---|---|
| 0:00 | Brief team, review hypothesis |
| 0:15 | Start observability baseline |
| 0:30 | Inject failure |
| 1:00 | Observe, document behaviors |
| 1:30 | Stop injection, verify recovery |
| 2:00 | Debrief, document findings |

| Role | Responsibility |
|---|---|
| Facilitator | Run experiment, track time |
| Observer | Monitor dashboards |
| Scribe | Document findings |
| Safety Officer | Call the abort when stop criteria are met |

Fail Safely
Better to find weaknesses before your customers do.