Runbook Quick Reference

Templates, Decision Trees, and MTTR Targets

Incident Management | Technical Operations Excellence

Runbook Templates: 10
Triage Target: < 5 min
MTTR Target: < 1 hr
Runbook Coverage: 80%
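
As a rough illustration of how these headline numbers might be tracked, the sketch below computes MTTR and runbook coverage from a list of incident records. The `Incident` record, its field names, and the definition of coverage used here (share of incidents that had a matching runbook) are assumptions made for the example; this reference does not define them.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    # Hypothetical record shape; the field names are illustrative only.
    detected_at: datetime
    resolved_at: datetime
    had_runbook: bool


def mttr(incidents: list[Incident]) -> timedelta:
    """Mean time from detection to resolution across all incidents."""
    if not incidents:
        raise ValueError("need at least one incident")
    total = sum((i.resolved_at - i.detected_at for i in incidents), timedelta())
    return total / len(incidents)


def runbook_coverage(incidents: list[Incident]) -> float:
    """Fraction of incidents that had a matching runbook (assumed definition)."""
    if not incidents:
        raise ValueError("need at least one incident")
    return sum(i.had_runbook for i in incidents) / len(incidents)
```

Comparing `mttr(...)` against `timedelta(hours=1)` and `runbook_coverage(...)` against `0.8` would give a simple pass/fail view of the targets listed above.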

10 Essential Runbook Types

#    Runbook                MTTR Target
1    Service Restart        5 min
2    Deployment Rollback    10 min
3    Database Failover      15 min
4    Cache Clear            5 min
5    Traffic Shift          10 min
6    Scale Out              5 min
7    Certificate Rotation   15 min
8    DNS Update             10 min
9    Feature Flag Toggle    2 min
10   Emergency Access       5 min
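
One way to make these targets actionable is to keep them in a lookup table and flag any run that overshoots. The following is a minimal sketch; the names `MTTR_TARGET_MIN` and `over_target` are hypothetical, and only the minute values come from the table above.

```python
from datetime import timedelta

# MTTR targets in minutes, copied from the table above.
MTTR_TARGET_MIN = {
    "Service Restart": 5,
    "Deployment Rollback": 10,
    "Database Failover": 15,
    "Cache Clear": 5,
    "Traffic Shift": 10,
    "Scale Out": 5,
    "Certificate Rotation": 15,
    "DNS Update": 10,
    "Feature Flag Toggle": 2,
    "Emergency Access": 5,
}


def over_target(runbook: str, elapsed: timedelta) -> bool:
    """Return True when a run took longer than the runbook's MTTR target."""
    target = timedelta(minutes=MTTR_TARGET_MIN[runbook])
    return elapsed > target
```

A run that lands exactly on its target is treated as on time; an unknown runbook name raises `KeyError` rather than silently passing.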

Runbook Structure

Section         Content
Overview        What this runbook addresses
Symptoms        How to recognize the issue
Prerequisites   Required access & tools
Steps           Numbered procedure
Verification    How to confirm success
Rollback        If things go wrong
Escalation      Who to contact next
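
The seven sections can also be captured as a small template type so that incomplete runbooks are easy to spot. This is a sketch under the assumption that each section is free text and steps are an ordered list; the `Runbook` class and `missing_sections` helper are hypothetical names, not part of any existing tool.

```python
from dataclasses import dataclass, field, fields


@dataclass
class Runbook:
    # One attribute per section in the structure table above.
    overview: str = ""
    symptoms: str = ""
    prerequisites: str = ""
    steps: list[str] = field(default_factory=list)
    verification: str = ""
    rollback: str = ""
    escalation: str = ""


def missing_sections(runbook: Runbook) -> list[str]:
    """Names of sections that are still empty, in template order."""
    return [f.name for f in fields(runbook) if not getattr(runbook, f.name)]
```

A check like this could run in review or CI so that a runbook without, say, a rollback section is caught before it is needed in an incident.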

Decision Tree: High Latency

  • Check: Is it a single service or all?
    • Single → Check that service's resources
    • All → Check shared dependencies (DB, cache)
  • Check: Recent deployment?
    • Yes → Consider rollback
    • No → Check traffic levels
  • Check: Resource exhaustion?
    • Yes → Scale or restart
    • No → Check network, dependencies
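
The three checks above can be sketched as a small function that maps yes/no answers to the suggested next step for each check. The function name and boolean parameters are assumptions for illustration; the returned strings are taken from the tree itself.

```python
def high_latency_next_steps(
    single_service: bool,
    recent_deployment: bool,
    resource_exhaustion: bool,
) -> list[str]:
    """Return one suggested next step per check, in the order of the tree."""
    steps = []
    # Check 1: is it a single service or all of them?
    steps.append(
        "Check that service's resources"
        if single_service
        else "Check shared dependencies (DB, cache)"
    )
    # Check 2: was there a recent deployment?
    steps.append("Consider rollback" if recent_deployment else "Check traffic levels")
    # Check 3: are resources exhausted?
    steps.append(
        "Scale or restart" if resource_exhaustion else "Check network, dependencies"
    )
    return steps
```

For example, latency across all services, with no recent deployment and no resource exhaustion, points at shared dependencies, traffic levels, and the network, in that order.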

Verification Checklist

Check             How
Service healthy   Health endpoint returns 200
Metrics normal    Grafana dashboards green
Errors stopped    Error rate below threshold
Latency normal    p99 within SLO
Logs clean        No error spikes in logs
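
The checklist lends itself to a single pass/fail evaluation over a snapshot of post-incident signals. The sketch below assumes the signals have already been collected; the `Snapshot` fields and the threshold parameters are hypothetical, since this reference does not state concrete threshold or SLO values.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    # Hypothetical post-recovery readings gathered by the responder.
    health_status: int       # HTTP status from the health endpoint
    dashboards_green: bool   # Grafana dashboards show no alerts
    error_rate: float        # errors / requests over the check window
    p99_ms: float            # observed p99 latency in milliseconds
    log_error_spike: bool    # True if logs still show an error spike


def verify(s: Snapshot, error_rate_threshold: float, p99_slo_ms: float) -> dict[str, bool]:
    """Evaluate each checklist item; True means the check passed."""
    return {
        "Service healthy": s.health_status == 200,
        "Metrics normal": s.dashboards_green,
        "Errors stopped": s.error_rate < error_rate_threshold,
        "Latency normal": s.p99_ms <= p99_slo_ms,
        "Logs clean": not s.log_error_spike,
    }
```

Recovery would be declared only when `all(verify(...).values())` is true; any failing key names the check to revisit.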

Decision Tree: Error Spike

  • Check: Error type?
    • 5xx → Server-side issue
    • 4xx → Client or config issue
  • Check: Pattern?
    • Sudden spike → Deployment or config
    • Gradual → Resource exhaustion
  • Check: Scope?
    • One endpoint → Check that handler
    • All endpoints → Check infrastructure
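
The same tree can be expressed as a function that classifies a spike by error type, pattern, and scope. This is a sketch only: the function name and parameters are hypothetical, and the mapping of "5xx" and "4xx" to the 500-599 and 400-499 status ranges follows standard HTTP status classes.

```python
def error_spike_hints(status_code: int, sudden: bool, single_endpoint: bool) -> list[str]:
    """Return one hint per check: error type, pattern, then scope."""
    # Check 1: error type, by HTTP status class.
    if 500 <= status_code <= 599:
        kind = "Server-side issue"
    elif 400 <= status_code <= 499:
        kind = "Client or config issue"
    else:
        raise ValueError(f"not an error status: {status_code}")
    # Check 2: pattern of the spike over time.
    pattern = "Deployment or config" if sudden else "Resource exhaustion"
    # Check 3: scope of the affected endpoints.
    scope = "Check that handler" if single_endpoint else "Check infrastructure"
    return [kind, pattern, scope]
```

For instance, a sudden burst of 503s across all endpoints yields a server-side issue, likely tied to a deployment or config change, with infrastructure as the place to look.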

Runbook Quality Criteria

Testable

Can be verified in staging/DR drills

Automatable

Steps are scriptable for future automation

Measurable

Includes timing targets and success criteria

Quick Commands

Action        Example
Pod restart   kubectl rollout restart deployment/<name>
Rollback      kubectl rollout undo deployment/<name>
Scale         kubectl scale deployment/<name> --replicas=<count>
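
When these commands are wrapped in scripts, building the argument list separately from running it keeps the runbook step testable. The helpers below are a minimal sketch: the function names and the `default` namespace are assumptions, and the target is assumed to be a Deployment. Nothing is executed here; the functions only return the argv list.

```python
def rollout_restart(deployment: str, namespace: str = "default") -> list[str]:
    """argv for restarting all pods of a Deployment."""
    return ["kubectl", "rollout", "restart", f"deployment/{deployment}", "-n", namespace]


def rollout_undo(deployment: str, namespace: str = "default") -> list[str]:
    """argv for rolling a Deployment back to its previous revision."""
    return ["kubectl", "rollout", "undo", f"deployment/{deployment}", "-n", namespace]


def scale(deployment: str, replicas: int, namespace: str = "default") -> list[str]:
    """argv for setting a Deployment's replica count."""
    if replicas < 0:
        raise ValueError("replicas must be non-negative")
    return [
        "kubectl", "scale", f"deployment/{deployment}",
        f"--replicas={replicas}", "-n", namespace,
    ]
```

The returned list can be passed to `subprocess.run(argv, check=True)` once it has been reviewed, which avoids shell quoting issues and makes each step easy to log.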

Document to Automate

Today's runbook is tomorrow's automation.