Alert Tuning Playbook

Reducing Noise, Improving Signal

Observability | Technical Operations Excellence

Pages per Shift: <2
False Positive Rate: <5%
Actionable Target: 80%
MTTD Goal: 30s

Alert Quality Framework

Quality         Criteria
Actionable      Requires immediate human action
Symptom-based   Alerts on user impact, not causes
Timely          Detects issues within SLO window
Prioritized     Clear severity levels
Documented      Linked to runbooks

Symptom vs Cause Alerts

Type      Example              Page?
Symptom   Error rate >1%       Yes
Symptom   Latency p99 >500ms   Yes
Cause     CPU >80%             Notify only
Cause     Disk >90%            Ticket

Page on symptoms; route causes to notifications or tickets for later investigation.

Burn Rate Alerting

Window     Burn Rate   Action
1 hour     14.4x       Page immediately
6 hours    6x          Page
24 hours   3x          Ticket
72 hours   1x          Review weekly

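Burn rate = (1 - SLI) / (1 - SLO target), i.e. the observed error rate divided by the error rate the SLO allows. A burn rate of 1x spends the error budget exactly over the SLO period.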
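The formula and the window table can be combined into a small check. The following is a minimal sketch, not a definitive implementation: the names burn_rate, POLICY, and actions are hypothetical, and treating a threshold as met with ">=" is an assumption the playbook does not state.

```python
from typing import Dict, List

def burn_rate(sli: float, slo_target: float) -> float:
    """Observed error fraction divided by the error fraction the SLO allows."""
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("SLO target must be below 1.0")
    return (1.0 - sli) / budget

# (window in hours, burn-rate threshold, action), copied from the table above.
POLICY = [
    (1, 14.4, "Page immediately"),
    (6, 6.0, "Page"),
    (24, 3.0, "Ticket"),
    (72, 1.0, "Review weekly"),
]

def actions(sli_by_window: Dict[int, float], slo_target: float) -> List[str]:
    """Return the action for every window whose burn rate meets its threshold.

    sli_by_window maps a window length in hours to the SLI measured over it.
    Using ">=" as the comparison is an assumption.
    """
    triggered = []
    for hours, threshold, action in POLICY:
        sli = sli_by_window.get(hours)
        if sli is not None and burn_rate(sli, slo_target) >= threshold:
            triggered.append(action)
    return triggered
```

For example, with a 99.9% SLO and a 1-hour SLI of 98%, the burn rate is 0.02 / 0.001 = 20x, which exceeds the 14.4x threshold, so actions({1: 0.98}, 0.999) returns the "Page immediately" action.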

Alert Fatigue Indicators

  • >2 pages per on-call shift
  • >5% false positive rate
  • Same alert firing repeatedly
  • Engineers ignoring alerts
  • No runbook links
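Most of these indicators can be checked against a page log. The sketch below assumes a hypothetical Page record in which the responder marks whether each page was actionable. The repeat_limit of 3 is an illustrative choice, not a value from this playbook. "Engineers ignoring alerts" is left out because it needs acknowledgement data that this sketch does not model.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List

@dataclass
class Page:
    alert_name: str
    actionable: bool   # False means the responder judged it a false positive
    has_runbook: bool

def fatigue_indicators(pages: List[Page], shifts: int, repeat_limit: int = 3) -> List[str]:
    """Return a description of each fatigue indicator found in the page log."""
    if shifts <= 0:
        raise ValueError("shifts must be positive")
    findings = []
    if len(pages) / shifts > 2:
        findings.append("more than 2 pages per on-call shift")
    if pages:
        false_positive_rate = sum(not p.actionable for p in pages) / len(pages)
        if false_positive_rate > 0.05:
            findings.append("false positive rate above 5%")
    counts = Counter(p.alert_name for p in pages)
    repeated = sorted(name for name, n in counts.items() if n >= repeat_limit)
    if repeated:
        findings.append("repeated alerts: " + ", ".join(repeated))
    missing = sorted({p.alert_name for p in pages if not p.has_runbook})
    if missing:
        findings.append("no runbook link: " + ", ".join(missing))
    return findings
```

An empty result means none of the measured indicators were found in the period reviewed.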

Noise Reduction Strategies

Aggregate Related Alerts

Group alerts by service or component
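As a rough illustration, grouping can collapse many firing alerts into one notification per service. The alert dictionaries and their "service" and "name" keys are assumptions made for this sketch, not a real alerting tool's schema.

```python
from collections import defaultdict
from typing import Dict, List

def group_by_service(alerts: List[dict]) -> Dict[str, List[str]]:
    """Collect alert names under the service that emitted them."""
    groups = defaultdict(list)
    for alert in alerts:
        # Alerts without a service label fall into a catch-all group.
        groups[alert.get("service", "unknown")].append(alert["name"])
    return dict(groups)

def summarize(alerts: List[dict]) -> List[str]:
    """Produce one summary line per service instead of one line per alert."""
    lines = []
    for service, names in sorted(group_by_service(alerts).items()):
        unique = ", ".join(sorted(set(names)))
        lines.append(f"{service}: {len(names)} alert(s) firing ({unique})")
    return lines
```

With this approach, three alerts from the same service reach the responder as a single line rather than three separate notifications.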

Adjust Thresholds

Based on historical data and SLOs
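One possible way to derive a threshold from history is to take a high percentile of past values and cap it at the SLO limit, so the alert is never looser than the objective. This is a sketch under assumptions: suggest_threshold and its parameters are hypothetical, and the choice of the 99th percentile is illustrative.

```python
import statistics
from typing import List, Optional

def suggest_threshold(history: List[float], percentile: int = 99,
                      slo_limit: Optional[float] = None) -> float:
    """Suggest an alert threshold from historical samples of a metric."""
    if len(history) < 2:
        raise ValueError("need at least two historical samples")
    if not 1 <= percentile <= 99:
        raise ValueError("percentile must be between 1 and 99")
    # quantiles(n=100) returns 99 cut points; index p - 1 is the p-th percentile.
    cut_points = statistics.quantiles(history, n=100, method="inclusive")
    candidate = cut_points[percentile - 1]
    # Never suggest a threshold looser than the SLO itself (assumed policy).
    if slo_limit is not None:
        candidate = min(candidate, slo_limit)
    return candidate
```

A threshold near the historical p99 fires only on values the service rarely reaches in normal operation, which keeps routine fluctuation from paging anyone.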

Add Hysteresis

Require sustained violations to fire
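One simple way to require sustained violations is to count consecutive samples: the alert fires only after several violating samples in a row and clears only after several healthy ones. The class name SustainedAlert and the parameters fire_after and clear_after are hypothetical, and the default counts are illustrative.

```python
class SustainedAlert:
    """Fire after N consecutive violations; clear after M consecutive healthy samples."""

    def __init__(self, threshold: float, fire_after: int = 3, clear_after: int = 3):
        self.threshold = threshold
        self.fire_after = fire_after
        self.clear_after = clear_after
        self.firing = False
        self._streak = 0  # consecutive samples that disagree with the current state

    def observe(self, value: float) -> bool:
        """Record one sample and return whether the alert is firing."""
        violating = value > self.threshold
        if violating != self.firing:
            self._streak += 1
        else:
            self._streak = 0  # a sample agreeing with the current state resets the count
        needed = self.clear_after if self.firing else self.fire_after
        if self._streak >= needed:
            self.firing = not self.firing
            self._streak = 0
        return self.firing


alert = SustainedAlert(threshold=0.01, fire_after=3, clear_after=2)
samples = [0.02, 0.005, 0.02, 0.02, 0.02, 0.005, 0.005]
states = [alert.observe(s) for s in samples]
# states == [False, False, False, False, True, True, False]
```

The single spike at the start never fires the alert, and a single healthy sample does not clear it. Both effects reduce flapping.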

Severity Levels

Severity   Response              Example
P1         Page, escalate        Service down
P2         Page, working hours   Degraded service
P3         Ticket, next day      Non-critical issue
P4         Ticket, backlog       Improvement

Alert Review Cadence

Activity           Frequency
Alert review       Weekly
Threshold tuning   Monthly
Alert inventory    Quarterly
Delete unused      Quarterly

Golden Rule

Every alert should either require immediate action or be deleted.

Signal Over Noise

The best alert is one that never fires unnecessarily.