Multi-Window Alerting

Burn Rates, SLO-Based Alerts & Alert Attributes

Observability Deep Dive | Technical Operations Excellence

14.4x  Critical Burn Rate
4      Alert Attributes
<2     Pages/Week Target
<5%    False Positives

Multi-Window Burn Rate Alerts

Type             Budget consumed   Windows (long + short)   Burn rate
Page (Critical)  2% in 1h          1h + 5m                  14.4x
Page (High)      5% in 6h          6h + 30m                 6x
Ticket           10% in 3d         72h + 6h                 1x

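An alert fires only when both its long and its short window exceed the burn rate. The long window confirms a significant burn; the short window confirms it is still happening, so the alert clears soon after recovery instead of flapping, while fast burns are still caught.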
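The dual-window check can be sketched in a few lines of Python. This is a minimal illustration, not a production rule engine: the names BurnRateRule, RULES, and firing_rules are hypothetical, and the burn rates per window are assumed to be computed elsewhere (for example by a metrics backend).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BurnRateRule:
    name: str
    long_window: str   # window that confirms a significant burn
    short_window: str  # window that confirms the burn is still ongoing
    threshold: float   # burn-rate multiple that must be exceeded


# Thresholds and windows taken from the table above.
RULES = [
    BurnRateRule("page_critical", "1h", "5m", 14.4),
    BurnRateRule("page_high", "6h", "30m", 6.0),
    BurnRateRule("ticket", "72h", "6h", 1.0),
]


def firing_rules(burn_rates: dict[str, float]) -> list[str]:
    """Return the rules whose long AND short windows both exceed the threshold.

    burn_rates maps a window label (e.g. "1h") to the measured burn rate.
    """
    fired = []
    for rule in RULES:
        long_burn = burn_rates[rule.long_window]
        short_burn = burn_rates[rule.short_window]
        if long_burn > rule.threshold and short_burn > rule.threshold:
            fired.append(rule.name)
    return fired
```

If the 1h window still shows a 20x burn but the 5m window has dropped back to 0, no page fires: the short window is what lets the alert clear quickly after recovery.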

Four Alert Attributes

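Attribute   Definition                                            Goal
Precision   % of alerts that reflect a genuine incident           Minimize false positives
Recall      % of genuine incidents that trigger an alert          Catch all issues
Detection   Time from incident start to notification              Alert quickly
Reset       Time the alert keeps firing after the issue is fixed  Clear quickly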
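As a rough sketch, the four attributes can be computed from alert and incident records. The function names and inputs below are hypothetical; how an alert is judged "genuine" is assumed to come from post-incident review.

```python
from datetime import datetime, timedelta


def precision(genuine_alerts: int, total_alerts: int) -> float:
    """Fraction of alerts that corresponded to a genuine incident."""
    # Undefined with no alerts; reported as 0.0 here by assumption.
    return genuine_alerts / total_alerts if total_alerts else 0.0


def recall(detected_incidents: int, total_incidents: int) -> float:
    """Fraction of genuine incidents that triggered an alert."""
    # Undefined with no incidents; reported as 0.0 here by assumption.
    return detected_incidents / total_incidents if total_incidents else 0.0


def detection_time(started_at: datetime, notified_at: datetime) -> timedelta:
    """How long the incident ran before someone was notified."""
    return notified_at - started_at


def reset_time(resolved_at: datetime, cleared_at: datetime) -> timedelta:
    """How long the alert kept firing after the issue was resolved."""
    return cleared_at - resolved_at
```

For example, 19 genuine alerts out of 20 gives a precision of 0.95, which corresponds to a 5% false positive rate.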

Burn Rate Formula

Burn Rate = Observed Error Rate / (1 - SLO)
1x = budget lasts exactly the SLO period
14.4x = consume 30-day budget in ~2 days (30 / 14.4 ≈ 2.1)
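The arithmetic can be checked with a short sketch. It assumes the common definition of burn rate as the observed error rate divided by the error budget (1 - SLO) and a 30-day SLO period; the function names are illustrative.

```python
def burn_rate(error_rate: float, slo: float) -> float:
    """Observed error rate as a multiple of the error budget (1 - SLO)."""
    return error_rate / (1.0 - slo)


def days_to_exhaustion(burn: float, period_days: float = 30.0) -> float:
    """Days until the whole budget is gone if this burn rate is sustained."""
    return period_days / burn


def budget_consumed(burn: float, window_hours: float, period_days: float = 30.0) -> float:
    """Fraction of the period's budget consumed over the window at this burn rate."""
    return burn * window_hours / (period_days * 24.0)
```

With these definitions, days_to_exhaustion(14.4) is about 2.08 days, and budget_consumed(14.4, 1) is 0.02, which is the 2% in 1h used for the critical page.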

Budget consumed   Response
>5%/day           Immediate incident
2-5%/day          Investigation needed
<2%/day           Normal ops
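A small sketch of how a burn-rate multiple maps onto these daily bands, assuming a 30-day SLO period and treating the 2% and 5% boundaries as part of the middle band. Both function names are hypothetical.

```python
def daily_budget_percent(burn: float, period_days: float = 30.0) -> float:
    """Percent of the period's error budget consumed per day at this burn rate."""
    return burn / period_days * 100.0


def classify_daily_burn(percent_per_day: float) -> str:
    """Map daily budget consumption onto the three bands in the table above."""
    if percent_per_day > 5.0:
        return "immediate incident"
    if percent_per_day >= 2.0:
        return "investigation needed"
    return "normal ops"
```

A 14.4x burn consumes 48% of the budget per day, so it falls well inside the first band.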

Alert Categories

Category  Response   When
Page      Immediate  User impact
Notify    Hours      Degradation
Ticket    Next day   Slow drift
Log       Review     Informational

Symptom vs Cause Hierarchy

User-Facing Symptoms

Error rate, latency, availability - PAGE these

System Symptoms

Queue depth, connection pool - NOTIFY these

Underlying Causes

CPU, memory, disk - TICKET or LOG these
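One way to encode this hierarchy is a lookup from signal to action. The signal keys and the route function are illustrative names; sending cause-level signals to a ticket and unknown signals to the log is an assumption made for the sketch.

```python
from enum import Enum


class Action(Enum):
    PAGE = "page"
    NOTIFY = "notify"
    TICKET = "ticket"
    LOG = "log"


SIGNAL_ACTIONS = {
    # User-facing symptoms
    "error_rate": Action.PAGE,
    "latency": Action.PAGE,
    "availability": Action.PAGE,
    # System symptoms
    "queue_depth": Action.NOTIFY,
    "connection_pool": Action.NOTIFY,
    # Underlying causes
    "cpu": Action.TICKET,
    "memory": Action.TICKET,
    "disk": Action.TICKET,
}


def route(signal: str) -> Action:
    """Pick the action for a signal; unknown signals are only logged."""
    return SIGNAL_ACTIONS.get(signal, Action.LOG)
```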

Healthy On-Call Metrics

Metric               Target
Pages per week       <2
False positive rate  <5%
Off-hours pages      <1
Actionable %         >95%
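These targets can be checked against paging history with a sketch like the one below. The function name and its inputs are hypothetical, and reading the off-hours target as "per week" is an assumption.

```python
def oncall_violations(
    pages: int,
    false_positives: int,
    off_hours_pages: int,
    actionable_pages: int,
    weeks: float = 1.0,
) -> list[str]:
    """Return the names of the targets that the given period misses."""
    violations = []
    if pages / weeks >= 2:
        violations.append("pages per week")
    # With no pages at all, the rates are treated as trivially healthy.
    fp_rate = false_positives / pages if pages else 0.0
    if fp_rate >= 0.05:
        violations.append("false positive rate")
    if off_hours_pages / weeks >= 1:  # assumed to be a per-week target
        violations.append("off-hours pages")
    actionable = actionable_pages / pages if pages else 1.0
    if actionable <= 0.95:
        violations.append("actionable %")
    return violations
```

Rates computed over a single week of very few pages are coarse, so a longer period (a larger weeks value) gives a steadier picture.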

Noise Reduction Techniques

  • Aggregation: Group related alerts
  • Suppression: Mute during maintenance
  • Deduplication: Same incident once
  • Auto-escalation: Escalate if unacknowledged after X minutes
  • Alert correlation: Link to root cause
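Suppression and deduplication can be sketched together. The alert shape (a dict with "service" and "name" keys) and the reduce_noise function are assumptions for illustration, not the format of any particular alerting tool.

```python
def reduce_noise(
    alerts: list[dict[str, str]],
    muted_services: set[str],
) -> dict[tuple[str, str], int]:
    """Drop alerts for muted services, then collapse duplicates.

    Returns one entry per (service, alert name) with the number of
    raw alerts that were folded into it.
    """
    grouped: dict[tuple[str, str], int] = {}
    for alert in alerts:
        if alert["service"] in muted_services:
            continue  # suppression: service is under maintenance
        key = (alert["service"], alert["name"])
        grouped[key] = grouped.get(key, 0) + 1  # deduplication
    return grouped
```

Each key in the result would become a single notification, so a burst of identical alerts reaches the on-call engineer once.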

Golden Rules

  • Actionable: Can I do something?
  • Urgent: Does it need attention now?
  • Real: Is this actually happening?
  • Human judgment: Does this need a person?

Alert on Symptoms

Page for user pain, ticket for slow burns.