Multi-Window Alerting | Bot Army SRE

Critical Burn Rate: 14.4x
Alert Attributes: 4
Pages/Week Target: <2
False Positives: <5%

Multi-Window Burn Rate Alerts

Type	Budget consumed	Long + short window	Burn rate
Page (Critical)	2% in 1h	1h + 5m	14.4x
Page (High)	5% in 6h	6h + 30m	6x
Ticket	10% in 3d	3d + 6h	1x

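Both windows must exceed the burn-rate threshold before an alert fires. The long window confirms that a meaningful share of the budget is gone; the short window confirms the burn is still happening. Together they catch fast burns without flapping and let the alert clear soon after recovery.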
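
The dual-window rule can be sketched as a small check. This is a minimal illustration, assuming the error ratios for each window are computed elsewhere; `BurnRateRule`, `RULES`, and `should_fire` are hypothetical names, not part of any monitoring product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BurnRateRule:
    name: str
    long_window: str   # label only, e.g. "1h"
    short_window: str  # label only, e.g. "5m"
    burn_rate: float   # threshold multiplier


# Thresholds taken from the table above.
RULES = [
    BurnRateRule("page_critical", "1h", "5m", 14.4),
    BurnRateRule("page_high", "6h", "30m", 6.0),
    BurnRateRule("ticket", "3d", "6h", 1.0),
]


def should_fire(rule, slo, long_error_ratio, short_error_ratio):
    """Fire only when BOTH windows exceed burn_rate * error budget."""
    threshold = rule.burn_rate * (1.0 - slo)
    return long_error_ratio > threshold and short_error_ratio > threshold
```

With a 99.9% SLO the critical threshold is roughly a 1.44% error ratio. `should_fire(RULES[0], 0.999, 0.02, 0.02)` fires, while `should_fire(RULES[0], 0.999, 0.02, 0.001)` does not, because the short window shows the burn has already stopped.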

Four Alert Attributes

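Attribute	Definition	Goal
Precision	% of alerts that reflect a real incident	Minimize false positives
Recall	% of real incidents that trigger an alert	Catch all issues
Detection time	Time from incident start to notification	Alert quickly
Reset time	Time the alert keeps firing after recovery	Clear quickly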
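
The four attributes can be measured from alert history. A minimal sketch, assuming counts and timestamps (in seconds) come from a post-incident review; all function names here are illustrative.

```python
def precision(true_alerts, false_alerts):
    """Share of fired alerts that matched a real incident (None if no alerts)."""
    total = true_alerts + false_alerts
    return true_alerts / total if total else None


def recall(detected_incidents, missed_incidents):
    """Share of real incidents that produced an alert (None if no incidents)."""
    total = detected_incidents + missed_incidents
    return detected_incidents / total if total else None


def detection_seconds(incident_started, alert_notified):
    """Delay between the start of the incident and the notification."""
    return alert_notified - incident_started


def reset_seconds(incident_resolved, alert_cleared):
    """How long the alert kept firing after the incident was resolved."""
    return alert_cleared - incident_resolved
```

For example, 19 genuine alerts and 1 false alarm give `precision(19, 1)` of 0.95, which sits right at the edge of a <5% false-positive target.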

Burn Rate Formula

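Burn Rate = Observed Error Rate / (1 - SLO)
Alert Threshold = (Budget Fraction Consumed × SLO Period) / Alert Window
Example: 2% of a 30-day budget in 1h = 0.02 × 720h / 1h = 14.4x
14.4x sustained = entire 30-day budget consumed in ~2 days (30 / 14.4 ≈ 2.1)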
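
A worked check of the 14.4x figure. This sketch assumes a 30-day SLO period and the common definition of burn rate as the observed error ratio divided by the error budget (1 - SLO); the function names are illustrative.

```python
SLO_PERIOD_HOURS = 30 * 24  # assumption: 30-day SLO period = 720 hours


def burn_rate_from_errors(error_ratio, slo):
    """How many times faster than 'exactly on budget' errors are arriving."""
    return error_ratio / (1.0 - slo)


def burn_rate_for_budget(budget_fraction, window_hours,
                         period_hours=SLO_PERIOD_HOURS):
    """Burn rate that consumes budget_fraction of the budget in window_hours."""
    return budget_fraction * period_hours / window_hours


def days_to_exhaustion(burn_rate, period_days=30):
    """Days until the whole budget is gone at a sustained burn rate."""
    return period_days / burn_rate
```

`burn_rate_for_budget(0.02, 1)` gives about 14.4, `burn_rate_for_budget(0.05, 6)` gives about 6, and `days_to_exhaustion(14.4)` is just over 2 days.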

Budget consumed per day	Response
>5%/day	Immediate incident
2-5%/day	Investigation needed
<2%/day	Normal ops
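
The bands above translate directly into a lookup. A minimal sketch; `classify_daily_burn` is a hypothetical helper, and treating exactly 2% and exactly 5% as "Investigation needed" is an assumption about the boundaries.

```python
def classify_daily_burn(budget_pct_per_day):
    """Map percent of error budget consumed per day to a response band."""
    if budget_pct_per_day > 5:
        return "Immediate incident"
    if budget_pct_per_day >= 2:
        return "Investigation needed"
    return "Normal ops"
```

For reference, a steady 1x burn over a 30-day period consumes about 3.3% of the budget per day, which lands in the investigation band.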

Alert Categories

Category	Response	When
Page	Immediate	User impact
Notify	Hours	Degradation
Ticket	Next day	Slow drift
Log	Review	Informational

Symptom vs Cause Hierarchy

User-Facing Symptoms

Error rate, latency, availability - PAGE these

System Symptoms

Queue depth, connection pool - NOTIFY these

Underlying Causes

CPU, memory, disk - TICKET or LOG these
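
The hierarchy can be expressed as a routing table from signal to alert category. A sketch only: the signal keys, the `Category` enum, and the choice to send causes to TICKET and unknown signals to LOG are assumptions for illustration.

```python
from enum import Enum


class Category(Enum):
    PAGE = "page"
    NOTIFY = "notify"
    TICKET = "ticket"
    LOG = "log"


SIGNAL_ROUTES = {
    # User-facing symptoms: page a human.
    "error_rate": Category.PAGE,
    "latency": Category.PAGE,
    "availability": Category.PAGE,
    # System symptoms: notify, no page.
    "queue_depth": Category.NOTIFY,
    "connection_pool": Category.NOTIFY,
    # Underlying causes: ticket (or log, depending on team policy).
    "cpu": Category.TICKET,
    "memory": Category.TICKET,
    "disk": Category.TICKET,
}


def route(signal):
    """Return the category for a signal; unknown signals are only logged."""
    return SIGNAL_ROUTES.get(signal, Category.LOG)
```

Keeping the mapping in one table makes it easy to review which signals are allowed to wake someone up.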

Healthy On-Call Metrics

Metric	Target
Pages per week	<2
False positive rate	<5%
Off-hours pages	<1
Actionable %	>95%
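
These targets can be checked automatically against on-call statistics. A minimal sketch, assuming rates are given as fractions (0.03 for 3%) and that all counts cover the same reporting period; `TARGETS` and `failing_targets` are hypothetical names.

```python
# Each check returns True when the metric meets its target from the table.
TARGETS = {
    "pages_per_week": lambda v: v < 2,
    "false_positive_rate": lambda v: v < 0.05,
    "off_hours_pages": lambda v: v < 1,
    "actionable_rate": lambda v: v > 0.95,
}


def failing_targets(metrics):
    """Return the sorted names of metrics that miss their target."""
    return sorted(name for name, ok in TARGETS.items()
                  if not ok(metrics[name]))
```

An empty result means the rotation is within all four targets; anything else is a list of metrics to bring to the next alert review.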

Noise Reduction Techniques

Aggregation: Group related alerts
Suppression: Mute during maintenance
Deduplication: Same incident once
Auto-escalation: After X minutes
Alert correlation: Link to root cause
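
Three of these techniques (suppression, deduplication, aggregation) fit in one short pass over incoming alerts. A sketch under stated assumptions: alerts are plain dicts with hypothetical `service` and `name` keys, and the maintenance list is supplied by the caller.

```python
from collections import defaultdict


def reduce_noise(alerts, maintenance_services=frozenset()):
    """Suppress, deduplicate, and group alerts by service."""
    groups = defaultdict(list)
    seen = set()
    for alert in alerts:
        # Suppression: skip services currently under maintenance.
        if alert["service"] in maintenance_services:
            continue
        # Deduplication: report the same service/name pair only once.
        fingerprint = (alert["service"], alert["name"])
        if fingerprint in seen:
            continue
        seen.add(fingerprint)
        # Aggregation: one group of alert names per service.
        groups[alert["service"]].append(alert["name"])
    return dict(groups)
```

Five raw alerts across three services, one of them in maintenance and one a duplicate, collapse into two grouped notifications.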

Golden Rules

Actionable: Can I do something?
Urgent: Does it need attention now?
Real: Is this actually happening?
Human judgment: Does this need a person?

Alert on Symptoms

Page for user pain, ticket for slow burns.