Reducing Noise, Improving Signal
Observability | Technical Operations Excellence
| Quality | Criteria |
|---|---|
| Actionable | Requires immediate human action |
| Symptom-based | Alerts on user impact, not causes |
| Timely | Detects issues within SLO window |
| Prioritized | Clear severity levels |
| Documented | Linked to runbooks |
| Type | Example | Page? |
|---|---|---|
| Symptom | Error rate >1% | Yes |
| Symptom | Latency p99 >500ms | Yes |
| Cause | CPU >80% | Notify only |
| Cause | Disk >90% | Ticket |
Page on symptoms; route causes to notifications or tickets for investigation.
| Window | Burn Rate | Action |
|---|---|---|
| 1 hour | 14.4x | Page immediately |
| 6 hours | 6x | Page |
| 24 hours | 3x | Ticket |
| 72 hours | 1x | Review weekly |
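Burn rate = (1 - SLI) / (1 - SLO target), i.e. the observed error rate divided by the error budget. A burn rate of 1x spends the budget exactly over the SLO period.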
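As a rough illustration of the burn rate formula above, the following Python sketch computes a burn rate and checks it against the thresholds in the table. The function names, the `POLICY` list, and the 30-day SLO period are assumptions made for this example, not part of the original material.

```python
def burn_rate(sli: float, slo_target: float) -> float:
    """Return how many times faster than allowed the error budget is burning."""
    error_budget = 1.0 - slo_target
    if error_budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return (1.0 - sli) / error_budget


def budget_consumed(rate: float, window_hours: float,
                    slo_period_hours: float = 30 * 24) -> float:
    """Fraction of the error budget used if `rate` holds for the whole window.

    Assumption: the SLO is evaluated over a 30-day period (720 hours).
    """
    return rate * window_hours / slo_period_hours


# Thresholds copied from the table above: (window in hours, burn rate, action).
POLICY = [
    (1, 14.4, "page immediately"),
    (6, 6.0, "page"),
    (24, 3.0, "ticket"),
    (72, 1.0, "review weekly"),
]


def triggered_actions(observed: dict) -> list:
    """Return the actions whose window's observed burn rate meets its threshold.

    `observed` maps a window length in hours to the burn rate measured over it.
    """
    return [
        action
        for window, threshold, action in POLICY
        if observed.get(window, 0.0) >= threshold
    ]
```

Under the assumed 30-day period, a 14.4x burn sustained for one hour consumes about 2% of the error budget (14.4 × 1 / 720), which is why that row pages immediately.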
Group alerts by service or component.
Set thresholds based on historical data and SLOs.
Require sustained violations before an alert fires.
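Two of these noise-reduction ideas, grouping and sustained violations, can be sketched in a few lines of Python. The helper names and the alert dictionary shape (`service`, `name`) are hypothetical choices for this example. Real alerting systems typically express the same ideas as configuration rather than code.

```python
from collections import defaultdict


def sustained_violation(samples, threshold, min_consecutive):
    """Return True only if the latest `min_consecutive` samples all exceed `threshold`.

    A single spike does not fire; the violation has to persist.
    """
    if min_consecutive <= 0:
        raise ValueError("min_consecutive must be positive")
    if len(samples) < min_consecutive:
        return False
    return all(value > threshold for value in samples[-min_consecutive:])


def group_by_service(alerts):
    """Collapse individual alerts into one entry per service.

    Each alert is assumed to be a dict with "service" and "name" keys.
    """
    grouped = defaultdict(list)
    for alert in alerts:
        grouped[alert["service"]].append(alert["name"])
    return dict(grouped)
```

With these helpers, a brief spike in an otherwise healthy series does not fire, and several alerts from the same service arrive as a single grouped notification instead of separate pages.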
| Severity | Response | Example |
|---|---|---|
| P1 | Page, escalate | Service down |
| P2 | Page, working hours | Degraded service |
| P3 | Ticket, next day | Non-critical issue |
| P4 | Ticket, backlog | Improvement |
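The severity table can be read as a small routing function. The sketch below is one possible encoding in Python. The `Severity` enum and `route` function are hypothetical names, and the handling of a P2 outside working hours (held as a next-day ticket) is an assumption, since the table does not specify it.

```python
from enum import Enum


class Severity(Enum):
    P1 = "P1"  # Service down
    P2 = "P2"  # Degraded service
    P3 = "P3"  # Non-critical issue
    P4 = "P4"  # Improvement


def route(severity: Severity, working_hours: bool) -> str:
    """Map a severity level to the response listed in the table above."""
    if severity is Severity.P1:
        return "page, escalate"
    if severity is Severity.P2:
        # Assumption: outside working hours a P2 waits as a next-day ticket.
        return "page" if working_hours else "ticket, next day"
    if severity is Severity.P3:
        return "ticket, next day"
    return "ticket, backlog"
```

Keeping the mapping in one place makes it easier to audit during the periodic alert reviews that every alert has exactly one defined response.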
| Activity | Frequency |
|---|---|
| Alert review | Weekly |
| Threshold tuning | Monthly |
| Alert inventory | Quarterly |
| Delete unused alerts | Quarterly |
Every alert should either require immediate action or be deleted.
Signal Over Noise
The best alert is one that never fires unnecessarily.