Three Pillars, Observability 2.0 & Modern Instrumentation
Observability | Technical Operations Excellence
Metrics: aggregated counts, gauges, histograms. Tools: Prometheus, InfluxDB
Logs: discrete events, timestamps, stack traces. Tools: Loki, Elasticsearch
Traces: request flow, spans, latency breakdown. Tools: Tempo, Jaeger
Observability is about asking arbitrary questions without shipping new code.
- Charity Majors, Honeycomb
| Classic (1.0) | Modern (2.0) |
|---|---|
| Pre-defined metrics | Arbitrary queries |
| Three separate pillars | Wide structured events |
| Dashboard-driven | Exploration-driven |
| Known unknowns | Unknown unknowns |
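To make "wide structured events" and "arbitrary queries" concrete, here is a minimal Python sketch. It assumes one event per request carrying every field known at the time; the helper names (`build_wide_event`, `query`) and the field names are hypothetical, and the in-memory list stands in for a real event store.

```python
import json
import time


def build_wide_event(service, route, status, duration_ms, **context):
    """Return one wide, structured event describing a single request."""
    event = {
        "timestamp": time.time(),
        "service": service,
        "route": route,
        "status": status,
        "duration_ms": duration_ms,
    }
    # Any extra context (user, build, region, ...) rides along on the same event.
    event.update(context)
    return event


def query(events, predicate, group_by):
    """Count events matching `predicate`, grouped by an arbitrary field."""
    counts = {}
    for event in events:
        if predicate(event):
            key = event.get(group_by)
            counts[key] = counts.get(key, 0) + 1
    return counts


events = [
    build_wide_event("checkout", "/pay", 500, 812.0, user_id="u1", build="v42", region="eu"),
    build_wide_event("checkout", "/pay", 200, 95.0, user_id="u2", build="v41", region="us"),
    build_wide_event("checkout", "/pay", 500, 790.0, user_id="u3", build="v42", region="us"),
]

# A question nobody planned for: which build do the server errors come from?
print(json.dumps(query(events, lambda e: e["status"] >= 500, "build")))
# prints {"v42": 2}
```

The grouping field is chosen at query time, which is the point: no new metric has to be shipped to ask a new question.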
USE method. For every resource: Utilization, Saturation, Errors
| Resource | U | S | E |
|---|---|---|---|
| CPU | %busy | Run queue | Errors |
| Memory | %used | Swap I/O | OOM |
| Disk | %util | Queue | SMART |
| Network | BW | Retrans | Errors |
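The checklist above can be sketched as a small Python check that walks one resource reading through U, S and E. The `UseReading` type, the `use_findings` helper and both thresholds are illustrative assumptions, not recommended values.

```python
from dataclasses import dataclass


@dataclass
class UseReading:
    resource: str
    utilization: float  # fraction of time busy, 0.0 to 1.0
    saturation: float   # backlog measure, e.g. run-queue or I/O queue length
    errors: int         # error count in the sampling window


def use_findings(reading, util_limit=0.8, sat_limit=0.0):
    """Return which USE dimensions look unhealthy for one resource.

    The limits are hypothetical defaults; tune them per resource.
    """
    findings = []
    if reading.errors > 0:
        findings.append("errors")
    if reading.saturation > sat_limit:
        findings.append("saturation")
    if reading.utilization > util_limit:
        findings.append("utilization")
    return findings


cpu = UseReading("cpu", utilization=0.95, saturation=3.0, errors=0)
print(cpu.resource, use_findings(cpu))
# prints cpu ['saturation', 'utilization']
```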
| Sampling type | When decided | Trade-off |
|---|---|---|
| Head | At request start | Cheap, but decides before errors are known |
| Tail | After completion | Sees the outcome, but must buffer whole traces |
| Adaptive | Dynamic rate | Tracks traffic volume, more to tune |
Always sample 100% of errors and slow requests
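A minimal Python sketch of the head and tail decisions, assuming a hash of the trace ID as the head-sampling coin flip so that every service reaches the same verdict for a given trace. The function names and the `slow_ms` threshold are hypothetical.

```python
import hashlib


def head_sample(trace_id: str, rate: float) -> bool:
    """Decide at request start, using only the trace ID.

    Hashing keeps the decision deterministic per trace.
    """
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big")  # uniform in [0, 2**64)
    return bucket < int(rate * 2**64)


def tail_sample(trace_id: str, is_error: bool, duration_ms: float,
                rate: float, slow_ms: float = 1000.0) -> bool:
    """Decide after completion, when the outcome is known.

    Errors and slow requests are always kept; the rest fall back
    to the probabilistic rate.
    """
    if is_error or duration_ms >= slow_ms:
        return True
    return head_sample(trace_id, rate)


# Kept regardless of the 1% base rate, because the request failed.
print(tail_sample("trace-a", is_error=True, duration_ms=12.0, rate=0.01))
# prints True
```

The tail decision needs the finished trace in hand, which is where the extra overhead in the table comes from.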
| Signal | Status | Feature |
|---|---|---|
| Traces | Stable | W3C context |
| Metrics | Stable | Temporality |
| Logs | Stable | Trace correlation |
| Profiling | Experimental | Continuous profiling |
traceparent: {version}-{trace-id}-{parent-id}-{flags}
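A minimal Python sketch of parsing that header, assuming the version `00` layout from W3C Trace Context: 2, 32, 16 and 2 lowercase hex characters. `TraceParent` and `parse_traceparent` are illustrative names, and the sketch deliberately rejects anything that does not match this exact four-field shape.

```python
import re
from typing import NamedTuple, Optional

_TRACEPARENT = re.compile(
    r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})"
)


class TraceParent(NamedTuple):
    version: str
    trace_id: str
    parent_id: str
    flags: int

    @property
    def sampled(self) -> bool:
        # Bit 0 of the flags byte is the "sampled" flag.
        return bool(self.flags & 0x01)


def parse_traceparent(header: str) -> Optional[TraceParent]:
    """Parse a traceparent header value; return None if it is invalid."""
    match = _TRACEPARENT.fullmatch(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid per the spec.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return TraceParent(version, trace_id, parent_id, int(flags, 16))


tp = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
print(tp.trace_id, tp.sampled)
# prints 4bf92f3577b34da6a3ce929d0e0e4736 True
```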
You can't run observability infra the same size as production.
- Liz Fong-Jones
| Issue | Impact | Fix |
|---|---|---|
| High cardinality | Slow queries | Aggregation |
| Unbounded labels | OOM, index blow-up | Label policies |
Danger: unbounded values such as user_id or request_id used as metric labels
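The arithmetic behind the warning: the worst-case number of time series for one metric is the product of the distinct values of each label. The Python sketch below uses made-up value counts; `series_upper_bound`, `forbidden_labels` and the deny-list are hypothetical helpers standing in for a label policy.

```python
from math import prod


def series_upper_bound(label_values: dict) -> int:
    """Worst-case series count for one metric.

    `label_values` maps label name -> number of distinct values.
    """
    return prod(label_values.values())


def forbidden_labels(labels, forbidden=("user_id", "request_id")):
    """Return the labels that a simple deny-list policy would reject."""
    return [name for name in labels if name in forbidden]


bounded = {"method": 5, "status": 10, "route": 50}
print(series_upper_bound(bounded))
# prints 2500

# Adding a per-user label multiplies the bound by the number of users.
unbounded = {**bounded, "user_id": 100_000}
print(series_upper_bound(unbounded))
# prints 250000000
print(forbidden_labels(unbounded))
# prints ['user_id']
```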
RED method. For every service: Rate, Errors, Duration
| Signal | Metric | Question |
|---|---|---|
| Rate | req/sec | How busy? |
| Errors | fail/sec | Breaking? |
| Duration | latency | How slow? |
Instrument at code level, not just infrastructure
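One way to instrument at code level is a decorator that records rate, errors and duration around a handler. This Python sketch is an in-memory illustration only: `RedStats` and `instrument` are hypothetical names, the nearest-rank p95 is a simplification, and nothing here is thread-safe or tied to a real metrics library.

```python
import time
from functools import wraps


class RedStats:
    """Accumulates raw RED measurements for one service or handler."""

    def __init__(self):
        self.requests = 0
        self.errors = 0
        self.durations = []  # seconds

    def summary(self, window_s: float) -> dict:
        ordered = sorted(self.durations)
        # Nearest-rank style p95; 0.0 when nothing was recorded.
        p95 = ordered[min(len(ordered) - 1, int(0.95 * len(ordered)))] if ordered else 0.0
        return {
            "rate_per_s": self.requests / window_s,
            "errors_per_s": self.errors / window_s,
            "p95_s": p95,
        }


def instrument(stats: RedStats):
    """Wrap a handler so every call updates rate, errors and duration."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            except Exception:
                stats.errors += 1
                raise
            finally:
                stats.requests += 1
                stats.durations.append(time.perf_counter() - start)
        return wrapper
    return decorator


stats = RedStats()


@instrument(stats)
def handle(order_id: int) -> str:
    if order_id < 0:
        raise ValueError("invalid order")
    return "ok"
```

Calling `handle` a few times and then `stats.summary(window_s=60.0)` answers the three questions in the table for that handler.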
Debug Unknown Unknowns
Instrument first, decide what to alert on later.