Observability Mastery | Bot Army SRE

3 Classic Pillars

30s Target MTTD

OTel Standard

2.0 New Paradigm

Metrics: Aggregated counts, gauges, histograms. Tools: Prometheus, InfluxDB

Logs: Discrete events, timestamps, stack traces. Tools: Loki, Elasticsearch

Traces: Request flow, spans, latency breakdown. Tools: Tempo, Jaeger

Observability is about asking arbitrary questions without shipping new code.

- Charity Majors, Honeycomb

USE method, for every resource: Utilization, Saturation, Errors

Resource	U	S	E
CPU	%busy	Run queue	Errors
Memory	%used	Swap I/O	OOM
Disk	%util	Queue	SMART
Network	BW	Retrans	Errors
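One way to read the USE table is as a per-resource checklist. The sketch below is a minimal illustration under stated assumptions: the UseReading type, the use_findings function, and the 0.8 utilization limit are hypothetical names and values chosen for the example, not part of any tool listed here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class UseReading:
    """One USE sample for a single resource (hypothetical shape)."""
    resource: str
    utilization: float  # fraction of time busy, 0.0 to 1.0
    saturation: float   # queued work, e.g. run-queue length
    errors: int         # error events seen in the sample window


def use_findings(reading: UseReading, util_limit: float = 0.8) -> List[str]:
    """Return which USE dimensions look unhealthy for this reading.

    The 0.8 limit is an assumed example threshold, not a recommendation.
    """
    findings = []
    if reading.errors > 0:
        findings.append("errors")
    if reading.saturation > 0:
        findings.append("saturation")
    if reading.utilization >= util_limit:
        findings.append("utilization")
    return findings
```

Errors are checked first here only because they are usually the quickest signal to interpret; the order does not change the result.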

Always sample 100% of errors and slow requests
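The rule above can be sketched as a keep/drop decision made once a request has finished. This is an illustration only: keep_trace, the 1000 ms slow threshold, and the 1% baseline rate are assumed for the example and would need tuning per service.

```python
import random
from typing import Callable


def keep_trace(
    is_error: bool,
    duration_ms: float,
    slow_ms: float = 1000.0,    # assumed threshold for "slow"
    base_rate: float = 0.01,    # assumed baseline rate for ordinary requests
    rng: Callable[[], float] = random.random,
) -> bool:
    """Decide whether to keep a finished trace.

    Errors and slow requests are always kept; everything else is kept
    with probability base_rate.
    """
    if is_error or duration_ms >= slow_ms:
        return True
    return rng() < base_rate
```

Passing rng as a parameter keeps the decision testable; in production the default random.random would normally be used.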

traceparent: {version}-{trace-id}-{parent-id}-{flags}
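To make the header layout concrete, here is a minimal parsing sketch. It assumes the W3C Trace Context field widths for version 00 (2, 32, 16, and 2 lowercase hex characters) and deliberately rejects anything with extra fields, which is stricter than a forward-compatible parser would be. TraceParent and parse_traceparent are hypothetical names.

```python
import re
from typing import NamedTuple, Optional

# version (2 hex) - trace-id (32 hex) - parent-id (16 hex) - flags (2 hex)
_TRACEPARENT = re.compile(
    r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})"
)


class TraceParent(NamedTuple):
    version: str
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceParent]:
    """Parse a traceparent header value, returning None if it is invalid."""
    match = _TRACEPARENT.fullmatch(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid.
    if version == "ff" or set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    # The lowest bit of the flags byte is the "sampled" flag.
    return TraceParent(version, trace_id, parent_id, bool(int(flags, 16) & 0x01))
```

A caller would typically propagate trace_id unchanged and replace parent_id with the ID of its own span before forwarding the header.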

You can't run observability infra the same size as production.

- Liz Fong-Jones

Issue	Impact	Fix
High cardinality	Query slowness	Aggregation
Unbounded labels	OOM, index boom	Label policies

Danger: user_id, request_id as labels (unbounded values)
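A label policy can be as simple as an allowlist applied before a metric is recorded, plus a rough estimate of how many series a label set can produce. The sketch below assumes a hypothetical allowlist (service, method, status_class); the function names are illustrative and not taken from any metrics library.

```python
from typing import Dict

# Hypothetical allowlist: only labels with a small, bounded value set.
ALLOWED_LABELS = {"service", "method", "status_class"}


def sanitize_labels(labels: Dict[str, str]) -> Dict[str, str]:
    """Drop any label not on the allowlist (e.g. user_id, request_id)."""
    return {k: v for k, v in labels.items() if k in ALLOWED_LABELS}


def series_upper_bound(values_per_label: Dict[str, int]) -> int:
    """Worst-case series count: the product of distinct values per label."""
    total = 1
    for count in values_per_label.values():
        total *= count
    return total
```

The product makes the risk visible: 20 services, 5 methods, and 5 status classes give at most 500 series, while adding a label with one million distinct user IDs multiplies that bound by one million.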

RED method, for every service: Rate, Errors, Duration
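The three per-service signals can be derived from a window of completed requests. This is a sketch under assumptions: Request and red_summary are hypothetical names, and duration is summarized as a nearest-rank 95th percentile, which is one common choice rather than the only one.

```python
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass
class Request:
    duration_ms: float
    ok: bool


def red_summary(requests: Sequence[Request], window_s: float) -> Dict[str, float]:
    """Compute rate, error ratio, and p95 duration over one time window."""
    n = len(requests)
    if n == 0:
        return {"rate_per_s": 0.0, "error_ratio": 0.0, "p95_ms": 0.0}
    durations = sorted(r.duration_ms for r in requests)
    failed = sum(1 for r in requests if not r.ok)
    # Nearest-rank percentile: rank = ceil(0.95 * n), in integer arithmetic.
    rank = (95 * n + 99) // 100
    return {
        "rate_per_s": n / window_s,
        "error_ratio": failed / n,
        "p95_ms": durations[rank - 1],
    }
```

In practice a metrics backend would compute these from counters and histograms; the arithmetic is shown directly here only to make the definitions explicit.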

Instrument at code level, not just infrastructure
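Code-level instrumentation can start as small as a decorator around a function. The sketch below uses in-process dictionaries (CALLS, ERRORS, DURATIONS) as stand-ins for a real metrics client; those names and the instrumented decorator are hypothetical and exist only for this example.

```python
import functools
import time
from collections import defaultdict

# Stand-ins for a real metrics client (assumed for illustration).
CALLS = defaultdict(int)
ERRORS = defaultdict(int)
DURATIONS = defaultdict(list)


def instrumented(fn):
    """Record call count, error count, and duration for each call to fn."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        except Exception:
            ERRORS[fn.__name__] += 1
            raise  # re-raise so callers still see the failure
        finally:
            CALLS[fn.__name__] += 1
            DURATIONS[fn.__name__].append(time.perf_counter() - start)
    return wrapper
```

Because the measurement lives next to the business logic, it captures what the code was doing, not just how busy the host was.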

Debug Unknown Unknowns

Instrument first, decide what to alert on later.