Observability Mastery

Three Pillars, Observability 2.0 & Modern Instrumentation

Observability | Technical Operations Excellence

3 Classic Pillars
30s Target MTTD
OTel Standard
2.0 New Paradigm

Three Classic Pillars

Metrics

Aggregated counts, gauges, histograms. Tools: Prometheus, InfluxDB

Logs

Discrete events, timestamps, stack traces. Tools: Loki, Elasticsearch

Traces

Request flow, spans, latency breakdown. Tools: Tempo, Jaeger

Observability 2.0

Observability is about asking arbitrary questions without shipping new code.

- Charity Majors, Honeycomb

Classic (1.0)          | Modern (2.0)
Pre-defined metrics    | Arbitrary queries
Three separate pillars | Wide structured events
Dashboard-driven       | Exploration-driven
Known unknowns         | Unknown unknowns
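
To make "wide structured events" and "arbitrary queries" concrete, here is a minimal sketch using only the Python standard library. The helper names (build_wide_event, query) and the field names are hypothetical and chosen for illustration; a real event store would do the filtering and grouping server-side.

```python
import json
import time


def build_wide_event(service, route, status, duration_ms, **context):
    """Return one wide event: a flat dict holding every field known for a request."""
    event = {
        "timestamp": time.time(),
        "service": service,
        "route": route,
        "status": status,
        "duration_ms": duration_ms,
    }
    # Extra context (user, region, build, ...) rides along on the same event.
    event.update(context)
    return event


def query(events, predicate, group_by):
    """Count events that match `predicate`, grouped by any field chosen at query time."""
    counts = {}
    for event in events:
        if predicate(event):
            key = event.get(group_by)
            counts[key] = counts.get(key, 0) + 1
    return counts


events = [
    build_wide_event("checkout", "/pay", 500, 1200.0, user_id="u1", region="eu", build="a1"),
    build_wide_event("checkout", "/pay", 200, 80.0, user_id="u2", region="us", build="a1"),
    build_wide_event("checkout", "/pay", 500, 950.0, user_id="u3", region="eu", build="b2"),
]

# A question nobody planned for: which regions are producing server errors?
errors_by_region = query(events, lambda e: e["status"] >= 500, "region")
print(json.dumps(errors_by_region))
```

The grouping field is picked when the question is asked, not when the code is instrumented, which is the contrast with pre-defined metrics in the table above.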

USE Method

For every resource: Utilization, Saturation, Errors

Resource | Utilization | Saturation  | Errors
CPU      | % busy      | Run queue   | Errors
Memory   | % used      | Swap I/O    | OOM
Disk     | % util      | Queue       | SMART
Network  | Bandwidth   | Retransmits | Errors
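
The USE checklist can be walked mechanically. The sketch below is one possible way to do that, assuming readings have already been collected; the UseReading type, the use_findings helper, and the 80% utilization threshold are illustrative assumptions, not part of the method itself.

```python
from dataclasses import dataclass


@dataclass
class UseReading:
    resource: str
    utilization: float  # fraction of capacity in use, 0.0 to 1.0
    saturation: float   # queued or waiting work, e.g. run-queue length
    errors: int         # error events seen in the sampling window


def use_findings(readings, util_threshold=0.8):
    """Return (resource, signal) pairs worth investigating.

    The threshold is an assumed default; pick one that fits the resource.
    """
    findings = []
    for r in readings:
        if r.errors > 0:
            findings.append((r.resource, "errors"))
        if r.saturation > 0:
            findings.append((r.resource, "saturation"))
        if r.utilization >= util_threshold:
            findings.append((r.resource, "utilization"))
    return findings


readings = [
    UseReading("cpu", utilization=0.95, saturation=4, errors=0),
    UseReading("disk", utilization=0.30, saturation=0, errors=2),
    UseReading("memory", utilization=0.50, saturation=0, errors=0),
]
findings = use_findings(readings)
print(findings)
```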

Sampling Strategies

Type     | When             | Trade-off
Head     | At request start | Fast, may miss errors
Tail     | After completion | Smart, more overhead
Adaptive | Dynamic rate     | Best of both

Always sample 100% of errors and slow requests
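
A minimal sketch of the difference between the head and tail decisions, using only the standard library. The function names and the 1000 ms "slow" threshold are assumptions for illustration; real collectors implement tail sampling by buffering complete traces before deciding.

```python
import hashlib


def head_sample(trace_id: str, rate: float) -> bool:
    """Decide at request start, using only the trace ID.

    Hashing the ID makes the decision deterministic, so every service
    that sees the same trace ID reaches the same verdict.
    """
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big")  # uniform in [0, 2**64)
    return bucket < int(rate * 2**64)


def tail_sample(trace_id: str, is_error: bool, duration_ms: float,
                rate: float, slow_ms: float = 1000.0) -> bool:
    """Decide after completion, when the outcome and latency are known."""
    if is_error or duration_ms >= slow_ms:
        return True  # keep 100% of errors and slow requests
    return head_sample(trace_id, rate)  # sample the healthy remainder


print(tail_sample("trace-a", is_error=True, duration_ms=12.0, rate=0.0))
print(tail_sample("trace-b", is_error=False, duration_ms=12.0, rate=0.0))
```

The head decision cannot see `is_error` or `duration_ms`, which is why it "may miss errors"; the tail decision can, at the cost of holding the request's data until it finishes.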

OpenTelemetry Standard

Signal    | Status       | Feature
Traces    | Stable       | W3C context
Metrics   | Stable       | Temporality
Logs      | Stable       | Trace correlation
Profiling | Experimental | Continuous

traceparent: {version}-{trace-id}-{parent-id}-{flags}
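
The header layout above can be parsed with a few lines of standard-library Python. This is a sketch for version 00 of the W3C Trace Context format only (exactly four lowercase-hex fields); parse_traceparent is a hypothetical helper name, and the sample value is the example used in the W3C specification.

```python
import re

# version (2 hex) - trace-id (32 hex) - parent-id (16 hex) - flags (2 hex)
_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def parse_traceparent(header: str):
    """Return the header's fields as a dict, or None if it is invalid."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version "ff" and all-zero IDs are invalid per the specification.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return {
        "version": version,
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),  # lowest flag bit = sampled
    }


ctx = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
print(ctx)
```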

Cardinality & Cost

You can't run observability infra the same size as production.

- Liz Fong-Jones

Issue            | Impact          | Fix
High cardinality | Query slowness  | Aggregation
Unbounded labels | OOM, index boom | Label policies

Danger: user_id or request_id as metric labels (every distinct value creates a new time series)
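
A quick back-of-the-envelope calculation shows why unbounded labels hurt. The sketch below treats the worst-case series count for one metric as the product of the distinct values of each label; the label names and counts are made-up illustrative numbers.

```python
from math import prod


def estimated_series(label_cardinalities: dict) -> int:
    """Upper bound on time series for one metric: product of label cardinalities."""
    return prod(label_cardinalities.values())


bounded = {"service": 20, "env": 3, "status_class": 5}
unbounded = dict(bounded, user_id=100_000)  # one extra high-cardinality label

print(estimated_series(bounded))
print(estimated_series(unbounded))
```

With these assumed numbers, the bounded label set tops out at 300 series, while adding user_id raises the ceiling to 30,000,000. Real counts are usually lower because not every combination occurs, but the growth is multiplicative either way.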

RED Method

For every service: Rate, Errors, Duration

Signal   | Metric   | Question
Rate     | req/sec  | How busy?
Errors   | fail/sec | Breaking?
Duration | latency  | How slow?
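
As a sketch of how the three RED signals fall out of raw request records, the helper below summarizes one time window. The red_summary name, the (duration_ms, ok) tuple shape, and the choice of a nearest-rank p95 for Duration are assumptions for illustration.

```python
def red_summary(requests, window_seconds):
    """Summarize one window of requests as Rate, Errors, Duration.

    `requests` is a list of (duration_ms, ok) tuples seen in the window.
    """
    total = len(requests)
    failures = sum(1 for _, ok in requests if not ok)
    durations = sorted(duration for duration, _ in requests)
    if total:
        rank = (95 * total + 99) // 100  # ceil(0.95 * total), integer math
        p95 = durations[rank - 1]        # nearest-rank percentile
    else:
        p95 = None
    return {
        "rate_per_sec": total / window_seconds,
        "errors_per_sec": failures / window_seconds,
        "p95_ms": p95,
    }


# 18 fast successes and 2 slow failures observed over a 10-second window.
window = [(10.0, True)] * 18 + [(500.0, False), (900.0, False)]
summary = red_summary(window, window_seconds=10)
print(summary)
```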

Instrumentation Best Practices

  • Standardize naming: Consistent metric/span names
  • Add context: Include service, env, version labels
  • Correlate signals: Trace IDs across logs/metrics
  • Sample wisely: 100% errors, sample successes

Instrument at code level, not just infrastructure
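
One way to apply "add context" and "correlate signals" at code level is to stamp every log line with the service, env, and version labels plus the active trace ID. The sketch below uses only the standard library (logging, contextvars); the class names, the current_trace_id variable, and the sample values are hypothetical, and a tracing SDK would normally supply the trace ID.

```python
import contextvars
import io
import json
import logging

# Holds the trace ID of the request currently being handled.
current_trace_id = contextvars.ContextVar("current_trace_id", default="-")


class ContextFilter(logging.Filter):
    """Attach static labels and the active trace ID to every record."""

    def __init__(self, service, env, version):
        super().__init__()
        self.static = {"service": service, "env": env, "version": version}

    def filter(self, record):
        record.trace_id = current_trace_id.get()
        for key, value in self.static.items():
            setattr(record, key, value)
        return True  # never drop the record, only enrich it


class JsonFormatter(logging.Formatter):
    """Render each record as one structured JSON line."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            "service": record.service,
            "env": record.env,
            "version": record.version,
            "trace_id": record.trace_id,
        })


buffer = io.StringIO()  # stand-in for stdout or a log shipper
handler = logging.StreamHandler(buffer)
handler.addFilter(ContextFilter("checkout", "prod", "1.4.2"))
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("checkout_demo")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

current_trace_id.set("4bf92f3577b34da6a3ce929d0e0e4736")
logger.info("payment authorized")

entry = json.loads(buffer.getvalue())
print(entry)
```

Because the trace ID appears in the log line, a slow span found in a trace can be joined to the log entries emitted while it ran.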

Debug Unknown Unknowns

Instrument first; decide what to alert on later.