Infrastructure Reliability

Kubernetes, Databases, TSDB & Observability Backends

Infrastructure | Technical Operations Excellence

3 K8s Probes
N+2 Redundancy
IaC GitOps Pattern
mTLS Service Mesh

Kubernetes Reliability

Component | Purpose | Key Config
HPA | Scale pods out | CPU/memory targets
VPA | Right-size pods | updateMode: Off
PDB | Protect availability | minAvailable: 2

HPA and VPA conflict when both act on the same metrics (CPU or memory). Run VPA in recommendation-only mode (updateMode: Off) and apply its suggestions manually.
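The conflict is easier to see from the scaling rule in the Kubernetes HPA documentation: desired replicas = ceil(current replicas × current metric / target metric), with a default tolerance of 10% around the target. For utilization targets the metric is measured relative to the pod's request, so if VPA changes the request, the HPA's input shifts underneath it. The following is a minimal sketch of that rule only; the function name is illustrative, and min/max replica bounds and stabilization windows are left out.

```python
import math

def desired_replicas(current_replicas: int, current_metric: float,
                     target_metric: float, tolerance: float = 0.1) -> int:
    """Replica count the HPA rule would aim for, given a single metric."""
    ratio = current_metric / target_metric
    # Inside the tolerance band the replica count is left alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)

# 4 pods averaging 90% CPU utilization against a 60% target
print(desired_replicas(4, 90, 60))
```

With the request held fixed, 4 pods at 90% against a 60% target scale to 6. If VPA raised the CPU request at the same time, the measured utilization would drop and the HPA could scale back in, which is the feedback loop the recommendation-only mode avoids.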

Kubernetes Probes

Probe | Purpose | When
Startup | Container started | First (slow apps)
Liveness | Container running | Catch deadlocks
Readiness | Ready for traffic | Load balancer

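Liveness: keep checks lightweight and independent of downstream dependencies. Let fatal errors crash the process so the container restarts on its own; don't rely on the liveness probe to restart it.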
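As a rough illustration of the liveness/readiness split, here is a minimal sketch of probe endpoints using only the Python standard library. The paths `/healthz` and `/readyz`, the `STATE` dictionary, and port 8080 are assumptions chosen for the example, not anything Kubernetes requires.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical application state; a real service would set "ready" once
# caches are warm, migrations have run, and so on.
STATE = {"ready": False}

def probe_response(path: str, ready: bool) -> tuple[int, str]:
    """Map a probe path to (status code, body)."""
    if path == "/healthz":
        # Liveness: the process can answer at all. No dependency calls here.
        return 200, "ok"
    if path == "/readyz":
        # Readiness: only accept traffic once the app says it is ready.
        return (200, "ready") if ready else (503, "not ready")
    return 404, "not found"

class ProbeHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        status, body = probe_response(self.path, STATE["ready"])
        payload = body.encode()
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

def serve(port: int = 8080) -> None:
    """Blocks forever; call this from the container entrypoint."""
    HTTPServer(("0.0.0.0", port), ProbeHandler).serve_forever()
```

A pod spec would then point its livenessProbe at `/healthz` and its readinessProbe at `/readyz`. Keeping the liveness path trivial means a slow database cannot trigger a restart loop; it only takes the pod out of rotation through readiness.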

Database Reliability

Pattern | Use Case
Read replicas | Scale read traffic
Multi-AZ | HA failover
Sharding | Horizontal scale
Connection pooling | Limit connections

Replication ≠ Backup: corrupted or accidentally deleted data replicates to every copy, so keep independent backups as well.
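To make the connection pooling pattern concrete, here is a minimal sketch of a bounded pool. It uses an in-memory SQLite database only so the example runs anywhere; the class name `BoundedPool` and its parameters are illustrative, and a production service would normally use the pool shipped with its driver or a proxy such as PgBouncer.

```python
import queue
import sqlite3
from contextlib import contextmanager

class BoundedPool:
    """Hands out at most `size` connections; extra callers wait or time out."""

    def __init__(self, size: int, dsn: str = ":memory:"):
        self._idle = queue.Queue(maxsize=size)
        for _ in range(size):
            # check_same_thread=False lets a connection move between threads.
            self._idle.put(sqlite3.connect(dsn, check_same_thread=False))

    @contextmanager
    def connection(self, timeout: float = 1.0):
        # Blocks until a connection is free; raises queue.Empty on timeout.
        conn = self._idle.get(timeout=timeout)
        try:
            yield conn
        finally:
            # Always return the connection, even if the caller raised.
            self._idle.put(conn)

pool = BoundedPool(size=2)
with pool.connection() as conn:
    print(conn.execute("SELECT 1").fetchone())
```

The fixed queue size is what limits connections: the database never sees more than `size` sessions from this process, and excess demand turns into waiting at the client instead of connection exhaustion at the server.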

Resource Management

Resource | Limit Strategy
CPU | Requests = P50, Limits = P99
Memory | Request = Limit (no overcommit, fewer OOM kills)
Ephemeral storage | Set a limit to prevent node-pressure eviction

Profile under real production load to set accurate requests
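The P50/P99 rule for CPU can be sketched as a small calculation over usage samples. This assumes the samples are CPU usage in millicores collected from production; the nearest-rank percentile method and the function names are choices made for the example.

```python
import math

def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: smallest sample with at least p% of data at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

def cpu_settings(millicores: list[float]) -> dict[str, str]:
    """Derive a CPU request (P50) and limit (P99) in Kubernetes millicore notation."""
    return {
        "request": f"{math.ceil(percentile(millicores, 50))}m",
        "limit": f"{math.ceil(percentile(millicores, 99))}m",
    }

# Hypothetical usage samples, in millicores
print(cpu_settings([100, 120, 150, 200, 900]))
```

With only five samples the P99 is simply the maximum; a useful estimate needs a sampling window long enough to include peak traffic.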

Secrets Management

HashiCorp Vault core capabilities:

Feature | Benefit
Dynamic secrets | Short-lived, on-demand
Encryption as a service | Transit secrets engine
Identity-based access | RBAC, namespaces
Audit logging | SIEM integration
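The "short-lived, on-demand" property of dynamic secrets comes down to every credential carrying a time-to-live. The sketch below models that idea in isolation; `Lease` is a hypothetical record invented for illustration, not a Vault API type, and the two-thirds renewal point is an assumed policy.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Lease:
    """Hypothetical record of a short-lived credential; not a Vault API type."""
    secret_id: str
    issued_at: float    # seconds since epoch
    ttl_seconds: float

    def expires_at(self) -> float:
        return self.issued_at + self.ttl_seconds

    def is_expired(self, now: float) -> bool:
        return now >= self.expires_at()

    def should_renew(self, now: float, fraction: float = 2 / 3) -> bool:
        # Renew once a chosen fraction of the TTL has elapsed, but not after expiry.
        renew_at = self.issued_at + self.ttl_seconds * fraction
        return renew_at <= now < self.expires_at()

lease = Lease("db-creds", issued_at=1000.0, ttl_seconds=300.0)
print(lease.should_renew(now=1250.0), lease.is_expired(now=1300.0))
```

Because the credential expires on its own, a leaked copy is only useful until the TTL runs out, which is the main advantage over long-lived static passwords.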

Service Mesh

Feature | Benefit
mTLS | Encrypted service-to-service
Traffic mgmt | Canary, A/B, retries
Observability | Distributed tracing
Circuit breaking | Prevent cascade failures

Start simple; add mesh when complexity justifies overhead
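Circuit breaking is one of the features that can be started simply in application code before a mesh is introduced. The following is a minimal sketch of the pattern, assuming a consecutive-failure threshold and a fixed reset timeout; the class and parameter names are illustrative, and a mesh sidecar would apply comparable logic outside the application.

```python
import time

class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""

class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                raise CircuitOpenError("failing fast; dependency marked unhealthy")
            # Timeout elapsed: half-open, so let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

Wrapping an outbound call as `breaker.call(fetch_inventory, item_id)` (where `fetch_inventory` is whatever client function the service already has) makes callers fail fast while the dependency is down, instead of piling up blocked requests that cascade upstream.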

Observability Backends

Signal | OSS Stack | Key Feature
Metrics | Prometheus, Mimir | PromQL, federation
Logs | Loki, OpenSearch | LogQL, labels
Traces | Tempo, Jaeger | Trace correlation

Grafana unifies all three signals in one UI
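Trace correlation depends on logs carrying the same trace ID as the spans. The sketch below extracts the trace ID from a W3C Trace Context `traceparent` header (version, 32-hex trace ID, 16-hex parent ID, flags) and attaches it to a JSON log line. The function names and the `trace_id` field name are assumptions for the example; a real service would usually let an OpenTelemetry SDK handle this.

```python
import json
import re

# W3C Trace Context header: version-traceid-parentid-flags, all lowercase hex.
_TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")

def trace_id_from(traceparent: str):
    """Return the trace ID from a traceparent header, or None if it is invalid."""
    match = _TRACEPARENT.match(traceparent.strip())
    # An all-zero trace ID is invalid under the spec.
    if not match or match.group(2) == "0" * 32:
        return None
    return match.group(2)

def log_line(message: str, traceparent: str, **fields) -> str:
    """Build a JSON log line that carries the trace ID when one is available."""
    record = {"msg": message, **fields}
    trace_id = trace_id_from(traceparent)
    if trace_id:
        record["trace_id"] = trace_id
    return json.dumps(record, sort_keys=True)

header = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
print(log_line("checkout failed", header, status=502))
```

With the ID present in every log line, a log backend can link a log entry straight to the matching trace in Tempo or Jaeger.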

Time Series Databases

TSDB | Best For
InfluxDB | IoT, high cardinality
Prometheus | K8s metrics, alerts
kdb+ | Finance, ultra-low latency
VictoriaMetrics | Long-term retention
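Cardinality matters when choosing among these because each unique combination of label values is stored as a separate series. A rough upper bound for one metric is the product of the distinct values per label, sketched below with made-up label counts.

```python
from math import prod

def series_cardinality(label_values: dict[str, int]) -> int:
    """Upper bound on series for one metric: product of distinct values per label."""
    return prod(label_values.values())

# Hypothetical label counts for a single request-latency metric
base = {"service": 20, "endpoint": 50, "status": 5}
print(series_cardinality(base))
print(series_cardinality({**base, "user_id": 10_000}))
```

Adding one unbounded label such as a user ID multiplies the series count by its number of values, which is why such identifiers usually belong in logs or traces instead of metric labels.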

Cattle, Not Pets

Infrastructure should be reproducible and replaceable.
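In a GitOps workflow, reproducibility comes from a reconcile loop that compares the declared state in Git with what is actually running. The sketch below shows only the diff step, under the assumption that both states are plain dictionaries keyed by resource name; the function name and action labels are illustrative and not taken from any specific tool.

```python
def reconcile(desired: dict[str, dict], actual: dict[str, dict]) -> list[tuple[str, str]]:
    """Return (action, name) pairs that would move `actual` to `desired`."""
    actions = []
    for name, spec in sorted(desired.items()):
        if name not in actual:
            actions.append(("create", name))
        elif actual[name] != spec:
            # Replace instead of patching in place: cattle, not pets.
            actions.append(("replace", name))
    # Anything running that is not declared gets removed.
    for name in sorted(actual.keys() - desired.keys()):
        actions.append(("delete", name))
    return actions

desired = {"web": {"image": "web:2"}, "worker": {"image": "worker:1"}}
actual = {"web": {"image": "web:1"}, "cron": {"image": "cron:1"}}
print(reconcile(desired, actual))
```

Because drifted resources are replaced from the declared spec, no individual instance holds state that cannot be rebuilt.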