Infrastructure Reliability | Bot Army SRE

K8s Probes: 3

Redundancy: N+2

GitOps Pattern: IaC

Service Mesh: mTLS

Kubernetes Reliability

Component	Purpose	Key Config
HPA	Scale pods out	CPU/memory targets
VPA	Right-size pods	updateMode: Off
PDB	Protect availability	minAvailable: 2

HPA and VPA conflict when both act on the same metric (CPU or memory). Run VPA in recommendation-only mode (updateMode: Off) and let HPA do the scaling.
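
To make the HPA row concrete, here is a minimal sketch of the replica calculation described in the Kubernetes HPA documentation: desired = ceil(current * currentMetric / targetMetric), skipped when the ratio is within a tolerance of 1.0. The function name and parameters are illustrative, and the sketch leaves out min/max replica clamping and stabilization windows.

```python
import math


def desired_replicas(current_replicas: int,
                     current_metric: float,
                     target_metric: float,
                     tolerance: float = 0.1) -> int:
    """Replica count from the documented HPA ratio formula (illustrative helper)."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    ratio = current_metric / target_metric
    # Inside the tolerance band the controller leaves the replica count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)
```

For example, 4 pods averaging 90% CPU against a 60% target scale out to 6, while 63% against the same target stays at 4 because the ratio is inside the tolerance band.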

Kubernetes Probes

Probe	Checks	Use When
Startup	Container has started	Slow-starting apps; runs first and holds off the other probes
Liveness	Container is still running	Catching deadlocks and hangs
Readiness	Ready for traffic	Gating Service / load balancer traffic

Liveness: keep checks lightweight. On a fatal error, let the process crash so the kubelet restarts it; don't rely on the liveness probe to detect it.
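
A sketch of what separate liveness and readiness endpoints could look like in an application, using only the Python standard library. The paths /healthz and /readyz, the HealthState class, and the serve helper are assumptions for illustration; the probe paths and ports in the pod spec would have to match whatever the app exposes.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class HealthState:
    """Shared flag, set once startup work (cache warm-up, DB connect) is done."""

    def __init__(self) -> None:
        self.ready = threading.Event()


def make_handler(state: HealthState):
    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path == "/healthz":
                # Liveness: no dependency checks, only "the process still answers".
                self._reply(200, b"ok")
            elif self.path == "/readyz":
                # Readiness: 503 until the app can actually serve traffic.
                if state.ready.is_set():
                    self._reply(200, b"ready")
                else:
                    self._reply(503, b"not ready")
            else:
                self._reply(404, b"not found")

        def _reply(self, code: int, body: bytes) -> None:
            self.send_response(code)
            self.send_header("Content-Type", "text/plain")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, format, *args):
            pass  # keep probe traffic out of the logs

    return Handler


def serve(state: HealthState, host: str = "0.0.0.0", port: int = 8080) -> HTTPServer:
    """Start the health server on a background thread and return it."""
    server = HTTPServer((host, port), make_handler(state))
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

The liveness handler deliberately checks nothing downstream: if it probed the database, a database outage would restart every pod instead of just taking them out of rotation via readiness.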

Database Reliability

Pattern	Use Case
Read replicas	Scale read traffic
Multi-AZ	HA failover
Sharding	Horizontal scale
Connection pooling	Limit connections

Replication ≠ Backup: corrupted or deleted data replicates to every replica, so keep separate backups as well
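
The connection pooling row can be sketched as a fixed-size pool that hands out at most `size` connections and makes callers wait (or time out) when all are in use. BoundedPool and its factory argument are hypothetical names; production setups would more likely use the pool built into the database driver or a proxy such as PgBouncer.

```python
import queue
from contextlib import contextmanager
from typing import Any, Callable, Iterator


class BoundedPool:
    """Fixed-size pool: never more than `size` connections exist."""

    def __init__(self, factory: Callable[[], Any], size: int) -> None:
        self._idle: "queue.Queue[Any]" = queue.Queue(maxsize=size)
        for _ in range(size):
            self._idle.put(factory())

    @contextmanager
    def connection(self, timeout: float = 5.0) -> Iterator[Any]:
        try:
            # Blocks until a connection is returned or the timeout expires.
            conn = self._idle.get(timeout=timeout)
        except queue.Empty:
            raise TimeoutError("pool exhausted") from None
        try:
            yield conn
        finally:
            # Always hand the connection back, even if the caller raised.
            self._idle.put(conn)
```

Usage would look like `pool = BoundedPool(lambda: sqlite3.connect(":memory:"), size=2)` followed by `with pool.connection() as conn: ...`; the database only ever sees two connections regardless of how many callers queue up.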

Resource Management

Resource	Limit Strategy
CPU	Request = P50 usage, Limit = P99 usage
Memory	Request = Limit (no overcommit; exceeding the limit still OOM-kills the container)
Ephemeral storage	Set a limit to prevent node-pressure eviction

Profile in production to set accurate requests
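
As a sketch of the CPU row, the helper below turns observed usage samples into a request at P50 and a limit at P99 using the nearest-rank percentile. The function names and the sample values are made up for illustration; real samples would come from your metrics backend.

```python
import math
from typing import Dict, Sequence


def percentile(samples: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: smallest sample with at least p% of data at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


def suggest_cpu(samples_millicores: Sequence[float]) -> Dict[str, str]:
    """CPU request at P50 and limit at P99, formatted as Kubernetes millicore strings."""
    return {
        "request": f"{round(percentile(samples_millicores, 50))}m",
        "limit": f"{round(percentile(samples_millicores, 99))}m",
    }


# Hypothetical per-minute CPU usage in millicores, including one spike.
usage = [100, 120, 110, 400, 130, 105, 115, 125, 135, 140]
print(suggest_cpu(usage))
```

With so few samples the P99 is simply the largest value, which is why longer observation windows give more trustworthy limits.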

Secrets Management

HashiCorp Vault core capabilities:

Feature	Benefit
Dynamic secrets	Short-lived, on-demand
Encryption as a service	Transit secrets engine
Identity-based access	RBAC, namespaces
Audit logging	SIEM integration

Service Mesh

Feature	Benefit
mTLS	Encrypted service-to-service
Traffic mgmt	Canary, A/B, retries
Observability	Distributed tracing
Circuit breaking	Prevent cascade failures

Start simple; add mesh when complexity justifies overhead
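
To illustrate the circuit breaking row, here is a minimal in-process sketch of the pattern: after a run of consecutive failures the breaker opens and fails fast, then lets a single trial call through once a cool-down has passed. A mesh applies the same idea in the proxy rather than in application code. The class, its parameters, and the thresholds are assumptions for illustration, and the sketch is not thread-safe.

```python
import time
from typing import Any, Callable, Optional


class CircuitOpenError(RuntimeError):
    """Raised when the breaker rejects a call without trying it."""


class CircuitBreaker:
    def __init__(self, max_failures: int = 3, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_failures = max_failures
        self.reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after:
                # Open: protect the struggling dependency by failing fast.
                raise CircuitOpenError("circuit open; failing fast")
            # Half-open: cool-down elapsed, allow one trial call through.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self.max_failures:
                # Open the circuit, or re-open it after a failed trial call.
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

Passing the clock in as a parameter keeps the breaker testable without sleeping; callers would wrap each outbound request as `breaker.call(fetch, url)` and treat CircuitOpenError as a signal to fall back or shed load.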

Observability Backends

Signal	OSS Stack	Key Feature
Metrics	Prometheus, Mimir	PromQL, federation
Logs	Loki, OpenSearch	LogQL, labels
Traces	Tempo, Jaeger	Trace correlation

Grafana unifies all three signals in one UI

Time Series Databases

TSDB	Best For
InfluxDB	IoT; high cardinality on the 3.x engine
Prometheus	K8s metrics, alerts
kdb+	Finance, ultra-low latency
VictoriaMetrics	Long-term retention

Cattle, Not Pets

Infrastructure should be reproducible and replaceable.