Industry Leaders | Bot Army SRE

Metric	Value
Google Eng Cap	50%
Netflix Deploys/Day	100s
HRO Principles	5
AWS Pillars	6

Google SRE Principles

Principle	Application
50% Rule	Max 50% time on ops/toil
Error Budgets	Balance reliability vs velocity
SLO-based	Objective reliability targets
Blameless	Focus on systems, not people
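
The 50% rule and error budgets in the table above both reduce to simple arithmetic, sketched below. This is a minimal illustration and not Google's tooling: the names `SLOWindow`, `can_ship`, and `toil_within_cap` are hypothetical, and the 99.9% target in the usage note is an assumed value.

```python
from dataclasses import dataclass


@dataclass
class SLOWindow:
    """Request counts for one SLO measurement window (hypothetical model)."""

    slo_target: float      # e.g. 0.999 means 99.9% of requests should succeed
    total_requests: int
    failed_requests: int

    def error_budget(self) -> float:
        """Number of failed requests the SLO tolerates in this window."""
        return (1.0 - self.slo_target) * self.total_requests

    def budget_remaining(self) -> float:
        """Tolerated failures not yet consumed; negative means overspent."""
        return self.error_budget() - self.failed_requests

    def can_ship(self) -> bool:
        """Assumed policy: risky releases proceed only while budget remains."""
        return self.budget_remaining() > 0


def toil_within_cap(toil_hours: float, total_hours: float, cap: float = 0.5) -> bool:
    """Check the 50% rule: ops/toil time must not exceed `cap` of total time."""
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    return toil_hours / total_hours <= cap
```

For example, `SLOWindow(0.999, 1_000_000, 400)` has a budget of about 1,000 failed requests, so `can_ship()` returns `True`; a team logging 25 toil hours in a 40-hour week fails `toil_within_cap(25, 40)`.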

class SRE implements interface DevOps

Netflix Chaos Engineering

"The best way to avoid failure is to fail constantly."

Tool	What It Does
Chaos Monkey	Randomly kills instances
Latency Monkey	Injects network delays
Chaos Gorilla	Simulates AZ failure
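
The three tools above can be modeled against an in-memory fleet to show what each one does. This is a toy simulation, not Netflix's implementation: `Instance`, `Fleet`, and the three functions are hypothetical names, and nothing here calls a real cloud API.

```python
import random
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Instance:
    instance_id: str
    zone: str
    alive: bool = True
    extra_latency_ms: int = 0


@dataclass
class Fleet:
    instances: List[Instance] = field(default_factory=list)

    def healthy(self) -> List[Instance]:
        """Instances that are still running."""
        return [i for i in self.instances if i.alive]


def kill_random_instance(fleet: Fleet, rng: random.Random) -> Optional[Instance]:
    """Chaos Monkey style: terminate one healthy instance chosen at random."""
    candidates = fleet.healthy()
    if not candidates:
        return None
    victim = rng.choice(candidates)
    victim.alive = False
    return victim


def inject_latency(fleet: Fleet, rng: random.Random, delay_ms: int) -> Optional[Instance]:
    """Latency Monkey style: add artificial delay to one healthy instance."""
    candidates = fleet.healthy()
    if not candidates:
        return None
    target = rng.choice(candidates)
    target.extra_latency_ms += delay_ms
    return target


def fail_zone(fleet: Fleet, zone: str) -> List[Instance]:
    """Chaos Gorilla style: take down every healthy instance in one zone."""
    killed = [i for i in fleet.instances if i.zone == zone and i.alive]
    for instance in killed:
        instance.alive = False
    return killed
```

A resilience test would then assert that the service still meets its SLO with whatever `fleet.healthy()` returns after each experiment. Passing a seeded `random.Random` keeps a failing run reproducible.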

2014 AWS reboot event: about 10% of EC2 instances were restarted for a security patch; Netflix reported no downtime

SRE Evolution

Era	Period	Focus
Chaos Years	1990-2005	Cowboy ops
DevOps	2005-2015	Automation
SRE	2014-2018	Reliability
Platform	2018-Now	Developer UX

Tool Evolution

2000s: Nagios, Puppet, Chef
2010s: Docker, K8s, Prometheus, Terraform
2020s: OpenTelemetry, GitOps, AI/ML Ops

Industry Best Practices

Company	Key Contribution
Amazon	Well-Architected (6 pillars)
Meta	Production Eng, SEV culture
Spotify	Squads/Tribes, golden paths
Toyota	Kaizen, Jidoka, JIT

Mission-Critical Lessons

Industry	Lesson
NASA	Checklists, redundancy, simulation
Aviation	Crew resource mgmt, near-miss analysis
Nuclear	Defense in depth, safety culture
Finance	Ultra-low latency, compliance

High-Reliability Orgs

Five principles drawn from aviation, nuclear power, and healthcare:

Preoccupation with Failure
Reluctance to Simplify
Sensitivity to Operations
Commitment to Resilience
Deference to Expertise

See HRO Pattern Recognition for a deep dive

Key Takeaways

Automate everything: Eliminate manual toil
Embrace failure: Practicing failure builds resilience
Measure what matters: SLOs drive decisions
Culture first: Blameless enables learning

Learn from the Best

Adopt practices, not just tools.