Capacity & Release Engineering

DORA Metrics, Progressive Delivery & Safe Changes

Capacity & Release | Technical Operations Excellence

  • Deploy Strategies: 4
  • Elite CFR: <5%
  • Canary Size: 1-5%
  • Elite Lead Time: <1h

Release Performance Targets

Metric | Elite Target
Deploy Frequency | On-demand (multiple/day)
Lead Time | <1 hour commit to prod
Change Fail Rate | <5% of deploys cause issues
Time to Restore | <1 hour to recover

Based on DORA research: elite performers achieve 182x higher deploy frequency than low performers
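
The four metrics above can be computed from a plain deploy log. The following is a minimal sketch, assuming a hypothetical `Deploy` record with commit, deploy, and restore timestamps; the field names and the use of medians are illustrative choices, not something DORA prescribes.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class Deploy:
    committed_at: datetime                   # when the change was committed
    deployed_at: datetime                    # when it reached production
    failed: bool = False                     # did this deploy cause an issue?
    restored_at: Optional[datetime] = None   # when service recovered, if it failed


def dora_metrics(deploys: list[Deploy], window_days: int) -> dict:
    """Summarize the four metrics over a window of `window_days` days."""
    if not deploys:
        raise ValueError("need at least one deploy")
    lead_times = [d.deployed_at - d.committed_at for d in deploys]
    failures = [d for d in deploys if d.failed]
    restore_times = [d.restored_at - d.deployed_at
                     for d in failures if d.restored_at is not None]
    return {
        "deploys_per_day": len(deploys) / window_days,
        "median_lead_time": median(lead_times),
        "change_fail_rate": len(failures) / len(deploys),
        "median_time_to_restore": median(restore_times) if restore_times else None,
    }


# Hypothetical one-day history: four deploys, one of which failed.
t0 = datetime(2024, 1, 1, 9, 0)
history = [
    Deploy(t0, t0 + timedelta(minutes=30)),
    Deploy(t0, t0 + timedelta(minutes=50)),
    Deploy(t0, t0 + timedelta(minutes=40), failed=True,
           restored_at=t0 + timedelta(minutes=70)),
    Deploy(t0, t0 + timedelta(minutes=20)),
]
report = dora_metrics(history, window_days=1)
```

Comparing `report` against the targets (lead time under one hour, change fail rate under 5%) shows this hypothetical team meeting the lead-time target but missing the failure-rate one at 25%.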

Deployment Strategies

Canary

Route 1-5% traffic to new version, monitor, expand gradually

Blue-Green

Two identical envs, instant switchover, easy rollback

Feature Flags

Decouple deploy from release, targeted rollouts
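
One common way to get a targeted, gradual rollout from a feature flag is deterministic hash bucketing: each user lands in a stable bucket, and the flag is on for the first N% of buckets. This is a sketch only; `in_rollout` and its parameters are hypothetical names, not the API of any particular flag service.

```python
import hashlib


def in_rollout(flag: str, user_id: str, percent: float) -> bool:
    """Return True if `user_id` falls inside the first `percent` of buckets."""
    # Hash flag and user together so each flag gets an independent split.
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000   # stable bucket 0..9999
    return bucket < percent * 100                          # percent=5 covers 0..499


# Raising the percentage only ever adds users; nobody flips back off.
users = [f"user-{i}" for i in range(10_000)]
at_5 = {u for u in users if in_rollout("new-checkout", u, 5)}
at_50 = {u for u in users if in_rollout("new-checkout", u, 50)}
```

Because the bucket depends only on the flag name and user ID, the same user sees the same behavior on every request, and the code can ship to production with the flag at 0% long before it is released.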

Canary Best Practices

  • One at a time: Avoid signal contamination
  • 5-12 metrics: Monitor error rate, latency, saturation
  • Absolute thresholds: Define rollback criteria upfront
  • Bake time: Allow sufficient observation window
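
The practices above can be sketched as a small decision function: thresholds are absolute and fixed before the rollout, any breach triggers rollback, and promotion waits until every metric has a full observation window. The `Threshold` type, the metric names, and the sample-count stand-in for bake time are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    metric: str
    max_value: float   # absolute ceiling, agreed before the rollout starts


def canary_verdict(samples: dict[str, list[float]],
                   thresholds: list[Threshold],
                   min_samples: int) -> str:
    """Return 'rollback', 'wait', or 'promote' for a single canary."""
    # A breach means rollback immediately, even before bake time is over.
    for t in thresholds:
        if any(v > t.max_value for v in samples.get(t.metric, [])):
            return "rollback"
    # Bake time: every metric needs at least `min_samples` observations.
    if any(len(samples.get(t.metric, [])) < min_samples for t in thresholds):
        return "wait"
    return "promote"


# Hypothetical rollback criteria for two of the monitored metrics.
criteria = [Threshold("error_rate", 0.01), Threshold("p99_latency_ms", 400.0)]
healthy = {"error_rate": [0.002, 0.003, 0.001],
           "p99_latency_ms": [210.0, 250.0, 230.0]}
verdict = canary_verdict(healthy, criteria, min_samples=3)
```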

40%+ of incidents stem from config/deployment errors

Progressive Delivery

Commit → CI/CD → Canary (1-5%) → Rollout → Full Deploy

Stage | Gate
Build | Tests pass, security scan
Canary | Error budget not exceeded
Rollout | Metrics within thresholds
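
The stage gates can be modeled as an ordered list of checks, where the pipeline stops at the first gate that fails. The sketch below assumes a request-based SLO for the canary gate; the function names, the 99.9% SLO, and the traffic numbers are illustrative, and the Build and Rollout gates are stubbed to always pass.

```python
from typing import Callable

Gate = Callable[[], bool]


def error_budget_remaining(slo: float, total: int, failed: int) -> float:
    """Fraction of the error budget left; negative means it is overspent."""
    allowed = (1.0 - slo) * total      # failures the SLO tolerates
    if allowed <= 0:
        raise ValueError("a 100% SLO or zero traffic leaves no budget to measure")
    return 1.0 - failed / allowed


def run_pipeline(stages: list[tuple[str, Gate]]) -> str:
    """Advance through stages in order; stop at the first gate that fails."""
    for name, gate in stages:
        if not gate():
            return f"halted at {name}"
    return "full deploy"


def example(failed_canary_requests: int) -> str:
    # Assumed: 99.9% SLO over 1,000,000 requests, so about 1,000 allowed failures.
    stages = [
        ("Build", lambda: True),
        ("Canary", lambda: error_budget_remaining(
            0.999, 1_000_000, failed_canary_requests) >= 0),
        ("Rollout", lambda: True),
    ]
    return run_pipeline(stages)
```

With 400 failed requests the canary gate passes and the pipeline reaches full deploy; with 1,500 the budget is overspent and the pipeline halts at the canary stage.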

NALSD Framework

Non-Abstract Large System Design - 4 essential questions:

Question | Focus
Is it possible? | Can we build it at all?
Can we do better? | Optimize design choices
Is it feasible? | Cost, time, resources
Is it resilient? | Graceful degradation

Capacity Planning

Component | Approach
Demand Forecast | Historical trends + growth models
Headroom | N+1 minimum, N+2 for critical
Load Testing | Regular stress tests at 2x expected
Auto-scaling | HPA/VPA with proper limits
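
The headroom and auto-scaling rows reduce to short arithmetic. The sketch below assumes per-instance throughput is known from load testing; `instances_needed` is a hypothetical helper, and `hpa_desired_replicas` follows the replica formula documented for the Kubernetes Horizontal Pod Autoscaler, ceil(currentReplicas × currentMetric / targetMetric), clamped to min/max limits. The real controller's tolerance band and stabilization windows are left out.

```python
import math


def instances_needed(peak_rps: float, rps_per_instance: float, spare: int = 1) -> int:
    """N instances to carry the peak, plus `spare` extra (1 for N+1, 2 for N+2)."""
    if rps_per_instance <= 0:
        raise ValueError("rps_per_instance must be positive")
    return math.ceil(peak_rps / rps_per_instance) + spare


def load_test_target(expected_peak_rps: float, factor: float = 2.0) -> float:
    """Traffic level to stress-test at (2x expected by default)."""
    return expected_peak_rps * factor


def hpa_desired_replicas(current_replicas: int, current_metric: float,
                         target_metric: float,
                         min_replicas: int, max_replicas: int) -> int:
    """Scale proportionally to metric pressure, then clamp to the limits."""
    desired = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, desired))


# Hypothetical service: 4,500 rps forecast peak, 500 rps per instance.
standard = instances_needed(4500, 500, spare=1)   # N+1
critical = instances_needed(4500, 500, spare=2)   # N+2
```

The `max_replicas` clamp is the "proper limits" part: without it, a runaway metric could scale the service past what downstream dependencies or the budget can absorb.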

Change Risk Categories

Tier | Examples | Process
Low | Config, docs | Auto-deploy
Medium | App code | Canary + review
High | Infra, DB schema | Change board
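
The tiers can be applied mechanically by tagging each change with the kinds of things it touches and taking the highest tier among them. This is a sketch under assumptions: the kind labels in `KIND_TO_TIER` are hypothetical, and treating unknown kinds as high risk is a conservative choice rather than something the table specifies.

```python
from enum import IntEnum


class Tier(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


PROCESS = {
    Tier.LOW: "Auto-deploy",
    Tier.MEDIUM: "Canary + review",
    Tier.HIGH: "Change board",
}

# Hypothetical labels for what a change touches, mapped to the tiers above.
KIND_TO_TIER = {
    "config": Tier.LOW,
    "docs": Tier.LOW,
    "app_code": Tier.MEDIUM,
    "infra": Tier.HIGH,
    "db_schema": Tier.HIGH,
}


def required_process(kinds: set[str]) -> str:
    """Pick the process for the riskiest kind; unknown kinds count as HIGH."""
    if not kinds:
        raise ValueError("a change must touch at least one kind")
    tier = max(KIND_TO_TIER.get(kind, Tier.HIGH) for kind in kinds)
    return PROCESS[tier]


mixed_change = required_process({"docs", "app_code"})
```

A change that touches both docs and application code is routed through canary and review, because the riskiest component decides the process.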

Launch Checklist

  • SLOs defined and dashboards ready
  • Runbooks documented
  • Rollback procedure tested
  • On-call coverage confirmed
  • Load test completed

Ship Fast, Ship Safe

Elite teams deploy frequently with low failure rates.