Production as a Learning System
8/13/2025
operations · reliability · observability
TL;DR
Design production so that it teaches you, with feedback that is fast, safe, and specific.
Moves
- Counterfactual alerts: alarms that propose explanations.
- Explorable dashboards: drill down from business metrics to system-level views.
- Chaos drills: rehearsed failure with success criteria.
- SOAR postmortems: strengths, opportunities, actions, results.
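One way to read "alarms that propose explanations" is an alert that carries its candidate hypotheses with it and reports which ones the current metrics support. The sketch below is only an illustration under that reading: the `Hypothesis` and `Alert` classes, the metric names, and the thresholds are all hypothetical and not taken from any particular monitoring tool.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Metrics = Dict[str, float]


@dataclass
class Hypothesis:
    """A candidate explanation plus a check that supports or rejects it."""
    explanation: str
    is_supported: Callable[[Metrics], bool]


@dataclass
class Alert:
    """An alarm that ships with the explanations it wants tested (hypothetical design)."""
    name: str
    fires: Callable[[Metrics], bool]
    hypotheses: List[Hypothesis]

    def evaluate(self, metrics: Metrics) -> List[str]:
        """Return the explanations the metrics support, or [] if the alert is quiet."""
        if not self.fires(metrics):
            return []
        supported = [h.explanation for h in self.hypotheses if h.is_supported(metrics)]
        # An alert that fires with no matching hypothesis is itself a learning signal.
        return supported or ["no listed hypothesis fits; investigate manually"]


# Illustrative alert: metric names and thresholds are assumptions.
checkout_latency = Alert(
    name="checkout_p99_latency",
    fires=lambda m: m["p99_latency_ms"] > 800,
    hypotheses=[
        Hypothesis("recent deploy regressed latency",
                   lambda m: m["minutes_since_deploy"] < 30),
        Hypothesis("database connection pool is saturated",
                   lambda m: m["db_pool_utilization"] > 0.9),
    ],
)

sample = {"p99_latency_ms": 950.0, "minutes_since_deploy": 240.0, "db_pool_utilization": 0.97}
print(checkout_latency.evaluate(sample))
```

With the sample metrics, the deploy hypothesis is rejected (the last deploy was four hours ago) and the connection-pool hypothesis is supported, so the responder starts with a specific explanation instead of a bare threshold breach.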
Checklist
- What hypothesis does each alert test?
- Can we turn any outage into a 30-min learning artifact?
- Do we practice failures the system is designed to survive?
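The last question is easier to answer when each drill has success criteria written down before the failure is injected. As a rough sketch, a drill could be scored against those criteria as follows. The `Criterion` type, the `score_drill` helper, and every metric name and limit here are hypothetical, chosen only to illustrate the idea.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Criterion:
    """One success criterion: the observed value must not exceed the limit."""
    metric: str
    limit: float


@dataclass
class DrillResult:
    passed: bool
    failures: List[str]


def score_drill(observed: Dict[str, float], criteria: List[Criterion]) -> DrillResult:
    """Compare what was observed during the drill against the pre-agreed criteria."""
    failures: List[str] = []
    for c in criteria:
        value = observed.get(c.metric)
        if value is None:
            # A criterion nobody measured counts as a failure, not a pass.
            failures.append(f"{c.metric}: not measured")
        elif value > c.limit:
            failures.append(f"{c.metric}: {value} exceeds limit {c.limit}")
    return DrillResult(passed=not failures, failures=failures)


# Illustrative criteria for a rehearsed database failover (assumed values).
criteria = [
    Criterion("failover_seconds", 60.0),
    Criterion("error_rate_pct", 1.0),
    Criterion("minutes_to_detect", 5.0),
]
observed = {"failover_seconds": 42.0, "error_rate_pct": 2.5}

result = score_drill(observed, criteria)
print(result.passed)
for failure in result.failures:
    print(failure)
```

Treating an unmeasured criterion as a failure is a deliberate choice in this sketch: a drill that cannot say how long detection took has not yet taught the team anything about detection.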