Part 1 and Part 2 are here – main takeaways below, but I recommend reading the raw papers for more detail. Overall, I think that we do a pretty good job of this, but especially as we move towards Operational Responsibility, Baseline, and PRX working more closely together, it's worth reviewing for some ideas of where things fail in a distributed incident response environment. Some of my notes below:
- It was really cool to see the rigor with which they analyzed the response rather than the incident itself – there are a lot of papers about “bad bugs” but less about “collective multi-engineer debugging.”
- Organizational complexity often mirrors system complexity (Conway’s Law), complicating incident response.
- How important decentralized agency / autonomy is in incident response – “plans / actions” are often poorly decomposed and require high-trust operators to go off script.
- Reminds me a lot of patterns in immunology with antibodies / T-cells / the general sympathetic nervous response, as well as trauma triage principles in emergency medicine.
- Willingness to trigger an “immune response” and bring in expertise rather than fearing the repercussions of over-escalating.
Thematically, the most important takeaway for me is that preventative design is not enough: while it’s really important to design our software to prevent catastrophic failure, it’s probably equally valuable to design our human / organizational response systems so that they quickly resolve these catastrophes when they happen.