I think there's something subtly powerful about being able to take data / logic and surround it with rhetoric and prose. Most tools separate these out across Word / Excel / Powerpoint, but the success of dashboarding tools and other data-driven visualization tools makes me really excited about the kinds of tools we should be focused on building as technologists.
Sunday, February 23, 2020
Saturday, February 15, 2020
Incident Response
Part 1 and Part 2 are here – main takeaways below, but I recommend reading the raw papers for more detail. Overall, I think we do a pretty good job of this, but especially as we move towards Operational Responsibility, Baseline, and PRX working closely together, it's worth reviewing for ideas about where things fail in a distributed incident response environment. Some of my notes below:
- It was really cool to see the rigor with which they analyzed the response rather than the incident itself – there are a lot of papers about “bad bugs” but less about “collective multi-engineer debugging.”
- Organizational complexity often mirrors system complexity (Conway’s Law), complicating incident response.
- Decentralized agency / autonomy is critical in incident response – “plans / actions” are often poorly decomposed and require high-trust operators to go off script.
- Reminds me a lot of patterns in immunology with antibodies / T-cells / the general sympathetic nervous response, as well as trauma triage principles in emergency medicine.
- Willingness to mount an “immune response” and bring in expertise rather than fearing the repercussions of over-escalating.
Thematically, the most important takeaway for me is that preventative design is not enough: while it’s really important to design our software to prevent catastrophic failure, it’s probably equally valuable to design our human / organizational response systems in a way that quickly resolves these catastrophes.