Behind the Curtain: The Anatomy of a Real Major Incident
Failures are inevitable. But when they occur, our goal should be to resolve them as quickly and efficiently as possible. The PagerDuty Community has developed an open-source DevOps approach to Incident Response based on the Incident Command System (ICS). That product-agnostic process has helped many organizations set up Incident Response processes that resolve technical issues as quickly and effectively as possible. There’s a lot of documentation you can follow, but how do incidents actually play out in real-time when they happen?
In this talk, you get to peek behind the curtain to see what happens inside the walls of the company that’s known for waking you up to tell you there’s a technical incident you need to deal with. We will walk through a cascading failure that happened when unexpected errors were found happening in one of our Kafka clusters. I share all the gritty details of this actual incident, as they occurred, and use that as a way to demonstrate the structure of how we apply ICS to resolving technical problems for DevOps teams.
Attendees of this talk will walk away with an understanding of how to effectively manage complex technical incidents across multiple DevOps teams, the additional roles that are necessary to support technical responders, how to structure a blameless post-mortem, and how to start developing a similar process in their own organizations. This talk delves into technical concepts due to the nature of the problem, but it is mostly focused on the mechanics of managing any technical incident with a DevOps team.
About George Miranda
George Miranda is a Community Advocate at PagerDuty, an infrastructure engineer, and a former EMT & First Responder. He is passionate about the systems we use to effectively manage crisis situations. He is the author of the O’Reilly eBook, “The Service Mesh: Resilient Service-to-Service Communication for Cloud Native Applications”.
Prior to working for software vendors like Buoyant and Chef Software, he worked as an on-call engineer for over 15 years. He enjoys his time roaming the world as a Digital Nomad (with a permanent home in the American Pacific Northwest), small batch artisanal whiskey, and writing third-person biographies no one reads.