Organizing an incident postmortem

Two weeks ago we had a major incident at work. It triggered a series of events that resulted in a big loss. One of my colleagues organized a postmortem, and I found the way he ran it interesting. Here is how he did it:

  • Gather everyone involved with the incident in a room.
  • What?
    • Give each person a few Post-it notes and ask them to write down the sequence of events from their own point of view: only what they experienced at the time, not things they discovered afterwards and not things that didn’t happen to them.
      • Good example: “On Monday at 10AM, I saw this alert.”
      • Bad example: “The server stopped working at 8AM.” No one experienced the server not working, but someone experienced the alert.
    • Build a timeline. Collect each person’s notes and post them on the wall in chronological order. If more than one person experienced the same event, you can keep just one of the notes.
  • Why?
    • Root cause analysis: Now that you know what happened from people’s point of view, you need to understand what really happened.
      • Ask each person to write down the root causes they know of for events on the timeline (e.g. “There was a bug in module X that shut the server down, which triggered the alert”).
      • Collect these root causes and stick them next to their respective events.
  • Solution
    • For every root cause, discuss how it can be prevented in the future and whose responsibility it is to take care of it.
    • Some root causes will be unknown, needing further investigation. This is also a good time to discuss investigation tasks.
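
If you want to keep the wall in a file afterwards, the three steps above can be sketched in a few lines of Python. This is only my own illustration, not something we used in the meeting: the names `Event`, `build_timeline` and `needs_investigation` are made up, and treating two notes as "the same event" only when their time and text match exactly is a simplifying assumption.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional


@dataclass
class Event:
    when: datetime
    what: str                                  # what someone personally observed
    observers: List[str] = field(default_factory=list)
    root_cause: Optional[str] = None           # None means "still unknown"
    owner: Optional[str] = None                # who takes care of the follow-up


def build_timeline(notes):
    """Merge (author, when, what) notes into a time-ordered list of events.

    Assumption: notes with identical time and text describe the same event,
    so they are collapsed into one Event with several observers.
    """
    merged = {}
    for author, when, what in notes:
        event = merged.setdefault((when, what), Event(when, what))
        event.observers.append(author)
    return sorted(merged.values(), key=lambda e: e.when)


def needs_investigation(timeline):
    """Events whose root cause nobody in the room could name yet."""
    return [e for e in timeline if e.root_cause is None]
```

You would feed `build_timeline` one tuple per Post-it note, fill in `root_cause` and `owner` during the "Why?" and "Solution" steps, and whatever `needs_investigation` returns at the end becomes the list of investigation tasks.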

What I like about this method is that it separates the technical reasons from the process reasons. Sometimes there is only a small bug, but with the wrong process it takes a long time to react. Other times something major happens, but because the process is right, people react to it quickly. Separating the process reasons from the technical reasons during the analysis phase helps fix both.