I recently found this SlideShare deck about how to run a post-mortem: Post Mortems for Humans

I am a firm believer that if you want a reliable, problem-resistant system, you have to improve it incrementally by asking questions and systematically eradicating problems as they arise. Designing a system in a vacuum to be perfect simply does not work in practice, because the subtlety and variety of environmental issues are unknown at design time. Really, you have to design in a facility for handling the problems you did not design for.

The slideshow also talks about humor and using past mistakes to defuse the tension of a post-mortem. This resonates with me as well, as I try to use humor to break through barriers all the time (probably not apparent from the writing on this blog, however…).

Labeling a root cause as “human error” is basically worthless. This is well articulated in the slideshow. In fact, using root causes at all is a fool’s errand. I recently had this debate with co-workers as we tried to decide which of the several reasons something broke was the One True Root Cause. Who cares? The more important question is how we prevent the broadest range of possible future problems. Too many times I have seen people get down to a single root cause and then fix that one exact problem, only to have a similar (but new!) issue pop up the next day.

What do you think?

About Kit Merker

Product Manager @ Google - working on Kubernetes / Google Container Engine.
