Musings on Recovery Oriented Computing: Part 1

Weak Road
Photo by Tim Green aka atoach

“a safe structure will be the one whose weakest link is never overloaded by the greatest force to which the structure is subjected” Petroski 1992

A couple of years ago, I had the pleasure of seeing Dave Patterson speak about recovery oriented computing at a Microsoft event. I was looking up some of his presentations online today and stumbled on this quote. I thought I’d briefly share it, along with a few things from the talk that stood out to me. I’ll break this up into a few posts.

Automation Irony

If you have a system, it’s going to fail. For a system to scale up to large demands, it needs to be able to avoid and recover from failure. One way to solve this problem is to automate recovery. The usual approach is to take all the repetitive, common tasks and write software or build robots to do them for you.

Of course, once all the easy, common tasks have been automated away, what remains is a subset of management tasks that are difficult, unpredictable, or rare. And since your administrators no longer spend much time managing the system, they are relatively unfamiliar with it and may not be able to resolve these problems.

This creates a catch-22! If you want your system to scale reliably, you have to automate tasks, but if you automate tasks, your system may become catastrophically unreliable.

To solve this, you need your team to play an active role in operating and maintaining the system. Of course, automation plays a role, but think about how you can create tools that give people leverage rather than a humanless sealed room. Make people responsible for managing various aspects of the system, create better, actionable monitoring, improve your designs to expect failure, and think about the whole system, including the “soft”, “hard”, and “wet” (software, hardware, and people).
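To make the idea of a leverageable tool a little more concrete, here is a minimal Python sketch of my own, not something from the talk. It assumes a hypothetical RecoveryPolicy that tries an automated fix a limited number of times and then hands the problem to a person, along with a record of what the automation already tried. The names RecoveryPolicy, restart_only, and the failure strings are all made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class RecoveryPolicy:
    """Hypothetical policy: try an automated fix a few times, then involve a human."""
    max_auto_attempts: int = 2
    escalations: List[str] = field(default_factory=list)

    def handle(self, failure: str, auto_fix: Callable[[str], bool]) -> str:
        attempts = []
        for attempt in range(1, self.max_auto_attempts + 1):
            ok = auto_fix(failure)
            attempts.append(f"attempt {attempt}: {'ok' if ok else 'failed'}")
            if ok:
                return "recovered automatically"
        # Automation gave up: record the failure plus the attempt history,
        # so the operator starts with context instead of a bare alarm.
        self.escalations.append(f"{failure} ({'; '.join(attempts)})")
        return "escalated to operator"


def restart_only(failure: str) -> bool:
    """Toy automated fix (an assumption): it only knows how to handle crashes."""
    return failure == "process crashed"


policy = RecoveryPolicy()
print(policy.handle("process crashed", restart_only))
print(policy.handle("disk corruption", restart_only))
print(policy.escalations)
```

The common failure is handled without anyone noticing, while the rare one reaches a person with the attempt history attached. That keeps the team in the loop instead of sealing them out of the room.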

To be continued…

About Kit Merker

Product Manager @ Google - working on Kubernetes / Google Container Engine.