“a safe structure will be the one whose weakest link is never overloaded by the greatest force to which the structure is subjected” Petroski 1992
A couple of years ago, I had the pleasure of seeing Dave Patterson speak about recovery-oriented computing at a Microsoft event. I was looking up some of his presentations online today and stumbled on this quote. I thought I’d briefly share it, along with a few things from the talk that stood out to me. I’ll break this up into a few posts.
If you have a system, it’s going to fail. For a system to scale to large demands, it needs to be able to avoid failure and recover from it. One way to solve this problem is to automate recovery. The usual approach is to take all the repetitive, common tasks and write software or build robots that do them for you.
Of course, once all the easy, common tasks have been automated away, what remains is a subset of management tasks that are difficult, unpredictable, or rare. And since your administrators no longer spend much time managing the system, they are relatively unfamiliar with it and may not be able to resolve these problems.
This creates a catch-22! If you want your system to scale reliably, you have to automate tasks, but if you automate tasks, the failures that remain are the ones nobody is practiced at handling, so your system may become catastrophically unreliable.
To solve this, you need your team to play an active role in operating and maintaining the system. Of course, automation plays a role, but think about how you can create tools that give people leverage rather than a humanless sealed room. Make people responsible for managing various aspects of the system, create better, actionable monitoring, design with the expectation of failure, and think about the whole system, including the “soft”, “hard”, and “wet” (software, hardware, and people).
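To make the "tools, not a sealed room" idea a little more concrete, here is a minimal Python sketch. It is only an illustration and not something from the talk: the names `Alert`, `recover`, `restart`, and `escalate` are all hypothetical. The assumption is that automation gets a bounded number of tries at the common fix and, when that fails, hands a person everything it learned instead of silently retrying forever.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Alert:
    """Actionable context handed to a human operator (hypothetical structure)."""
    component: str
    attempts: int
    errors: List[str] = field(default_factory=list)


def recover(component: str,
            restart: Callable[[], None],
            escalate: Callable[[Alert], None],
            max_attempts: int = 3) -> bool:
    """Try the automated fix a bounded number of times, then hand off to a person.

    Returns True if automation recovered the component, False if it escalated.
    """
    errors: List[str] = []
    for attempt in range(1, max_attempts + 1):
        try:
            restart()  # the routine, automatable repair
            return True
        except Exception as exc:
            # Keep what went wrong so the operator does not start from zero.
            errors.append(f"attempt {attempt}: {exc}")
    # Automation is out of ideas: page a human with the full history.
    escalate(Alert(component=component, attempts=max_attempts, errors=errors))
    return False
```

In this sketch `escalate` could be anything that reaches a person, such as a function that files a ticket or sends a page. The point is that the operator stays in the loop and receives enough detail to act on.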