Musings on Recovery Oriented Computing: Part 2

Software Failure

Photo by fireflythegreat

This is the second post related to Recovery Oriented Computing.  Here’s where you can find the first one.  I’m also throwing in another scalability/resiliency topic not from ROC.

Design for Failure

“If a problem has no solution, it may not be a problem, but a fact, not to be solved, but to be coped with over time.”

–Shimon Peres

Many software systems are not designed for failure. The traditional software development lifecycle includes a heavy testing & quality assurance phase to try to weed out bugs and eliminate failure points. Of course, this is too late in the process to actually improve quality. Everyone knows that quality comes from upstream and has to be “designed in”.

What if “failure” in a system is actually not a problem, but a fact?

Assuming that’s true, quality in your system is not defined as “avoiding failure” but rather as “expecting failure as inevitable, and coping with it.” Coping could mean not spreading failure further, failing fast, and not holding unique and irreplaceable data at the moment of failure. The point is that your design mode shifts from a focus on prevention to the realm of preparation and recovery. You can also apply this same thinking to disaster recovery and business continuity at a much higher level.
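To make “failing fast” and “not spreading failure further” a bit more concrete, here is a minimal sketch of a circuit breaker in Python. This is one common way to cope with a failing dependency, not something taken from the ROC material. The class name, the thresholds, and the injectable clock are all assumptions made for illustration.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call instead of trying the dependency."""


class CircuitBreaker:
    """Hypothetical helper: stop calling a dependency that keeps failing."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # seconds to wait before a trial call
        self.clock = clock                # injectable so the logic is testable
        self.failures = 0
        self.opened_at = None             # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Fail fast: do not wait on a dependency we believe is down.
                raise CircuitOpenError("dependency marked as failed")
            # Cool-down elapsed: allow one trial call. A single further
            # failure will open the circuit again.
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0  # any success resets the failure count
        return result
```

A caller would wrap each outbound request, for example `breaker.call(fetch_profile, user_id)` where `fetch_profile` is whatever function talks to the dependency, and treat `CircuitOpenError` as a signal to degrade gracefully rather than retry.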

Software Recoverability Rule of 3

1 for your customer, 1 for you, and 1 to fail.

When you know a system will fail, and you want to achieve constant uptime for your system, you need to plan ahead for how your system will be used.

  • 1 for your customer – the most important use of your system is by users, paying or otherwise. You should always have 1 instance of your components set aside for the exclusive use of your customers. You never touch it.
  • 1 for you – you’re going to need to be able to investigate, test in production, roll out upgrades & patches, etc. This instance is all yours, and customers will not touch it while you’re using it, giving you great freedom to do as you will.
  • 1 to fail – failure is inevitable in your system, so throw in an extra instance just for failure. Otherwise you might have to infringe on the 1 for your customer when you need to investigate.

Keep in mind that you can switch the roles of these instances as you manage & operate the system. But if you don’t plan for instances to be in these specific roles, you create confusion about which role each one is in at any given time, and that could hurt the customer experience. Or you might get contradictory results during an investigation.
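One way to avoid that confusion is to record the role of every instance explicitly and only ever change roles by swapping them. The sketch below is just an illustration of that idea in Python. The `Role` and `Pool` names, the instance names, and the rule that the spare takes over when the customer instance fails are my assumptions, not a prescribed implementation.

```python
from enum import Enum


class Role(Enum):
    CUSTOMER = "customer"  # 1 for your customer: serves users, never touched
    OPERATOR = "operator"  # 1 for you: upgrades, patches, investigation
    SPARE = "spare"        # 1 to fail: stands by for the inevitable failure


class Pool:
    """Hypothetical registry that keeps exactly one instance in each role."""

    def __init__(self, customer, operator, spare):
        if len({customer, operator, spare}) != 3:
            raise ValueError("the rule of 3 needs three distinct instances")
        self.roles = {
            customer: Role.CUSTOMER,
            operator: Role.OPERATOR,
            spare: Role.SPARE,
        }

    def instance_for(self, role):
        """Return the name of the instance currently holding the given role."""
        for name, current in self.roles.items():
            if current is role:
                return name
        raise LookupError(role)

    def swap(self, a, b):
        """Exchange the roles of two instances, so no role is ever left empty."""
        self.roles[a], self.roles[b] = self.roles[b], self.roles[a]

    def handle_failure(self, failed):
        """Assumed policy: if the customer instance fails, the spare takes over.

        The failed instance becomes the new spare, to be repaired out of the
        customer path. Returns the instance now serving customers.
        """
        if self.roles[failed] is Role.CUSTOMER:
            self.swap(failed, self.instance_for(Role.SPARE))
        return self.instance_for(Role.CUSTOMER)
```

With `pool = Pool("a", "b", "c")`, calling `pool.handle_failure("a")` hands customer traffic to `"c"` while `"b"` stays free for your own work, so the question “which instance is in which role right now?” always has a single answer.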

Other topics?

What do you want to hear more about?  What are your biggest concerns related to software disaster prevention & preparation?  Drop me a line or leave a comment!

About Kit Merker

Product Manager @ Google - working on Kubernetes / Google Container Engine.

5 Responses to Musings on Recovery Oriented Computing: Part 2

  1. BJacks says:

    So how can you get customers to accept the inevitability of failure? That seems like a difficult message to market….even if it is true.

  2. Kit Merker says:

    Great question. A couple things come to mind. First, you might actually build credibility by being an honest vendor that says “failure is inevitable at a component level, but we know how to make it work overall for your business”, rather than just making promises that you can’t keep.

    Second, if you’re running a SaaS service, website, etc., you may not have to “disclose” this to customers since it’s really just an operational issue that will lead to higher availability and preparedness for disaster.
