Design for Failure
“If a problem has no solution, it may not be a problem, but a fact, not to be solved, but to be coped with over time.”
Many software systems are not designed for failure. The traditional software development lifecycle includes a heavy testing & quality assurance phase to try to weed out bugs and eliminate failure points. Of course, this is too late in the process to actually improve quality; quality comes from upstream and has to be “designed in”.
What if “failure” in a system is actually not a problem, but a fact?
Assuming that’s true, quality in your system is not defined as “avoiding failure” but rather as “expecting failure as inevitable, and coping with it.” Coping could mean not spreading failure further, failing fast, and making sure no unique, irreplaceable data is lost when something fails. The point is that your design mode changes from a focus on prevention to one of preparation and recovery. You can also apply this same thinking to disaster recovery and business continuity at a much higher level.
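One common way to "fail fast" and keep a failure from spreading is a circuit breaker: after a few consecutive errors from a dependency, callers stop waiting on it and get an immediate error instead. The post does not name this pattern, so treat the following as a minimal sketch of one possible approach. The class name, the thresholds, and the injectable `clock` parameter are all assumptions made for illustration.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected without trying the dependency."""


class CircuitBreaker:
    """Hypothetical helper: stops calling a dependency that keeps failing."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # seconds to wait before trying again
        self.clock = clock                # injectable so the sketch is testable
        self.failures = 0
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Fail fast: do not pile more load on a failing dependency.
                raise CircuitOpenError("dependency marked as failed")
            # Cool-down elapsed: close the circuit and start counting again.
            self.opened_at = None
            self.failures = 0
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0  # any success resets the failure count
        return result
```

Wrapping each outbound call in `breaker.call(...)` means a broken dependency costs callers one quick exception rather than a long timeout, which is one way of coping with a failure instead of passing it along.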
Software Recoverability Rule of 3
1 for your customer, 1 for you, and 1 to fail.
When you know a system will fail, and you want to achieve constant uptime for your system, you need to plan ahead for how your system will be used.
- 1 for your customer – the most important use of your system is by users, paying or otherwise. You should always have 1 instance of your components set aside for the exclusive use of your customers. You don’t touch it ever.
- 1 for you – you’re going to need to be able to investigate, test in production, roll out upgrades & patches, etc. This instance is all yours and customers will not touch it while you’re using it, giving you great freedom to do as you will.
- 1 to fail – Failure is inevitable in your system, so throw in an extra instance just for failure. Otherwise you might have to infringe on the 1 for your customer when you need to investigate.
Keep in mind that you can switch the roles of these instances as you manage & operate the system. But if you don’t plan for instances to fill these specific roles, or if you create confusion about which role each one is in at any given time, you could end up hurting the customer experience. Or you might get contradictory results during investigation.
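To make the three roles and the role switching concrete, here is a small sketch of explicit role bookkeeping. Everything in it (the `Role` enum, the `InstancePool` class, the instance names) is hypothetical and only illustrates the idea of always knowing which instance holds which role. It is not a description of any particular tool.

```python
from enum import Enum


class Role(Enum):
    CUSTOMER = "customer"  # 1 for your customer: never touched by operators
    OPERATOR = "operator"  # 1 for you: investigation, upgrades, patches
    SPARE = "spare"        # 1 to fail: takes over when something breaks


class InstancePool:
    """Hypothetical bookkeeping for the Rule of 3."""

    def __init__(self, customer, operator, spare):
        self.assignment = {
            Role.CUSTOMER: customer,
            Role.OPERATOR: operator,
            Role.SPARE: spare,
        }
        self.unhealthy = set()

    def swap_roles(self, first, second):
        """Exchange the instances holding two roles, in one explicit step."""
        self.assignment[first], self.assignment[second] = (
            self.assignment[second],
            self.assignment[first],
        )

    def report_failure(self, name):
        """Mark an instance as failed; promote the spare if customers were on it."""
        self.unhealthy.add(name)
        if self.assignment[Role.CUSTOMER] == name:
            if self.assignment[Role.SPARE] in self.unhealthy:
                raise RuntimeError("no healthy spare left to promote")
            self.swap_roles(Role.CUSTOMER, Role.SPARE)


pool = InstancePool(customer="node-a", operator="node-b", spare="node-c")
pool.report_failure("node-a")
print(pool.assignment[Role.CUSTOMER])  # node-c now serves customers
```

Because every switch goes through `swap_roles`, there is a single place that records which instance is in which role, which is the kind of planning that avoids the confusion described above. The failed instance ends up in the spare slot, where it can be investigated without touching the one serving customers.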
What do you want to hear more about? What are your biggest concerns related to software disaster prevention & preparation? Drop me a line or leave a comment!