If we accept that “software will fail” is a fact, not a problem, then we need to look at how to cope with failure. I’d like to discuss one coping strategy: isolating components.
Some failures cause ripple effects that spread through an entire system, in some cases bringing down other components or damaging irreplaceable data. The larger and more interconnected your system, the greater the impact of the “domino effect.”
Older systems were designed monolithically: one system whose sub-components were more or less neatly contained within an overall hierarchical design. External connections were rare and usually required cumbersome links or processes to transfer data, electronically or otherwise. The world has changed, though.
We’re all connected
Today’s applications rely on distributed communication and external services, and they are expected to be highly available, so you have to architect and design accordingly. And as people and businesses become more dependent on software for their daily operations (everything from calling mom and placing orders to reading books and running a website), the fragility only increases.
So how do you effectively separate the dominoes? It’s going to depend a lot on your system, but there are patterns that can help. Following a Service-Oriented Architecture (SOA) will help to isolate and separate your components. This is a common architectural style that breaks applications into services with clear boundaries that communicate over loosely coupled interfaces.
Using stateless servers will also make you more resilient to domino effects. Because no server holds state that the others lack, any server can handle any request, and the remaining servers can cover when one of them fails.
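To make the “cover for each other” idea concrete, here is a minimal sketch of client-side failover across a pool of stateless servers. The names `call_with_failover`, `healthy_server`, and `broken_server` are made up for illustration, and the servers are plain functions standing in for real network calls; in practice a load balancer usually does this job.

```python
from typing import Callable, List, Sequence


class AllServersFailed(Exception):
    """Raised when no server in the pool could handle the request."""


def call_with_failover(servers: Sequence[Callable[[str], str]], request: str) -> str:
    """Try each server in turn and return the first successful response.

    This only works because the servers are stateless: any one of them
    can handle any request, so it does not matter which one answers.
    """
    errors: List[Exception] = []
    for server in servers:
        try:
            return server(request)
        except ConnectionError as exc:
            # This server is unavailable; remember why and try the next one.
            errors.append(exc)
    raise AllServersFailed(f"{len(errors)} server(s) failed for {request!r}")


# Hypothetical stand-ins for real servers.
def healthy_server(request: str) -> str:
    return f"handled {request}"


def broken_server(request: str) -> str:
    raise ConnectionError("server is down")


if __name__ == "__main__":
    # The first server fails, the second one covers for it.
    print(call_with_failover([broken_server, healthy_server], "order-1"))
```

The caller never needs to know which server answered, which is exactly what keeps one server’s failure from becoming the user’s problem.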
Database sharding and horizontal partitioning will also allow you to split your customer data across multiple sites or instances, limiting how far data corruption or loss can spread. If one set of data is corrupted or lost, at least it won’t affect everyone.
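As a rough sketch of how sharding limits the blast radius, the example below routes each customer to one of several shards by hashing the customer ID. `shard_for` and `ShardedStore` are hypothetical names, and the in-memory dictionaries stand in for separate database instances.

```python
import hashlib
from typing import Dict, List, Optional


def shard_for(customer_id: str, shard_count: int) -> int:
    """Map a customer ID to a shard index in a stable, repeatable way."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


class ShardedStore:
    """Toy store where each shard is an independent dictionary."""

    def __init__(self, shard_count: int) -> None:
        self.shards: List[Dict[str, dict]] = [{} for _ in range(shard_count)]

    def put(self, customer_id: str, record: dict) -> None:
        index = shard_for(customer_id, len(self.shards))
        self.shards[index][customer_id] = record

    def get(self, customer_id: str) -> Optional[dict]:
        index = shard_for(customer_id, len(self.shards))
        return self.shards[index].get(customer_id)


if __name__ == "__main__":
    store = ShardedStore(shard_count=4)
    store.put("alice", {"plan": "pro"})
    # Losing one shard only affects the customers that live on it.
    store.shards[shard_for("alice", 4)].clear()
    print(store.get("alice"))
```

One caveat with simple modulo routing like this: changing the number of shards remaps most customers, so real systems often use a lookup table or consistent hashing instead.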
Keep canned food on hand
When you isolate components, you need to prepare them to operate for some period of time without a connection to the services they depend on. This is analogous to having plenty of fresh water, food, and other supplies on hand during a storm. Applied to your application, this might mean caching critical data locally in case a service you depend on becomes unavailable.
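Here is a minimal sketch of that kind of local cache. It assumes the dependency signals an outage by raising `ConnectionError`; `CachedClient` and its `fetch` parameter are names invented for this example, not part of any particular library.

```python
from typing import Any, Callable, Dict, Tuple


class CachedClient:
    """Wraps a remote lookup and keeps the last good answer for each key."""

    def __init__(self, fetch: Callable[[str], Any]) -> None:
        self._fetch = fetch              # hypothetical call to the remote service
        self._cache: Dict[str, Any] = {}

    def get(self, key: str) -> Tuple[Any, bool]:
        """Return (value, from_cache).

        Fresh data is preferred. If the service is unreachable, fall back
        to the last value we saw. If we never saw one, re-raise the error.
        """
        try:
            value = self._fetch(key)
        except ConnectionError:
            if key in self._cache:
                return self._cache[key], True
            raise
        self._cache[key] = value
        return value, False
```

Returning the `from_cache` flag lets the caller decide whether to warn the user that the data may be out of date.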
When you have multiple copies of data, it’s very important that you have a strategy to manage versioning & master/slave relationships. Having slightly stale data is better than nothing in a disaster scenario, but it’s not sustainable forever. At some point the component needs to get updated so it can present fresh data to users.
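One way to sketch this is to give every copy a version number and a timestamp. The replica below ignores updates that are older than what it already holds and refuses to serve data past a staleness limit. `Replica`, `Versioned`, and `max_stale_seconds` are illustrative names, and the caller passes in the current time to keep the example easy to follow.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Versioned:
    value: str
    version: int       # assigned by the master copy, higher means newer
    updated_at: float  # when this replica last received the value


class Replica:
    """Local copy of data that tolerates staleness only up to a limit."""

    def __init__(self, max_stale_seconds: float) -> None:
        self.max_stale_seconds = max_stale_seconds
        self._entries: Dict[str, Versioned] = {}

    def apply_update(self, key: str, value: str, version: int, now: float) -> bool:
        """Accept an update unless it is older than what we already hold."""
        current = self._entries.get(key)
        if current is not None and version < current.version:
            return False  # out-of-order update, keep the newer copy
        self._entries[key] = Versioned(value, version, now)
        return True

    def read(self, key: str, now: float) -> Optional[str]:
        """Return the value, or None if it is missing or too stale to trust."""
        entry = self._entries.get(key)
        if entry is None or now - entry.updated_at > self.max_stale_seconds:
            return None
        return entry.value
```

How long stale data stays acceptable is a business decision more than a technical one, so treat the limit as something to tune per kind of data.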
The question is not whether software will fail, but rather how you handle it. Just like a pandemic spreading through inadvertent interactions between people, failures can quickly infiltrate your system if you aren’t careful.
Design for isolation and independence and you will keep your failures from turning into system-wide meltdowns. There are many patterns you can apply to your system design that will improve scalability and resilience, although there is no silver bullet.
What have you found improves reliability & resilience? Have you been able to build a high-scale system that isolates failures? What’s the biggest challenge you’ve faced with scaling a system?