According to a new survey from CA Technologies, global organizations are losing 127 million hours of employee productivity to IT downtime and outages. This translates into North American businesses losing an average of $159K per year and suffering at least 10 hours of downtime. In Europe, the survey estimates that businesses are losing €17 billion per year because they aren’t protecting mission-critical systems from downtime.
The figures are astonishing, and they show that despite advances in technology we aren’t close to solving this problem. Even the promise of public cloud platforms does not fix it; just look at the Amazon Web Services downtime counter. You still need to protect your application.
With cloud, you may not have to worry about the hardware anymore, but you are now a relatively anonymous tenant in a massive cloud operating system. Before, at least, the people running the systems really cared about you. Now you have to be ready for unexpected systemic failures, or you will become a helpless bystander watching your service go down while customers rethink their options.
What can be done? I don’t pretend to have all the answers. Prioritizing investment in prevention and preparedness is a key first step, and this survey may help business owners justify the spend.
Even if you can invest, the question remains how to spend your prevention dollars. Better monitoring? More geographically dispersed sites? Active-active data center replication? Rewriting your application for Recovery-Oriented Computing? As with all decisions of this sort, it depends.
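On the monitoring front, even a small investment goes a long way. As an illustration only, here is a minimal sketch of a health check that retries before declaring a service down, so a single transient blip doesn’t page anyone; the function and parameter names are hypothetical, not from any particular monitoring product.

```python
import time

def check_with_retries(probe, retries=3, delay_s=0.5):
    """Run a health probe up to `retries` times.

    `probe` is any zero-argument callable returning True when the
    service looks healthy. A raised exception counts as a failed
    attempt. Returns True on the first success, False if every
    attempt fails.
    """
    for attempt in range(retries):
        try:
            if probe():
                return True
        except Exception:
            pass  # treat probe errors the same as an unhealthy response
        if attempt < retries - 1:
            time.sleep(delay_s)  # brief pause before retrying
    return False
```

In practice the probe would be an HTTP request to a health endpoint, and a sustained failure would feed an alerting system; the retry-before-alert pattern is what keeps monitoring useful rather than noisy.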
In my opinion, you should start by evaluating your weakest links and highest-likelihood failure points, then see what can be done inexpensively and quickly to shore them up. You may not have the expertise (or perspective) in-house to really evaluate this, so try to find a trusted third party that can look objectively at your system and organization.
In my experience, people make the real difference in running a service with rock-solid uptime and reliability. Make sure your team is always on the lookout for danger zones and generating ideas for running the service better. Continuous improvement is key, as the threats, priorities, and systems are always in flux.
I’d love to hear from you about what you’re doing to prevent, prepare for, and handle software disasters. What advice do you have for fellow IT professionals?