“The more you sweat in practice, the less you bleed in battle.” – Chinese Proverb.
This is one of my favorite quotes, and I think it’s very fitting for how to think about disaster recovery preparedness and business continuity. I remember many times in my career, especially running online services, when I wasn’t prepared. We were (metaphorically) out of shape, lazy, and preoccupied with other things. Then a failure would hit, customers would be affected, everyone would scramble, and everyone would wish they had spent more time preparing for this situation.
Here are my top 7 components of a disaster recovery plan:
1. Contacts & Escalation Process. This seems like a really basic concept, but being able to find the right people during non-business hours requires some thought and planning. Will you have a rotation? Do you have a pager? You want to avoid pulling in people at random and instead reach the right person the first time, but you also need multiple contacts for the very real case where the right person can’t be reached.
2. Diagnosis Guide. What should I do to figure out what happened? Where should I start looking? What are possible false alarms? This will help focus investigation efforts toward the root cause and avoid treating symptoms. You will need to think about how to make your system provide actionable and signal-rich information for faster diagnosis and resolution.
3. Disaster Scenarios. Enumerate likely (and some unlikely) scenarios based on how you need to respond. For example, a Denial of Service Attack is very different from Data Center Flooding. As you think through these scenarios, you’ll have to imagine how the team would handle each one. Even better, simulate them! In each scenario, make sure you include any findings from past experience. Look for how you can make your system more resilient, more diagnosable, and easier to fix.
4. Whole-System Redeployment. If you needed to bring up a whole new system from scratch on new infrastructure, how would you do it? This will give your team ideas on how to bring it up faster through automation & redundancy.
5. Customer Communications. How will you communicate with customers and what are the guidelines for this communication? How you handle a difficult situation will define your brand to your current and future customers, so you want to avoid the blame game, silence, and confusion that can occur if you try to figure it out in the heat of battle.
6. Data Recovery Process. How do I minimize the loss of data while fixing the system or redeploying? How do I make sure I have the latest and greatest backup? Losing customer data is not acceptable and creates a huge reputation risk. This must be handled with extreme care.
7. Routine Updates and Simulation! I believe this is the most important step. If you aren’t vigilant in running through your plan and improving it and your systems, you will get caught off guard when a dramatic failure happens. This is too important to be left to chance. Dedicate time & money to being prepared!
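To make the first item a bit more concrete, here is a minimal Python sketch of a weekly on-call rotation with an escalation chain. Everything in it is an assumption for illustration: the Contact type, the on_call_order and escalate functions, and the try_page callback (which stands in for whatever paging or phone system you actually use) are hypothetical names, not part of any real tool.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence


@dataclass(frozen=True)
class Contact:
    """Hypothetical contact record; a real plan would carry more fields."""
    name: str
    phone: str


def on_call_order(rotation: Sequence[Contact], week: int) -> List[Contact]:
    """Return the escalation order for a given week.

    The primary contact shifts by one position each week; everyone else
    follows in order as backups, so there is always more than one person
    to try.
    """
    if not rotation:
        return []
    start = week % len(rotation)
    return list(rotation[start:]) + list(rotation[:start])


def escalate(
    order: Sequence[Contact],
    try_page: Callable[[Contact], bool],
) -> Optional[Contact]:
    """Page each contact in order until one acknowledges.

    try_page is an assumed callback that returns True when the contact
    acknowledges. Returns the contact who answered, or None if nobody
    could be reached.
    """
    for contact in order:
        if try_page(contact):
            return contact
    return None
```

Calling escalate(on_call_order(team, week), try_page) walks the list from that week's primary down through the backups. A return value of None is the case worth planning for explicitly: it means nobody on the list could be reached.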
I hope this helps you think through your preparedness & prevention strategy. What’s in your disaster recovery plan? What have you done that’s helped you be more prepared?