Scott Hamlin is a local business owner and IT professional who has been serving IT customers since the early 1990s. I recently had the opportunity to ask him a few questions about PacketDrivers and his views on improving software resiliency.
The full interview follows:
Can you give some background about yourself and your company?
I started PacketDrivers in 1997 with my business partner Bob Benshoof. We originally created the company with the intent of providing “network” services to small and medium-sized businesses. At the time it was very technically focused, hence the name “PacketDrivers”: we drove the packets on your network.
As our company matured we began to see our role as a trusted advisor to our client base and realized that, in truth, we were uniquely positioned not only to give “business” advice as it related to technology, but to be the IT department for those same organizations. We evolved to where we are today, primarily playing the role of the IT department in a fixed-fee IT outsourcing arrangement.
What’s your job really like? How do you spend your time?
We are still a relatively small organization of 13 people. Although I spend time thinking strategically about our business, I still serve in a CIO-type role with many of our clients. In our monthly or quarterly business reviews with clients we discuss their primary opportunities and challenges, not only direct IT issues but overall business issues. This allows me to provide direct input on how IT solutions may help with those challenges and enhance the opportunities, and in some cases to advise that IT is not the best or most cost-effective solution.
What advice do you have for IT managers or business owners who want to improve availability and resiliency?
The most important step is to identify the systems and applications the entire organization relies on to produce its product or service, and to determine the cost of downtime for each. Although most organizations continue to increase their reliance on IT systems, there are still plenty that produce much of their actual product or service without direct reliance on IT systems.
It’s easy to falsely assume that the biggest systems and applications are the ones that will cost the most, in both real dollars and opportunity costs, if they fail, but that is not always the case. Assess the business impact first; then you can focus your preparedness on handling outages and disasters.
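The ranking idea above can be sketched in a few lines of Python. The system names and hourly cost figures below are invented purely for illustration; a real business impact analysis would gather these numbers from the business units themselves.

```python
# Hypothetical sketch: rank systems by the cost of downtime, not by size.
# Names and dollar figures are made up for illustration only.

systems = [
    {"name": "ERP cluster",    "hourly_downtime_cost": 500},
    {"name": "E-mail server",  "hourly_downtime_cost": 200},
    {"name": "Plant-floor PC", "hourly_downtime_cost": 4000},
]

# Sort by business impact: the "small" plant-floor PC can outrank
# the "big" central systems once downtime cost is the measure.
by_impact = sorted(systems,
                   key=lambda s: s["hourly_downtime_cost"],
                   reverse=True)

for s in by_impact:
    print(f"{s['name']}: ${s['hourly_downtime_cost']}/hour at risk")
```

Even this toy version makes the point of the paragraph: preparedness effort should follow the impact ranking, which often does not match the size of the system.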
Another common mistake is to plan for what we think of as a traditional disaster, when in fact most outages and downtime are related to logical problems with upgrades to operating systems, applications, etc. If your data center burns down, it’s easy to walk in, determine exactly what the problem is, and create a plan for recovery.
It is much more difficult when users in the organization start reporting that data coming from remote retail stores into the central system is causing distribution management to suddenly make no sense, and the organization cannot ship product. Where is the real problem? Is it the scanners at the retail stores, the data coming from the registers, an export/import process from the local store databases, database corruption?
Can you give an example of being prepared?
An excellent example of properly identifying the systems that represent the greatest risk exposure can be seen in a manufacturing company we have worked with. The organization has a manufacturing system that manages much of its operations. They use virtualization with failover options, system backups, redundancy in many places, UPSs, etc., and have spent some time thinking about those issues. They also have a single PC out in the manufacturing plant (a location that is hot, dirty, etc.) that is used to do calculations for melted steel.
It turns out that if you start the process and suddenly cannot use the PC to control it, tens of thousands of real dollars (for electricity, material, etc.) are completely wasted. Conversely, most of the manufacturing process can continue even if the entire server room were to burn down. In this case there was an actual failure of the PC during the process (in the middle of the night) and the process had to be aborted, costing the organization plenty.
The solution was not only to have a spare completely configured with all of the necessary software, but also to train the staff to swap the machine if necessary. A quarterly test and swap of the workstation, along with the necessary training, was also implemented. This naturally led to an evaluation of other single points of system failure in the plant and recovery plans for each.
What’s the toughest part of making software systems resilient to failure?
Software is much more difficult to manage in terms of resiliency and failure than hardware. If a hard drive fails, we know exactly what happened and what we need to do. Applications, on the other hand, can be considered more of an art than a science. When a hard drive fails, 99.9% of the time you put a new hard drive in the system and it is completely functional. Restoring the system may be more complex, particularly if you have to move to a new server.
Drivers may be incorrect, and how the new server connects to the infrastructure to receive input from other systems, among other factors, makes that portion of the recovery much more difficult. Applications today are often integrated with many other applications to exchange data. What if some of the data is corrupt? How do you identify the actual problem? When data flows into an application from other applications, how do you ensure that the other components of the application can function without that input?
What have you learned in the last year that’s made your customers’ systems more reliable?
When we are strictly talking about the applications themselves, the most important factor is great documentation on all of the systems, and most importantly, documentation of how they are integrated. We have clients that have one important line-of-business application that essentially stands on its own.
On the other hand, we have clients that may have six separate applications, each relying on data from some or all of the others, some hosted at the client data center and some delivered as Software-as-a-Service (SaaS), such as SalesForce.com. In this case we have found it essential to have diagrams of the critical data-flow components that identify the points of failure that will not only affect a single application but have a cascading effect. This is a great tool for beginning the business impact analysis and eventually planning the recovery process.
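The cascading-effect analysis described above can be approximated in code once the data-flow diagram exists: model each "A feeds data to B" arrow as a graph edge and walk the graph from a failed component. The application names below are invented for illustration and are not the interviewee's actual clients' systems; this is a minimal sketch, not a full BIA tool.

```python
# Hypothetical sketch: model data-flow dependencies between applications
# and find everything impacted (directly or in cascade) by one failure.
# All application names are made up for illustration.

from collections import deque

# provider -> list of consumers that depend on its data
feeds = {
    "store_db":   ["import_job"],
    "import_job": ["erp"],
    "erp":        ["distribution", "reporting"],
    "salesforce": ["reporting"],
}

def cascading_impact(failed, feeds):
    """Return every application affected by the failure of `failed`,
    following the data-flow edges transitively (breadth-first)."""
    impacted, queue = set(), deque([failed])
    while queue:
        node = queue.popleft()
        for consumer in feeds.get(node, []):
            if consumer not in impacted:
                impacted.add(consumer)
                queue.append(consumer)
    return impacted

# A failure in the store database cascades through the import job
# and ERP all the way to distribution and reporting.
print(cascading_impact("store_db", feeds))
```

Components whose failure produces a large impacted set are exactly the cascading points of failure the diagrams are meant to expose, which makes a good starting list for the business impact analysis.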
If I am concerned about the resilience of my software systems, what should I do first?
Take the time to begin the process of a business impact analysis. There are plenty of resources online that at least provide an initial framework for a plan. That being said, historically the industry has focused on hardware systems and major disaster planning. Failures that do not represent a total failure or an actual physical disaster are much more common.
Do an analysis of the systems with an eye for more subtle failures such as data corruption, internal information disclosure, hackers, etc., and then test how well prepared the organization is for these scenarios – they are more likely to happen than a regional disaster such as an earthquake (an event that likely involves your customers, who will also be preoccupied with their own issues anyway). If you are not comfortable doing this yourself, consider looking for outside expertise to provide assistance.