Proactive Server Monitoring Pitfalls

Svet Stefanov

The following is a guest post by Svet Stefanov. He is a freelance writer and
a contributing author to WebSitePulse’s blog. Svet is currently exploring the topic of proactive
monitoring and how it can help small and big businesses steer clear of shallow waters.

The first step in fixing a problem is admitting you have one. Fixing problems in software systems is no different. Server issues are not a thing of the past, and as the Cloud continues to grow and mesh further into our lives, these problems are not going away anytime soon. Monitoring different types of servers, SaaS offerings, or even a single website is not only good practice but a long-term investment. If you have the time, I would like to share my thoughts on why it is better to be prepared for the worst than to blissfully ignore it.

I’ve heard people say that monitoring is the first line of defense. In my opinion, the first line of defense is a great recovery procedure and built-in redundancy. Adequate monitoring systems have two main functions – to detect a problem and to alert the concerned parties. More advanced systems, with extensions and more complex configuration, can also take action without human intervention based on a predefined set of rules. But even a few small investments in monitoring can yield great improvements in reliability.
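
To make the detect / alert / act-on-rules split concrete, here is a minimal sketch of such a loop in Python. The disk-space check, the cleanup "action", and the 60-second interval are illustrative assumptions of mine, not the configuration of any particular monitoring product.

```python
# Minimal sketch of the detect -> alert -> act pattern described above.
# The check, the automatic action, and the interval are illustrative only.
import shutil
import time

def disk_has_space(path="/", min_free_gb=5):
    # Detect: is there still a reasonable amount of free disk space?
    return shutil.disk_usage(path).free / 1e9 >= min_free_gb

def alert(name):
    # Alert: a real system would page on-call staff or open a ticket.
    print(f"ALERT: check '{name}' failed")

def prune_old_logs():
    # Act: placeholder for a predefined automatic remediation (e.g. log rotation).
    print("running automatic cleanup...")

RULES = [("low disk space", disk_has_space, prune_old_logs)]

if __name__ == "__main__":
    while True:
        for name, check, action in RULES:
            if not check():
                alert(name)
                action()
        time.sleep(60)
```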

Internal or external monitoring?
Server performance can be monitored both internally and externally. Internal (in-house) IT infrastructure monitoring involves keeping an eye on resource utilization, hardware health, and overall local connectivity. Through different tools, scripts, and the deployment of end-to-end, server-hosted monitoring systems, your network-attached devices report on their current state and overall well-being. This type of monitoring happens within a company’s ecosystem, providing all the essential data for troubleshooting and for preventing large-scale disasters.
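
As an illustration of what "keeping an eye on resource utilization" can look like at its simplest, here is a short Python sketch that reports CPU, memory, and disk usage. It assumes the third-party psutil package is installed, and the warning thresholds are made up for the example.

```python
# Minimal sketch of in-house resource monitoring; thresholds are illustrative.
import psutil  # third-party package: pip install psutil

THRESHOLDS = {"cpu": 90.0, "memory": 85.0, "disk": 90.0}  # percent

def collect():
    return {
        "cpu": psutil.cpu_percent(interval=1),      # CPU usage over one second
        "memory": psutil.virtual_memory().percent,  # RAM in use
        "disk": psutil.disk_usage("/").percent,     # root filesystem usage
    }

if __name__ == "__main__":
    for metric, value in collect().items():
        status = "WARN" if value >= THRESHOLDS[metric] else "ok"
        print(f"{status:4} {metric}: {value:.1f}%")
```

In practice a script like this would feed a time-series database or a monitoring agent rather than printing to stdout, but the idea is the same: the device reports on its own state.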

External monitoring involves tracking the availability and performance of your resources from different network locations. Monitoring hardware by running scheduled tests from multiple geographical locations is an effective way to be sure that your business is reachable from the locations you care about. Apart from giving you information about the overall accessibility of your resources, external monitoring can also help you plan ahead and make educated decisions. One good example would be a new market you are trying to gain ground in. If it takes a considerable amount of time for people to reach your resources, server co-location might be worth considering. A small investment in remote performance monitoring can help you decide whether or not to spend thousands of dollars.
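
External checks boil down to timing a request the way a visitor would. Below is a minimal, standard-library-only sketch of a single probe; a real external service would run it on a schedule from several geographic vantage points. The URL and the "slow" threshold are placeholders.

```python
# Minimal sketch of an external availability/response-time probe.
import time
import urllib.request

URL = "https://example.com/"   # placeholder target
SLOW_SECONDS = 3.0             # illustrative threshold

def probe(url):
    start = time.monotonic()
    with urllib.request.urlopen(url, timeout=10) as resp:
        resp.read()  # download the body, as a browser would
        return resp.status, time.monotonic() - start

if __name__ == "__main__":
    status, elapsed = probe(URL)
    flag = "SLOW" if elapsed > SLOW_SECONDS else "ok"
    print(f"{flag} HTTP {status} in {elapsed:.2f}s")
```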

Going for a combination of both is usually the best course of action, but in my experience this rarely happens until it is too late. While even small companies are equipped with (often rudimentary) tools for in-house performance monitoring, external services are typically employed only after problems have occurred.

Some of the common problems IT staff face with their servers are high CPU demand, heavy I/O usage, running out of memory, long-running processes, load spikes, and connectivity issues. All of these problems can be detected locally. This does not mean remote monitoring of network-attached devices is meaningless: connectivity issues and end-user experience simulation are where external monitoring steps in.

With server-based monitoring, problems are usually located with customized scripts, multiple tools, and the help of logs. If not well optimized, scripts running as cron jobs can cost a lot of CPU time simply by performing checks every minute. In such situations, the means of tracking defeat the purpose, because precious CPU time is diverted to system checks instead of serving users. Environments limited by resources and staff need to make a hard choice – invest in additional hardware and human hours, or opt for external monitoring as a supplement.
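
One way to keep in-house checks cheap is to read counters the kernel already maintains instead of spawning heavier tooling on every run. The sketch below is Linux-specific and reads /proc/loadavg once per invocation; the threshold and the alert are placeholders of mine.

```python
#!/usr/bin/env python3
# Minimal sketch of a cheap cron-driven check (Linux): read /proc/loadavg
# and only raise a flag when a threshold is crossed.
LOAD_THRESHOLD = 8.0  # illustrative; tune to the host's core count

def one_minute_load():
    with open("/proc/loadavg") as f:
        return float(f.read().split()[0])

if __name__ == "__main__":
    load = one_minute_load()
    if load >= LOAD_THRESHOLD:
        # Placeholder alert; a real setup would mail or page someone.
        print(f"ALERT: 1-minute load average {load:.2f} exceeds {LOAD_THRESHOLD}")
```

Scheduled from cron every few minutes (for example, */5 * * * *), a check like this costs next to nothing compared with re-running a full diagnostic suite every minute.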

Be the first to know

When clients start submitting tickets and the phones start to ring, it is already too late. Early detection is possible through strict monitoring practices. An increasing response time measured from more than one geographic location is a definite sign that your server is not coping with demand. With prompt action, this won’t grow into extended downtime.
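
As a rough illustration of that rule, the sketch below compares the latest probe against the median of the previous few and flags a sustained slowdown. The sample response times, window size, and 50% factor are invented for the example.

```python
# Minimal sketch of an "increasing response time" rule.
from statistics import median

def is_degrading(samples, window=10, factor=1.5):
    """True if the newest response time is `factor` times the median
    of the previous `window` samples."""
    if len(samples) <= window:
        return False
    baseline = median(samples[-window - 1:-1])
    return samples[-1] > baseline * factor

# Example: response times in seconds from one monitoring location.
history = [0.42, 0.40, 0.45, 0.41, 0.43, 0.44, 0.40, 0.42, 0.46, 0.41, 0.95]
print(is_degrading(history))  # True: the last probe is well above the baseline
```

Run against feeds from several locations, a rule like this helps distinguish a single slow vantage point from a server that is genuinely struggling.
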
Improving how the current resources are organized can help a business skip the next hardware purchasing cycle. External monitoring systems can replace the need for additional hardware resources, but they are not a magic bullet for downtime prevention. My recent experience is the evidence.

When everything goes wrong

A website I worked on suffered from a hardware bottleneck for over a month (at least according to the limited server logs) before anyone noticed. The company had had several sites hosted on a single server for nearly two years. Over time, the demand for those sites grew. Due to a lack of communication, the IT team had no knowledge of, and therefore no time to prepare for, the upcoming high season. Once the issue was a fact, they had no time to scale out with additional hardware and had to optimize the resources at hand. The focus on the flagship site was so great that all concerned parties overlooked the company’s other websites. The main site was running fine, while the other sites were losing business due to poor performance. Preventing a single issue led to many other problems: website access times exceeded 15 seconds, way beyond any industry standard, and forms failed to load or submit content most of the time. None of this was visible to people within the company; everything was running fine as far as employees were concerned. The problem was first discovered when the overall number of leads got cut in half. As a general rule of thumb, IT staff should be aware of such issues before the marketing staff is.

Remote monitoring, or any monitoring tools for that matter, could have prevented the damage. In one of many conference calls, someone brought up the fact that remote monitoring was already in place, but that no issues had been detected so far. Much to my surprise, the guy on the other end of the line was completely right: they were tracking uptime, not performance. In this worst-case scenario, most of the things that could go wrong, in fact, did.

In order to resolve the problem with the resources at hand, we decided to go for some quick fixes. Here are a few of the things we did:

  • Load balancing – it turned out the IT folks had been working on this for quite a while, and we had to rush the implementation. It was a bit of a gamble, but a necessary one. This is tricky stuff, and we were lucky to have experienced staff on the task.
  • Server cache control – this was argued over for some time, but in the end implemented within one afternoon. It greatly reduced the load on the servers. All web forms were yet to undergo much-needed optimization, so we had no choice but to cache whatever we could.
  • Serving static content from a different host – we moved all static content to another server. Not physically; we had another type of webserver managing requests for static content (images, js, flv, audio, etc.). We used lighttpd, also widely used by Wikipedia and YouTube. This took some load off our main webserver (busy putting together web forms) and improved the speed just enough. A rough sketch of the idea follows this list.
  • Suspended all secondary operations – all managers were alerted to stop downloading reports from the company databases. Because it was the end of the month, all managing staff were knee-deep in sales reports, revenue calculations, and whatnot. The main CRM was also hosted on the main server, and stopping heavy queries improved our chances of serving customers at peak times.
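
To give a flavour of the cache-control and static-content items above, here is a hedged sketch in Python’s standard library: serve static assets from a separate lightweight process and tell clients and proxies to cache them. The team actually used lighttpd for this; the port and max-age below are assumptions of mine, not their configuration.

```python
# Minimal sketch: a separate lightweight process for static files that sets
# cache headers. A stand-in for the lighttpd setup described above.
from http.server import HTTPServer, SimpleHTTPRequestHandler

class CachingStaticHandler(SimpleHTTPRequestHandler):
    def end_headers(self):
        # Let browsers and intermediaries cache static assets for an hour.
        self.send_header("Cache-Control", "public, max-age=3600")
        super().end_headers()

if __name__ == "__main__":
    # Serves files from the current directory; the main webserver keeps
    # handling the dynamic pages and web forms.
    HTTPServer(("0.0.0.0", 8081), CachingStaticHandler).serve_forever()
```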

To make a long story short, after we sorted out a couple of expected collateral glitches, everything was running better than before. The system was flawed in many ways, but the biggest mistake was focusing all attention on a single issue, the well-being of one site, while neglecting the rest. This is one lesson the company had to learn the hard way.

Conclusion

Simply having monitoring systems is not going to protect the business if you don’t have a well-thought-out design and plan. Identifying the key performance indicators and monitoring them constantly is mandatory for great uptime. If you can find simple ways to get ahead of the problems that appear, you will hopefully be able to recover before any real damage has been done.
