It’s simple: if you have false alarms, people will eventually ignore them.
When you run a high-scale online service, it’s standard practice to have monitors and alerts watching the system so you can quickly find problems as they happen. Managing problems by exception is key to scaling a service, because it lets you keep an eye on much more infrastructure with fewer people. This is great in theory, but not all monitoring is created equal.
Regardless of what technology provider you use, you have to spend some energy on designing your monitoring system. Optimizing your monitoring will allow your team to focus on real problems with your service to keep it humming.
Infrastructure vs. Application
When you think about your whole system, you have a stack of components, each with different problems and remedies. You likely have third parties that are responsible for certain components in the stack (like physical hardware or network infrastructure). Your application, on the other hand, is likely one of a kind, and you are the one with the expertise on how it really works.
When designing the monitoring system, it’s important to consider the various layers in your stack and how best to inspect each independently. For example, monitoring for hard drive failures, out-of-memory conditions, etc. is a good job for the infrastructure provider. If you’re in the cloud (through a public cloud provider or hoster), you likely don’t even have visibility into these kinds of issues anyway.
But your infrastructure may be running perfectly while your app is suffering. The best approach is to add application-level tracing and alerting that tells you about problems in the language of your software domain, ideally from the customer’s point of view. Knowing whenever a customer sees an error screen, for example, might be useful (especially with a threshold where a certain number of similar alerts triggers a higher-order “meta-lert”).
This data can be incredibly useful not just for troubleshooting the live site, but also for improving your software over time. It can give you better “telemetry” on your users, especially in helping find customer pain.
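As a rough illustration of the threshold idea, here is a minimal sketch of a sliding-window counter that raises one “meta-lert” when enough similar alerts arrive close together. The class name, the alert key, and the threshold and window values are all made up for the example; a real system would tune these and would likely lean on its monitoring tool’s own aggregation features.

```python
from collections import defaultdict, deque


class MetaAlertAggregator:
    """Raise one higher-order alert when many similar alerts arrive close together."""

    def __init__(self, threshold=5, window_seconds=60.0):
        # Both values are illustrative assumptions, not recommendations.
        self.threshold = threshold
        self.window_seconds = window_seconds
        self._events = defaultdict(deque)  # alert key -> recent timestamps

    def record(self, key, timestamp):
        """Record one alert; return True if it should trigger a meta-alert."""
        events = self._events[key]
        events.append(timestamp)
        # Drop timestamps that have fallen out of the sliding window.
        while events and timestamp - events[0] > self.window_seconds:
            events.popleft()
        if len(events) >= self.threshold:
            events.clear()  # reset so one burst produces a single meta-alert
            return True
        return False


# Three customer error screens within a minute trip the meta-alert;
# a lone error much later does not.
aggregator = MetaAlertAggregator(threshold=3, window_seconds=60.0)
results = [
    aggregator.record("checkout-error-screen", t)
    for t in (0.0, 10.0, 20.0, 200.0)
]
```

Grouping by a key such as the screen or error type is what makes the alerts “similar”; picking that key is a judgment call about your own application.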
Less noise, more action
One key to monitoring success is to make your exceptions, errors, & alerts actionable. This means having error messages & codes that tell you not just the symptom, but the cause of the issue, and ideally what to do about it.
For example, consider when a web server runs out of memory. An error message like “Exception: Out of memory” may seem logical, but it’s much more actionable to say something like “Exception: Server ABC ran out of memory during page load. This has only happened during denial of service attacks or when external service XYZ is down. Click here to see the knowledge base article.”
You can only build an error message like this by going back and updating it as you learn more about running your service. Better yet, have the message link to your internal knowledge base, so new causes can be recorded there without updating your code. Either way, you now have an idea of what to do: the server running out of memory is the symptom, and you have 2 possible causes to investigate first, since a repeat of an earlier problem is the most likely case.
The goal is for the vast majority of alerts to require real attention, so that all alerts are taken seriously. This will force you to get rid of meaningless alerts and improve the intelligence of your monitoring system. People get numb to reams of meaningless data, and the real problems will be ignored along with the monitoring spam.
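One way this could look in practice is sketched below: the code that raises the alert only knows the symptom, and a separate lookup supplies the known causes and the knowledge base link. Everything here is hypothetical, including the `KNOWN_ISSUES` table, the error code, and the placeholder URL; the table stands in for whatever wiki or database your team actually uses.

```python
# Hypothetical lookup table. In practice this could live in a wiki or database
# so new causes can be added without redeploying the application.
KNOWN_ISSUES = {
    "OUT_OF_MEMORY": {
        "known_causes": [
            "denial of service attack",
            "external service XYZ is down",
        ],
        "kb_url": "https://kb.example.com/articles/out-of-memory",  # placeholder
    },
}


def build_alert_message(code, server, context):
    """Combine the symptom with known causes and a pointer to next steps."""
    message = f"Exception: {code} on server {server} during {context}."
    issue = KNOWN_ISSUES.get(code)
    if issue is None:
        # No history yet: say so, so the responder knows to document the cause.
        return message + " No known causes recorded yet."
    causes = " or ".join(issue["known_causes"])
    return f"{message} Previously seen during: {causes}. See {issue['kb_url']}"


alert_text = build_alert_message("OUT_OF_MEMORY", "ABC", "page load")
```

Keeping the causes outside the function means the person who resolves an incident can record what they learned without touching application code.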
For further reading, I highly recommend The Art of Scalability.
Here comes the math
It’s really simple to calculate the signal-to-noise ratio for your service. Your “signal” is the total number of trouble tickets + bugs + deployed quick fixes that were sourced from alerts, and your “noise” is the total number of alerts generated by your system, minus the signal. For extra credit you can break this down by “severity” of the issue.
So let’s say you had 200 trouble tickets, bugs, and fixes in the last month, and a total of 4000 alerts. That gives you a “noise” of 3800 and a “signal” of 200, or a noise/signal ratio of 19:1. Not so good. You’ll never eliminate the noise, but at least if you measure it you can see if you’re moving in the right direction.
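If you want to track this month over month, the arithmetic above fits in a few lines. This is just a sketch; the function name and its arguments are made up for the example, and the counts would come from your own ticketing and alerting systems.

```python
def noise_to_signal(signal, total_alerts):
    """Return (noise, noise/signal ratio) for one reporting period."""
    if signal <= 0:
        raise ValueError("signal must be positive to compute a ratio")
    if total_alerts < signal:
        raise ValueError("total alerts cannot be less than signal")
    noise = total_alerts - signal  # alerts that led to no ticket, bug, or fix
    return noise, noise / signal


# The numbers from the example above: 200 actionable items out of 4000 alerts.
noise, ratio = noise_to_signal(signal=200, total_alerts=4000)
print(f"noise={noise}, noise/signal={ratio:g}:1")  # noise=3800, noise/signal=19:1
```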
If you need to scale, you need a well-designed and optimized monitoring strategy. This is a combination of tools, process, and software that creates awareness of customer-facing errors, application and infrastructure problems, and canaries in the coal mine for major problems. Improving your monitoring can help you improve your uptime and create better value for customers.
What strategies do you use for making alerts actionable? How do you weed out the noise in your alerts? How do you find out about problems before your customers?