Monitoring & Alerting done right
Alerting in the tech world is a controversial topic. Everyone agrees you should have automated alerts when things go wrong. Reality is a bit more disappointing: anything from too many or flaky alerts to none at all. The aim of this post is to give you a better idea of the ideal setup by showing you a better monitoring approach, without going into any specific tool.
Let's take a step back
First of all, let's take a look at the bigger picture and the terms. Monitoring watches system health and performance over time. Alerting notifies you when predefined thresholds or conditions are met, prompting action.
Real-world example
So let's get more concrete: you want more insight into your server.
To monitor it, you collect data using a script that periodically reads the current system resource usage and writes it to a database. Based on that data, you build dashboards so you can see graphs over time.
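A minimal sketch of such a collector, assuming the third-party psutil library is available and using a local SQLite file as a stand-in for whatever metrics database you actually run:

```python
import sqlite3
import time

import psutil  # third-party: pip install psutil

DB_PATH = "metrics.db"  # hypothetical local stand-in for your metrics store


def collect_once(conn: sqlite3.Connection) -> None:
    """Read the current resource usage and append it to the database."""
    now = time.time()
    cpu = psutil.cpu_percent(interval=1)   # % CPU over one second
    mem = psutil.virtual_memory().percent  # % RAM in use
    disk = psutil.disk_usage("/").percent  # % of the root filesystem used
    conn.execute(
        "INSERT INTO samples (ts, cpu, mem, disk) VALUES (?, ?, ?, ?)",
        (now, cpu, mem, disk),
    )
    conn.commit()


def main() -> None:
    conn = sqlite3.connect(DB_PATH)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS samples (ts REAL, cpu REAL, mem REAL, disk REAL)"
    )
    while True:
        collect_once(conn)
        time.sleep(60)  # one sample per minute


if __name__ == "__main__":
    main()
```

The important part is not the storage backend but that the collection loop runs unattended.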
In the past, your disk has run out of space. So you create an alert that fires when disk usage stays above 75% for two hours.
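That alert condition could be evaluated against the collected samples roughly like this (a sketch; the 75% threshold and the two-hour window are just the example values from above):

```python
import sqlite3
import time

DISK_THRESHOLD = 75.0         # percent, from the example above
WINDOW_SECONDS = 2 * 60 * 60  # two hours


def disk_alert_should_fire(conn: sqlite3.Connection) -> bool:
    """Fire only if *every* sample in the last two hours is above the threshold."""
    since = time.time() - WINDOW_SECONDS
    rows = conn.execute(
        "SELECT disk FROM samples WHERE ts >= ?", (since,)
    ).fetchall()
    if not rows:
        return False  # no data: a separate "missing data" alert is a better fit
    return all(disk > DISK_THRESHOLD for (disk,) in rows)
```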
The idea
You should notice every failure that results in a noticeable outage for the user or hints at a configuration issue. Automation is key here: it is not acceptable for a customer, or anyone else, to be the one telling you that your system is broken.
What is enough?
That's a tough question to answer: what is enough in terms of alerting and monitoring, and when do I overdo it? Some people will now have the urge to say, “There is no such thing as too many alerts!” I strongly disagree.
To answer this, you should first ask yourself a few crucial questions to find out what's right for you. So, let's go over them point by point.
What do I want to monitor?
Nowadays, you can collect data about almost everything. Depending on the ecosystem, there are tons of instrumentation libraries, exporters, and whatnot. Storage is getting cheaper by the day, so we tend to store everything we can get. And don't get me wrong, that is wonderful!
Track all the information you can get and can afford to store, and build alerts around the most important pieces. In the end, it is the combination of having data-backed insight into what happened and being notified when it happens.
In general, you should not start by measuring resource vitals. Start from the user's perspective: what do you expect the service to do, and in what way? Based on that, find metrics that reflect exactly those things.
Collect what you can; only alert on actual problems.
What is the least acceptable state?
And when I say the least acceptable state, I mean the absolute minimum. A very drastic example would be:
It's fine that my service uses plenty of resources, even when idle. As long as the service responds fast enough to the user, everything is fine.
In this case, the service can burn through almost all the resources it has available; as long as my homepage loads in < 1s, the least acceptable time for a user to still consider it fast, it is okay. Of course, that is not the ideal state, but it is only when this condition is no longer met that you need to act.
Of course, you would rather not be notified only when things are already on fire. So in this case, it is a good idea to set a realistic goal, such as: 95% of requests finish in under 700ms; if that's not the case, I need to take a look. This threshold should be based on measurements. What is your average response time? What is your response time under the highest load? Usually, the highest load should set the upper limit. That way, you make sure you don't get alerts for the regular spikes that happen anyway.
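One way to pick such a threshold is to derive it from the latency percentiles you actually measure during your highest-load period. A small sketch, with made-up numbers; the 700ms goal is just the example from above:

```python
import statistics


def p95(latencies_ms: list[float]) -> float:
    """95th percentile of measured request latencies in milliseconds."""
    # statistics.quantiles with n=20 yields the 5%, 10%, ..., 95% cut points.
    return statistics.quantiles(latencies_ms, n=20)[-1]


# Hypothetical measurements taken during the highest-load period of the day.
peak_load_latencies_ms = [120, 180, 250, 310, 420, 480, 530, 610, 660, 690]

threshold_ms = p95(peak_load_latencies_ms)
print(f"p95 under peak load: {threshold_ms:.0f} ms")
# Alert rule: "more than 5% of requests slower than ~700 ms" -> take a look.
```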
Who am I annoying with my alerts?
Nowadays in the DevOps world, the motto is “You build it, you run it.” Unfortunately, that is only partially true in most companies. So really consider who gets the on-call alert. Imagine someone being called out of bed at three in the morning, only to see an alert saying “oh yeah, the CPU load is really high, just FYI”. This person will hunt down the creator of that alert with a passion.
Being woken up by such an alert typically makes people really mad, and you had better hope you don't sit next to them in the office the following day! ;)
At the same time, it is important to keep in mind that alerts typically mean a high-stress situation. Things are not working as expected and might blow up (or already have). The responder wants to get things running again as quickly as possible.
Even if you are the person being paged, make sure your alert also contains information on how to approach the problem. This can be done with so-called runbooks (basically a step-by-step guide on what to do). More on those in detail later.
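What this looks like depends entirely on your alerting tool; purely as an illustration, an alert definition could carry the runbook as a link (all field names and the URL here are hypothetical):

```python
# Hypothetical alert definition: the exact structure depends on your tooling.
disk_alert = {
    "name": "DiskAlmostFull",
    "severity": "warning",
    "summary": "Disk usage above 75% for two hours on {{ host }}",
    "runbook_url": "https://wiki.example.com/runbooks/disk-almost-full",
}
```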
Putting together what is enough
So to sum up, you should ask yourself:
- What is the least acceptable state?
- Who am I annoying with my alerts?
- What do I want to monitor?
This should result in a compact list of anomalies you want to be notified about. Before you get technical, just collect them and see what's possible later on. Less is more here: rather have a few very meaningful alerts than a lot of noise.
How to monitor
I've talked a lot about alerts now, but how do I actually get all this data?
Keep costs in mind
As mentioned earlier, collect what you can and what makes sense to store. Storage is getting cheaper, but network traffic is still (relatively) expensive, and all the information you collect needs to be transferred, processed, and stored somewhere.
Many cloud providers even charge per metric, per alert, or both. So pay particular attention to the cost aspect. This applies to on-premises setups just as much as it does to every cloud provider.
Consolidate at one central place
Try to have as few systems storing information as possible. Having one central place saves plenty of questions about how to join data and where it is located.
Where applicable, invest some time in getting data out of all the systems you use, be it VMs, APIs, metrics, or specific pre-built exporters.
Automate all the things
Data collection should be automatic, something that just happens without you having to think about it too much.
Build dashboards
Based on your data, build tailored dashboards for the things you want insight into. This is especially important for giving context to the alerts you create.
For example, a slow system always has a reason. It could be host problems, application problems, or external infrastructure. Having this information ready to use and reason about is key.
How to alert
Now you have it all together: your list of alerts is ready, they keep your systems healthy, and your users are happy.
Avoid mails
Don't reinvent the wheel, and for heaven's sake, don't spam people with e-mails. There are wonderful incident management systems out there that do a good job. Alternatively, sending alerts to your chat platform might be the right approach for you. E-mails are unreliable and offer very limited possibilities for interactivity, formatting, and communication.
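As a sketch, forwarding an alert to a chat platform often boils down to a single webhook call. This assumes the third-party requests library; the webhook URL and the payload format are hypothetical and depend on your chat tool:

```python
import requests  # third-party: pip install requests

# Hypothetical incoming-webhook URL; the payload shape depends on your chat tool.
CHAT_WEBHOOK_URL = "https://chat.example.com/hooks/alerts"


def notify_chat(title: str, details: str, runbook_url: str) -> None:
    """Post a formatted alert message to the chat channel instead of sending an e-mail."""
    message = f"*{title}*\n{details}\nRunbook: {runbook_url}"
    response = requests.post(CHAT_WEBHOOK_URL, json={"text": message}, timeout=5)
    response.raise_for_status()
```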
Prevent Pager Fatigue by taking good care of your alerts
There is nothing worse than noisy or nonsense alerts. When people get a ton of nonsense or flaky alerts, they start treating them as low priority. When a genuinely high-priority alert then arrives, it is treated as “just another annoyance” and does not get the attention it needs.
There is no tool that can take this over completely for you. Take a look at flaky alerts and maintain them regularly! Ignoring alerts is not an option: an alert should be the exception, and it should always ring an alarm bell when one fires.
Absolute values are always a bad sign
Everything nowadays needs to scale, so relying on hard numbers is bad practice. Rely on relative values and percentages wherever possible, so that you don't have to update your alerts regularly depending on system load or the phase of the moon.
For example, it is bad practice to alert on a fixed number of events within a certain time frame, as that number changes with the amount of load. A ratio holds up much better, as sketched below.
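A sketch of the difference, with made-up numbers:

```python
# Absolute: breaks as soon as the traffic pattern changes.
def bad_alert(error_count: int) -> bool:
    return error_count > 500  # arbitrary hard number, needs constant tuning


# Relative: scales with the load automatically.
def better_alert(error_count: int, total_requests: int) -> bool:
    if total_requests == 0:
        return False  # nothing happened, nothing to alert on
    return error_count / total_requests > 0.05  # more than 5% of requests failing
```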
Write Runbooks
A runbook is a list of actions to take when an alert fires. It clearly explains why the alert was fired and what actions you can take to resolve it. Be explicit here, and update the runbook when things are missing. It's important to add missing information right away, as soon as you notice it. If it's the middle of the night, at least write a note for the next (work) day.
Depending on the actions to perform, you might also consider automating some of them using webhooks, scripts, etc. While this does not work in all cases, it is a nice thing that saves time, keeps headspace free for the actual problem, and avoids human error.
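As an illustration, an automated first step from a “disk almost full” runbook could look like this (the cleanup action is purely an example, not a recommendation for your system):

```python
import subprocess


def rotate_and_compress_logs() -> None:
    """Free up space by forcing a log rotation before a human even looks at it."""
    subprocess.run(["logrotate", "--force", "/etc/logrotate.conf"], check=True)


def handle_disk_alert() -> None:
    rotate_and_compress_logs()
    # If the disk keeps filling up, the person on call takes over with the runbook.
```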
It is important not to confuse runbooks and playbooks. A playbook focuses on the bigger picture, like the overall incident process, and can involve multiple people. A runbook is focused on a single process, for a single person. So write it accordingly, so that one person can work through it alone, without having to involve others. It is also always good to assume that the reader does not know every single detail of the affected system / component.
Where do I begin?
That's a ton of stuff to consider. Start slow and introduce monitoring and alerts incrementally where they make the most sense. Go from there, gradually ramp up the things you collect, and extend alerting to more systems.
The essential thing is to start. Rinse and repeat.
I already have alerts in place, but I am not happy!
This is something I hear quite frequently. Most commonly, the issue is too many alerts and too much noise: the pager fatigue mentioned above. But sometimes significant outages are only reported by users, which is typically rather embarrassing.
In either case, take a step back. Find the actual problems and fix them. Spoiler: disabling alerts or setting the threshold to an “absurdly high number” is not the solution.
I have no infrastructure at all for monitoring yet
Before you jump on the first SaaS tool you can find, carefully consider what you want it to do and what you want to monitor. Also consider what your systems can export most easily. The easier it is to get the data out, the better. Pick the system that stores your collected data based on that.
An important aspect here is custom integrations: when a system does not provide information out of the box, you might have to build something yourself. It is very likely that someone has already had to go through this, so if there is a lot of pre-built tooling out there for your system of choice, it might be a suitable solution to go with.
I am curious — What is your current approach to monitoring?
Is there something holding you back? What is your approach, and which aspects do you find most important?
You are welcome to drop me a mail, write on LinkedIn or drop an old-fashioned comment here.