- Mehmet Baris Kalkar
Expectations for software reliability and availability are at an all-time high. With money and prestige on the line, every application and website wants to avoid downtime while building new things as fast as possible. However, neither perfect infrastructure nor perfect software exists, so incidents do happen. What we can do is minimize their impact by responding quickly and efficiently. This makes on-call rotations a part of our lives.
If structured and compensated correctly, on-call rotations can noticeably improve both the overall quality of incident handling and the response time. However, being on call is not always a positive experience. One common reason for an unpleasant on-call shift is getting too many alerts.
Alarm fatigue or alert fatigue describes how busy workers become desensitized to safety alerts, and as a result ignore or fail to respond to such warnings.
This term did not originate in software, but it applies to many industries and describes the problem well. It is dangerous for both developers and companies because:
- when developers start ignoring all alerts, they also miss critical ones or respond to them late
- their productivity suffers
- they are most likely dissatisfied if they are waking up in the middle of the night for alerts multiple times a week
- they are more likely to burn themselves out
How to avoid it?
Create Actionable Alerts
Rule #1 in creating effective monitors is having actionable alerts. If you are not going to take any action when you get the alert, you might need to re-evaluate your thresholds or change what you are monitoring.
We collect a lot of metrics from our services, but that does not mean we should create a monitor for every single one of them. We should be deliberate when adding monitors and make sure every alert is useful.
Misconfigured monitoring is a common cause of operational overload. Paging alerts should be aligned with the symptoms that threaten a service’s SLOs. All paging alerts should also be actionable. Low-priority alerts that bother the on-call engineer every hour (or more frequently) disrupt productivity, and the fatigue such alerts induce can also cause serious alerts to be treated with less attention than necessary.
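One way to tie a paging alert to an SLO is to page on the error budget burn rate instead of on a raw metric. The sketch below is only an illustration: the function names, the 99.9% target and the threshold of 10 are assumptions chosen for the example, not values from any particular tool.

```python
def burn_rate(errors: int, requests: int, slo_target: float) -> float:
    """How fast the error budget is being consumed (1.0 = exactly on budget)."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1, e.g. 0.999")
    if requests == 0:
        return 0.0  # no traffic, so nothing is burning the budget
    error_budget = 1.0 - slo_target  # fraction of requests allowed to fail
    return (errors / requests) / error_budget


def should_page(errors: int, requests: int,
                slo_target: float = 0.999, threshold: float = 10.0) -> bool:
    """Page only when the budget burns much faster than the SLO allows.

    The default target and threshold are assumptions for this example.
    """
    return burn_rate(errors, requests, slo_target) >= threshold
```

With these assumed defaults, 200 failures out of 10,000 requests would page, while 5 failures out of 10,000 would not, because the second case stays within the error budget.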
Delay Non-Urgent Notifications
We really do not need to fix every single issue the moment we notice it. If we know in advance that something can wait until the morning, we should delay the notification until working hours. Examples include problems in non-production environments, temporarily longer response times from a non-critical service, or any other trivial issue.
Deploying changes at 3 AM, on my own, without any reviewers, stresses me out just thinking about it.
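The delay rule above can be sketched as a small function that decides when a notification should be delivered. This is a minimal sketch under stated assumptions: the working hours (09:00 to 18:00, Monday to Friday), the function name and the `urgent` flag are all hypothetical, and the datetimes are assumed to be in the on-call engineer's local time.

```python
from datetime import datetime, time, timedelta

# Assumed working hours; adjust to your team's schedule.
WORK_START = time(9, 0)
WORK_END = time(18, 0)


def next_delivery_time(raised_at: datetime, urgent: bool) -> datetime:
    """Return when a notification should reach the on-call engineer."""
    if urgent:
        return raised_at  # urgent alerts are never delayed

    is_weekday = raised_at.weekday() < 5  # Monday=0 ... Friday=4
    if is_weekday and WORK_START <= raised_at.time() < WORK_END:
        return raised_at  # already inside working hours

    # Otherwise hold the notification until the next working day starts.
    candidate = raised_at.replace(
        hour=WORK_START.hour, minute=WORK_START.minute, second=0, microsecond=0
    )
    if candidate <= raised_at:
        candidate += timedelta(days=1)  # today's start has already passed
    while candidate.weekday() >= 5:
        candidate += timedelta(days=1)  # skip Saturday and Sunday
    return candidate
```

Under these assumptions, a non-urgent alert raised on a Friday evening is held until Monday morning, while an urgent alert raised at the same time is delivered immediately.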
Set Up Priorities
Not every alert has the same priority. We can have many alerts and monitors, but only the critical ones should wake us up or call us. Others can be plain push notifications, or informational alerts that notify no one and close automatically if the situation does not get worse. We can also combine different priorities with escalations and notification settings.
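As an illustration, the mapping from priority to notification behavior could look like the sketch below. The three priority levels and the channel names are hypothetical; real paging tools define their own.

```python
from enum import Enum


class Priority(Enum):
    CRITICAL = "critical"
    WARNING = "warning"
    INFO = "info"


# Hypothetical channel names; map them to whatever your paging tool offers.
NOTIFICATION_CHANNELS = {
    Priority.CRITICAL: ["phone_call", "push"],  # allowed to wake someone up
    Priority.WARNING: ["push"],                 # visible, but no phone call
    Priority.INFO: [],                          # recorded without notifying
}


def channels_for(priority: Priority) -> list[str]:
    """Return the channels used to notify the on-call engineer."""
    return list(NOTIFICATION_CHANNELS[priority])


def should_auto_close(priority: Priority, got_worse: bool) -> bool:
    """Informational alerts close on their own unless the situation worsens."""
    return priority is Priority.INFO and not got_worse
```

Keeping the mapping in one table makes it easy to review which alerts are allowed to call someone in the middle of the night.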
Owners Should Be On-Call for Their Services
Being on call was traditionally a sysadmin/ops task, but now it is the norm for developers to be on call for the services they maintain. Ideally, we should be getting alerts only from services we own.
This structure helps in multiple ways:
- since you have more expertise, you are more likely to know what to do
- you do not need to ask other people questions, so you can act quickly when a situation arises
- you feel more responsible for your service
- you can analyze the impact and priority of problems better
- you are more motivated to fix recurring issues
This depends on the business domain and the company, but we might also consider having on-call rotations for non-engineering teams such as business or marketing.
In some situations, being on call for more than one service might make sense. Not having enough experienced engineers in a team could be one reason for this, because being on call every other week is an even larger problem.
Create Incident Runbooks
Having incident runbooks is a good way to reduce the stress associated with getting paged. Instead of having to come up with a solution on the spot, we can follow a list of steps. Creating these runbooks also helps ensure that our alerts are meaningful: if we cannot think of anything to do when we get an alert, the alert might be unnecessary.
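The link between runbooks and meaningful alerts can be turned into a simple check: list every alert that has no runbook steps and treat it as a candidate for removal or rework. The alert names and the dictionary layout below are made up for this sketch; in practice the definitions would come from your monitoring configuration.

```python
# Hypothetical alert definitions used only for illustration.
ALERTS = [
    {
        "name": "checkout-error-rate",
        "runbook": [
            "Check whether a deploy went out shortly before the alert",
            "Roll back if the errors started after that deploy",
        ],
    },
    {"name": "disk-usage-high", "runbook": []},  # runbook exists but is empty
    {"name": "cpu-above-50-percent"},            # no runbook at all
]


def alerts_without_runbook(alerts: list[dict]) -> list[str]:
    """Return the names of alerts that have no runbook steps."""
    return [alert["name"] for alert in alerts if not alert.get("runbook")]


# Alerts listed here are candidates for removal or for a new runbook.
print(alerts_without_runbook(ALERTS))
```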
Have a good escalation plan
There might be extraordinary cases where we are not available to respond to an alert. Alternatively, we might need help from our teammates to solve a certain issue. In both cases, it should already be clear which person or team we turn to for help.
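An escalation plan can be written down as an ordered list of contacts, each with the time an alert may stay unacknowledged before that contact is paged. The roles and waiting times below are assumptions for the sake of the example, not a recommendation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationStep:
    contact: str
    after_minutes: int  # minutes unacknowledged before this contact is paged


# Hypothetical policy; roles and delays depend on the team.
POLICY = [
    EscalationStep("primary on-call", 0),
    EscalationStep("secondary on-call", 15),
    EscalationStep("engineering manager", 30),
]


def contacts_to_page(policy: list[EscalationStep],
                     minutes_unacknowledged: int) -> list[str]:
    """Return everyone who should have been paged by now, in order."""
    return [
        step.contact
        for step in policy
        if step.after_minutes <= minutes_unacknowledged
    ]
```

With this assumed policy, an alert that has gone unacknowledged for 20 minutes has reached the primary and the secondary on-call, but not yet the manager.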
Be comfortable asking for help
Asking for help should feel natural during incidents. On-call rotations exist so that not everyone is pinged for every single alert, but that does not mean the on-call engineer has to work alone. A second person is useful in most incidents, because having someone to bounce ideas off, or simply an extra set of eyes, helps a lot while solving problems. If the issue actually needs everyone in the team to work on it, they should. This is also a great way to share expertise within teams, because incidents are a great source of experience for everyone involved.
Have a clean alerts dashboard
A clean alerts dashboard is a good indicator of a team that is on top of its alerts and monitors. If you have dozens of open alerts all the time, you might be missing important alerts or ignoring some problems with your service. If an empty dashboard is the norm, every alert feels important.