Alerting
So we have all the metrics to provide us with some insight into how the system is performing. We've got spectacular graphs and gauges in Grafana, but we're hardly going to sit watching them all day to see if something happens. It's time to add alerting to the solution.
What Is an Alert?
An alert is an event that is generated when a measurement (observed or calculated) is about to breach, or has already breached, some threshold. The following are some examples of alerts:
- The average system response time in the last five minutes goes above 100 milliseconds.
- The number of currently active users on the site falls below a certain threshold.
- Application memory usage is approaching its maximum limit.
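To make the idea concrete, here is a minimal sketch of how conditions like these could be evaluated in code. It is an illustration only, not how Grafana or any particular alerting tool works: the AlertRule class, the evaluate function, the threshold values (100 ms, 50 users, 90% of the memory limit) and the sample readings are all assumptions made up for this example.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Optional


@dataclass
class AlertRule:
    """A hypothetical alert rule: a name plus a threshold check."""
    name: str
    # Returns True when the measurement has breached (or is about to
    # breach) the threshold.
    condition: Callable[[float], bool]


def evaluate(rule: AlertRule, value: float) -> Optional[str]:
    """Return an alert message if the rule fires for this value, else None."""
    if rule.condition(value):
        return f"ALERT {rule.name}: value={value}"
    return None


# Assumed thresholds that mirror the three examples above.
rules = {
    "response_time": AlertRule("avg response time > 100 ms", lambda v: v > 100.0),
    "active_users": AlertRule("active users < 50", lambda v: v < 50),
    "memory": AlertRule("memory usage >= 90% of limit", lambda v: v >= 0.90),
}

# Made-up response times (ms) observed during the last five minutes.
response_times_ms = [80.0, 120.0, 130.0]
avg_response = mean(response_times_ms)  # a calculated measurement

alerts = [
    evaluate(rules["response_time"], avg_response),  # calculated value
    evaluate(rules["active_users"], 75),             # observed value
    evaluate(rules["memory"], 0.93),                 # fraction of the limit
]

# Keep only the rules that actually fired.
fired = [a for a in alerts if a is not None]
for message in fired:
    print(message)
```

With these sample readings, the response time and memory rules fire, while the active users rule does not. Note that the memory rule fires before the limit is actually reached, which is the "about to be breached" case from the definition.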
Alerts usually result in notifications being sent to human operators, for example by email or instant message. Notifications can also be sent to trigger automation scripts or processes that deal with the alert. Service owners can analyze their...