Doing by example – lightweight alert manager
Service monitoring and alerting are essential practices for any infrastructure. We use many tools to collect active and passive check results, metrics, and more. Once a problem has been detected, the owner of the broken system must be alerted. The solution we are designing sends alerts for issues detected by any monitoring system. These issues are called incidents, and each one needs to be acknowledged by an on-call engineer, who is responsible for managing the service’s infrastructure. If the on-call engineer fails to acknowledge the incident (and subsequently work on it), the issue should be escalated to the team manager or other leadership, depending on the defined escalation policy.
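To make the acknowledge-or-escalate flow concrete, here is a minimal Python sketch. It is not the implementation built in this chapter. The names `Incident`, `EscalationStep`, and `contacts_to_notify`, the contact labels, and the 15-minute threshold are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class EscalationStep:
    """One rung of a hypothetical escalation policy."""
    contact: str        # who to notify at this step (illustrative label)
    after_seconds: int  # delay since the incident was opened


@dataclass
class Incident:
    """An issue reported by a monitoring system."""
    title: str
    opened_at: float                        # epoch seconds
    acknowledged_at: Optional[float] = None

    def acknowledge(self, now: float) -> None:
        # Acknowledging stops any further escalation.
        self.acknowledged_at = now


def contacts_to_notify(incident: Incident,
                       policy: List[EscalationStep],
                       now: float) -> List[str]:
    """Return every contact whose step is due; empty once acknowledged."""
    if incident.acknowledged_at is not None:
        return []
    elapsed = now - incident.opened_at
    return [step.contact for step in policy if elapsed >= step.after_seconds]


# Assumed policy: page the on-call engineer immediately,
# then the team manager after 15 minutes without acknowledgement.
policy = [
    EscalationStep(contact="on-call-engineer", after_seconds=0),
    EscalationStep(contact="team-manager", after_seconds=900),
]

incident = Incident(title="disk usage above threshold", opened_at=0.0)
first_round = contacts_to_notify(incident, policy, now=60.0)
escalated = contacts_to_notify(incident, policy, now=1000.0)

incident.acknowledge(now=1010.0)
after_ack = contacts_to_notify(incident, policy, now=2000.0)
```

In this sketch, the escalation policy is plain data (an ordered list of steps), so changing who gets notified and when does not require touching the notification logic.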
High-level solution design
The following diagram summarizes the implementation:
Figure 3.6 – Alert Manager architecture diagram
Let’s take a quick look at how this works...