By instrumenting our applications and deploying the infrastructure needed to scrape their metrics, we now have the means to evaluate the SLIs for each of our services. Once we define a suitable set of SLOs for those SLIs, the next item on our checklist is to deploy an alerting system that automatically notifies us whenever an SLO stops being met.
A typical alert specification looks like this:
When the value of metric X exceeds threshold Y for Z time units, then execute actions a1, a2, ..., an
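To make the specification above concrete, here is a minimal Python sketch of how such a rule could be evaluated. Everything in it is an assumption made for illustration: the AlertRule class, its field names, the http_error_ratio metric, and the 5% / 60 second values are hypothetical and do not correspond to any particular alerting product. The sketch only shows the core idea that the threshold must be exceeded continuously for Z time units before the actions run.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class AlertRule:
    """Hypothetical rule: run `actions` once `metric` has stayed above
    `threshold` (Y) for at least `hold_secs` (Z)."""
    metric: str
    threshold: float
    hold_secs: float
    actions: List[Callable[[str, float], None]]
    _exceeded_since: Optional[float] = field(default=None, init=False)
    _firing: bool = field(default=False, init=False)

    def observe(self, value: float, now: float) -> bool:
        """Feed one sample taken at time `now` (in seconds).
        Returns True only on the sample that triggers the actions."""
        if value <= self.threshold:
            # Condition cleared: reset the timer and the firing state.
            self._exceeded_since = None
            self._firing = False
            return False
        if self._exceeded_since is None:
            # First sample above the threshold starts the timer.
            self._exceeded_since = now
        held_for = now - self._exceeded_since
        if held_for >= self.hold_secs and not self._firing:
            self._firing = True
            for action in self.actions:  # a1, a2, ..., an
                action(self.metric, value)
            return True
        return False


# Usage: alert when the (assumed) error ratio stays above 5% for 60 seconds.
notifications = []
rule = AlertRule(
    metric="http_error_ratio",
    threshold=0.05,
    hold_secs=60.0,
    actions=[lambda name, value: notifications.append((name, value))],
)
samples = [(0, 0.01), (30, 0.08), (60, 0.09), (90, 0.07), (120, 0.10)]
fired = [rule.observe(value, now) for now, value in samples]
print(fired)  # [False, False, False, True, False]
```

In this sketch the timer starts at the first sample above the threshold (t=30), so the actions run at t=90, and they do not run again while the condition persists. Whether a real system should re-notify on a persisting condition is a design choice this example deliberately leaves out.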
What is the first thought that springs to mind when you hear a fire alarm going off? Most people will probably answer something along the lines of "there might be a fire nearby". People are naturally conditioned to assume that alerts are always temporally correlated with an issue that must be addressed immediately...