Alerting on work metrics
We set out to turn all our observability data into actionable information so that teams can experiment with the confidence that they can fail forward fast. But we need to avoid alert fatigue. If we over-alert, then the alerts become noise and the team will start to ignore them. As a result, their confidence will go down, their lead times will go up, and innovation will stagnate.
Instead, we need to zero in on the high-value metrics and only wake the team up in the middle of the night when these Key Performance Indicators (KPIs) go sideways. However, it is hard to determine which of the many metrics are the key metrics. So, let’s start with some examples.
Netflix provides the classic example (https://netflixtechblog.com/sps-the-pulse-of-netflix-streaming-ae4db0e05f8a). They have identified one metric as the most important indicator of a significant problem with their system. This one metric is the rate at which users press the Play button. They...