Writing great alerts using SLIs and SLOs
A service level indicator (SLI) is a measurement that indicates the current level of a service. An example is the number of errors over a 15-minute period.
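As a rough illustration of the example above, the following Python sketch counts errors over a trailing 15-minute window. The function name `error_count_sli` and the event format (a list of timestamp and error-flag pairs) are assumptions made for this sketch; in practice the measurement would usually come from a query against your metrics backend.

```python
from datetime import datetime, timedelta

def error_count_sli(events, now, window=timedelta(minutes=15)):
    """Count error events in the trailing window that ends at `now`.

    `events` is a hypothetical list of (timestamp, is_error) pairs.
    """
    start = now - window
    return sum(1 for ts, is_error in events if is_error and start < ts <= now)

now = datetime(2024, 1, 1, 12, 0)
events = [
    (now - timedelta(minutes=20), True),   # error, but outside the window
    (now - timedelta(minutes=10), True),   # error inside the window
    (now - timedelta(minutes=5), False),   # successful request
    (now - timedelta(minutes=1), True),    # error inside the window
]

print(error_count_sli(events, now))  # prints 2
```

The window length is a parameter, so the same function can be reused for SLIs measured over other periods.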
It is best practice to keep the number of SLIs small; three to five SLIs per service is a good rule of thumb. This reduces confusion and allows teams to focus on what is critical for their service. SLIs can also be thought of as a fractal concept: a service team can have indicators for its component of a larger system, and the system as a whole can also be tracked by a small number of SLIs, for example, the number of services that are failing their SLOs. Keeping the number of SLIs relatively small reduces the potential for spurious alerts and keeps the overhead of continuously monitoring services low. This means more services can be monitored without scaling up the monitoring tools or increasing operating costs.
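The fractal idea can be sketched as follows: each service reports one SLI value and its SLO target, and the system-level SLI is simply the count of services below target. The `ServiceStatus` class, its field names, and the sample values are all hypothetical, and the sketch assumes an SLI where higher is better (such as the fraction of successful requests).

```python
from dataclasses import dataclass

@dataclass
class ServiceStatus:
    name: str
    sli_value: float   # assumed: fraction of successful requests (higher is better)
    slo_target: float  # assumed: minimum acceptable value for the SLI

def services_failing_slo(statuses):
    """Return the names of services whose SLI is below its SLO target."""
    return [s.name for s in statuses if s.sli_value < s.slo_target]

statuses = [
    ServiceStatus("checkout", sli_value=0.999, slo_target=0.995),
    ServiceStatus("search", sli_value=0.98, slo_target=0.99),
    ServiceStatus("login", sli_value=0.9999, slo_target=0.999),
]

failing = services_failing_slo(statuses)
# The system-level SLI is the number of services failing their SLOs.
print(len(failing), failing)  # prints 1 ['search']
```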
The patterns we discussed earlier of RED, USE, golden signals, and core...