Applying monitoring principles in SRE
Reliability is a measurable quality. To measure the quality and reliability of their systems, teams need real-time information on the status of those systems. As mentioned in the previous section, the TTD is a crucial driver in calculating risk and, subsequently, in determining the SLO. Observability is therefore critical in SRE. However, SRE holds to the principle that monitoring should be as simple as possible. It uses the four golden signals:
- Latency: The time that a system needs to return a response.
- Traffic: The demand placed on the system, for example, the number of requests per second.
- Errors: The number of requests placed on a system that fail completely or partially.
- Saturation: How much of the maximum load that a system can handle is currently in use.
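To make the four signals concrete, here is a minimal sketch of how they could be derived from the requests seen in one measurement window. The names `Request`, `golden_signals`, `window_seconds`, and `capacity_rps` are hypothetical and chosen only for illustration. The sketch assumes that latency is reported as the 95th percentile and that saturation is approximated as traffic divided by a known maximum request rate. Real systems often measure saturation on the most constrained resource, such as CPU, memory, or I/O.

```python
import math
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Request:
    """One observed request (hypothetical record type for this sketch)."""
    latency_ms: float  # time the system needed to return a response
    failed: bool       # True if the request failed completely or partially


def golden_signals(requests: List[Request],
                   window_seconds: float,
                   capacity_rps: float) -> Dict[str, float]:
    """Summarize the four golden signals for one measurement window.

    capacity_rps is an assumed, known maximum request rate for the system.
    """
    if window_seconds <= 0 or capacity_rps <= 0:
        raise ValueError("window_seconds and capacity_rps must be positive")

    count = len(requests)

    # Traffic: requests per second over the window.
    traffic_rps = count / window_seconds

    # Errors: fraction of requests that failed.
    error_rate = sum(r.failed for r in requests) / count if count else 0.0

    # Latency: 95th percentile using the nearest-rank method.
    latencies = sorted(r.latency_ms for r in requests)
    p95_ms = latencies[math.ceil(0.95 * count) - 1] if count else 0.0

    # Saturation: share of the assumed maximum load currently in use.
    saturation = traffic_rps / capacity_rps

    return {
        "latency_p95_ms": p95_ms,
        "traffic_rps": traffic_rps,
        "error_rate": error_rate,
        "saturation": saturation,
    }
```

A monitoring rule can then be a simple comparison against one of these values, for example flagging a window when `error_rate` exceeds a threshold that the team has agreed on.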
Monitoring rules are defined based on these signals. Because SRE starts from avoiding excess operational work, or toil, the monitoring rules follow the same philosophy. Monitoring...