Architecting SRE using KPIs
Before we define the KPIs, we need to return to the basic principles of SRE. SRE teams focus on reliability, scalability, availability, performance, efficiency, and response. These are all measurable, so we can transform them into KPIs. In this section, we will learn how to do that using Service-Level Objectives (SLOs), Service-Level Indicators (SLIs), and the error budget.
The main KPIs that we use in SRE are as follows:
- SLOs: In SRE, an SLO defines how good a system should be. An SLO is much more precise than an SLA, which comprises many different KPIs. You could also say that an SLA comprises a number of SLOs. The two differ in who agrees to them: an SLO is an agreement between the developers in the SRE team and the product owner of the service, whereas an SLA is an agreement between the service supplier and the end user.
The SLO is a target value. For example, the web frontend should be able to handle hundreds of requests per minute. Don't make it too complex...
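To make the idea of an SLO as a single target value concrete, the following is a minimal Python sketch, not a definitive implementation. The `SLO` class, the `meets_slo` function, and the figure of 300 requests per minute are all hypothetical: the number simply stands in for "hundreds of requests per minute" from the example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLO:
    """One target value for one measurable item (hypothetical model)."""
    name: str
    target: float  # the value the service should reach or exceed
    unit: str


def meets_slo(slo: SLO, measured: float) -> bool:
    """Return True when the measured value reaches the target or better."""
    return measured >= slo.target


# Assumption: 300 is an illustrative stand-in for "hundreds of requests per minute".
frontend_throughput = SLO(
    name="web-frontend-throughput",
    target=300.0,
    unit="requests/minute",
)

if __name__ == "__main__":
    # Compare two made-up measurements against the target.
    for measured in (250.0, 420.0):
        status = "met" if meets_slo(frontend_throughput, measured) else "missed"
        print(f"{frontend_throughput.name}: {measured:.0f} "
              f"{frontend_throughput.unit} -> {status}")
```

Keeping the SLO to one name, one target, and one unit reflects the advice not to make it too complex: a single comparison is enough to tell whether the objective was met.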