Objectives – SLOs/SLIs
Service-level objectives (SLOs), a concept referenced many times in Google's SRE handbook, can be a great help in setting your direction from the start. Choosing the right objective, however, can be trickier than you might think.
My personal experience aligns with Google's recommendation that an SLO – which sets the reliability target for a service as experienced by its customers – should be under 100%.
This is for multiple reasons. Achieving 100% is not just very hard and extremely expensive, but almost impossible, given that nearly all services have soft or hard dependencies on other services. If just one of your hard dependencies offers less than 100% availability, your SLO cannot be met. Also, even with every precaution you can take, and every redundancy in place, there is a non-zero probability that something (or many things) will fail, resulting in less than 100% availability. More importantly, even if you could achieve 100% reliability for your services, your customers would very likely not experience it: the path they must take (the systems they have to use) to reach your services is itself likely to offer less than 100% availability.
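To make the dependency argument concrete, here is a minimal Python sketch, assuming hard dependencies that fail independently and in series (the availability figures are invented for illustration):

```python
# A minimal sketch, assuming serial, independently failing hard
# dependencies; a simplification, but a useful best-case bound.

def composite_availability(dependencies: list[float]) -> float:
    """Best-case availability of a service given the availabilities
    (as fractions) of its hard dependencies."""
    result = 1.0
    for availability in dependencies:
        result *= availability
    return result

# Even with a perfectly reliable service of your own, three dependencies
# at 99.9% each already cap you below 99.9%:
print(f"{composite_availability([0.999, 0.999, 0.999]):.4%}")  # 99.7003%
```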
Most commercial internet providers, for example, offer 99% availability. This also means that as you go higher and higher, say from 99% to 99.9%, or to the extreme five nines (99.999%) that providers such as IBM advertise, the cost of achieving and maintaining that availability grows sharply with each "nine" you add, while your customers experience less and less of your efforts – which makes the objective questionable.
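The following back-of-the-envelope Python sketch shows the downtime each SLO actually permits per year; the allowance shrinks roughly tenfold with every added nine, while the engineering cost moves the other way:

```python
# A rough illustration of what each extra "nine" buys: the downtime an
# SLO permits per year drops roughly tenfold with every nine added.

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

for slo in (0.99, 0.999, 0.9999, 0.99999):
    allowed_downtime = (1 - slo) * MINUTES_PER_YEAR
    print(f"{slo * 100:g}% -> {allowed_downtime:,.1f} minutes of downtime per year")
# 99%     -> 5,256.0 minutes (about 3.7 days)
# 99.9%   ->   525.6 minutes (about 8.8 hours)
# 99.99%  ->    52.6 minutes
# 99.999% ->     5.3 minutes
```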
Above the selected SLO threshold, almost all users should be “happy,” and below this threshold, users are likely to be unhappy, raise concerns, or just stop using the service.
Once you've agreed that you should aim for an SLO of less than 100%, but likely somewhere around or above 99%, how do you define the right baseline?
This is where service-level indicators (SLIs), service-level agreements (SLAs), and error budgets come into play. I will not detail all of these here, but if you are interested, please refer to Google’s SRE book (https://sre.google/books/) for more details on the subject.
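To give a taste of how these pieces fit together, here is an illustrative, request-based SLI and error-budget calculation in Python; the 99.9% SLO anticipates the discussion below, and all request counts are invented:

```python
# A hand-wavy, request-based SLI example: SLI = good events / total events.
# The SLO matches the 99.9% discussed below; the counts are made up.

SLO = 0.999
total_requests = 1_000_000
failed_requests = 850

sli = (total_requests - failed_requests) / total_requests  # fraction of "good" events
error_budget = (1 - SLO) * total_requests                  # failures the SLO tolerates
remaining_budget = error_budget - failed_requests

print(f"SLI: {sli:.4%}")                                     # SLI: 99.9150%
print(f"Error budget: {error_budget:.0f} failed requests")   # 1000
print(f"Budget remaining: {remaining_budget:.0f} requests")  # 150
```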
Let's say you picked an SLO of 99.9% – which is, based on my personal experience, the most common go-to for businesses these days. You now have to consider your core operational metrics. DevOps Research and Assessment (DORA) suggests four key metrics that indicate the performance of a DevOps team, ranking teams from "low" to "elite," where "elite" teams are more likely to meet or even exceed their goals and delight their customers than "low"-ranking teams.
These four metrics are as follows (a small classification sketch in Python follows the list):
- Lead time for change, a metric that quantifies the duration from code commit to production deployment, is in my view one of the most crucial indicators. It serves as a measure of your team’s agility and responsiveness. How swiftly can you resolve a bug? Think about it this way:
- Low-performing: 1 month to 6 months of lead time
- Medium-performing: 1 week to 1 month of lead time
- High-performing: 1 day to 1 week of lead time
- Elite-performing: Less than 1 day of lead time
- Deployment frequency, which measures how often successful releases reach production. The key word here is successful, as a Dev team that constantly pushes broken code through the pipeline is not great:
- Low-performing: 1 month to 6 months between deployments
- Medium-performing: 1 week to 1 month between deployments
- High-performing: 1 day to 1 week between deployments
- Elite-performing: Multiple deployments per day/less than 1 day between deployments
- Change failure rate (CFR), which measures the percentage of deployments that result in a failure in production requiring a bug fix or rollback. The goal is to release as frequently as possible, but what is the point if your team is constantly rolling back those changes or causing incidents by releasing bad updates? By tracking this rate, you can see how often your team is fixing something that could have been avoided:
- Low-performing: 45% to 60% CFR
- Medium-performing: 15% to 45% CFR
- High-performing: 0% to 15% CFR
- Elite-performing: 0% to 15% CFR
- Mean time to restore (MTTR) measures how long it takes an organization to recover from a failure. This is measured from the initial moment of an outage until the incident team has recovered all services and operations. Another key and related metric is mean time to acknowledge (MTTA), which measures the time it takes to be aware of and confirm an issue in production:
- Low-performing: 1 week to 1 month of downtime
- Medium- and high-performing: Less than 24 hours of downtime
- Elite-performing: Less than 1 hour of downtime
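As promised above, here is a small, illustrative Python sketch that maps raw measurements onto these tiers; the thresholds are copied from the bands in this section, while the function names and structure are my own and not part of DORA itself:

```python
# Illustrative tier classification using the bands listed above.
# (Deployment frequency follows the same day/week/month bands as lead time.)

def lead_time_tier(days: float) -> str:
    if days < 1:
        return "elite"
    if days <= 7:
        return "high"
    if days <= 30:
        return "medium"
    return "low"

def change_failure_rate_tier(cfr: float) -> str:
    if cfr <= 0.15:
        return "high or elite"   # the 0-15% band covers both tiers above
    if cfr <= 0.45:
        return "medium"
    return "low"

def time_to_restore_tier(hours: float) -> str:
    if hours < 1:
        return "elite"
    if hours < 24:
        return "medium or high"  # this section groups these together
    return "low"

# Example: a team shipping within a day, with a 10% CFR and 3-hour MTTR:
print(lead_time_tier(days=0.8))            # elite
print(change_failure_rate_tier(cfr=0.10))  # high or elite
print(time_to_restore_tier(hours=3))       # medium or high
```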
In conclusion, SLOs are crucial in setting reliability targets for a service, and the recommendation is to keep them under 100% to account for dependencies and potential failures. Tools such as SLIs, SLAs, and error budgets are essential in defining the appropriate SLO baseline, usually around or above 99%. We have also highlighted the importance of the core operational metrics suggested by DORA in assessing the performance of a DevOps team. These metrics – lead time for change, deployment frequency, change failure rate, and MTTR – provide tangible criteria for measuring and improving a team's efficiency and effectiveness in service delivery and incident response.