Understanding SLAs and KPIs in DevOps
In the Understanding IT delivery in enterprises section, we learned that in DevOps, IT delivery and IT service management processes are still valid. Typically, enterprises contract SLAs and KPIs to fulfill these processes so that these enable the business goals. If one of the processes fails, the delivery of the product will be impacted and as an ultimate consequence, the business will not achieve its goals, such as an agreed delivery date or go live release of the product. Hence, understanding SLAs and KPIs is important for any architect. This is why it is included in the sourcing model that we discussed in the IT delivery in sourcing models section.
Service-level agreements are positioned between the tactical processes of DevOps and the strategic level at the enterprise level where the goals are set. SLAs and KPIs should support these goals and guide the DevOps process.
The six most important metrics that should be included in SLAs for DevOps are as follows:
- Frequency of deployments: Typically, DevOps teams work in sprints, a short period of time in which the team works on a number of backlog items as part of the next release of a product. The KPI measures how often new features are launched on a regular basis. Keep in mind that releases of new features can be scheduled on a monthly (often spanning multiple sprints), weekly, or even daily basis.
- Deployment time: The time that elapses between the code being released after the test phase to preproduction and ultimately production, including the ready state of the infrastructure.
- Deployment failure rate: This refers to the rate of outages that occur after a deployment. Ideally, this should be zero, but this is not very realistic. Deployments – especially when the change rate is high – will fail every now and then. Obviously, the number should be as low as possible.
- Deployment failure detection time: This KPI strongly relates to the previous one. Failures will occur, but then the question is, how fast are these detected and when will mitigating actions to resolve these issues be taken? This KPI is often also referred to as Mean Time to Recovery (MTTR). This is the most important KPI in DevOps cycles.
- Change lead time: This is the time that elapses between the last release and the next change to occur. Subsequently, it is measured in terms of how long the team will need to address the change. Shorter lead times indicate that the team works efficiently.
- Full cycle time: The total time that elapses for each iteration or each deployment.
This list is by no means exhaustive. Enterprises can think of a lot of different metrics and KPIs. But the advice here is to keep things simple. Keep in mind that every metric that is included in any contract needs to be monitored and reported, which can become very cumbersome. One more thing to remember is that the most important metric sits at the business level. Ultimately, the only thing that really counts is how satisfied the customer of the business is or, better said: what's the value that's delivered to the end customer?
In the final section of this chapter, we will elaborate on the term value by explaining the VOICE model.