DevOps has taken the IT world by storm and is increasingly becoming the de facto industry standard for software development. DevOps principles can be a competitive differentiator, allowing teams to deliver high-quality software faster while still meeting customer requirements.
DevOps prevents the development and operations teams from functioning in two distinct silos and ensures seamless collaboration between all the stakeholders. Collecting feedback and incorporating it play a critical role in implementing DevOps and building a CI/CD pipeline.
Successful transition to DevOps is a journey, not a destination. Setting up benchmarks, measuring yourself against them, and tracking your progress are important for determining which stage of DevOps adoption you are in and for ensuring a smooth journey onward.
Feedback loops are a critical enabler for delivery of the application, and metrics help transform qualitative feedback into quantitative form. Collecting feedback from the stakeholders is only half the work. Gathering insights and communicating them through the DevOps team to keep the CI/CD pipeline on track is equally important. This is where the role of metrics comes in.
DevOps metrics are the tools your team needs to ensure that feedback is collected and communicated to the right people, so that existing processes and functions in a unit can be improved.
Here are 7 DevOps metrics that your team needs to track for a successful DevOps transformation:
Quick iteration and continuous delivery are key measurements of DevOps success. Deployment frequency and deployment time capture them: how often deployments take place and how long the software takes to deploy. Tracking how frequently new code is deployed helps you keep track of the development process.
The ultimate goal is to release smaller deployments of code as quickly as possible. Smaller deployments are easier to test and release. They also make bugs in the code easier to discover, allowing for faster and more timely resolution.
Determining the frequency of deployments needs to be done separately for development, testing, staging, and production environments. Keeping track of the frequency of deployment to QA or pre-production environments is also an important consideration.
A high deployment frequency is a tell-tale sign that things are going smoothly in the production cycle. Because smaller deployments are easier to test and release, a higher deployment frequency generally corresponds to higher efficiency.
No wonder tech giants such as Amazon and Netflix deploy code thousands of times a day. Amazon has built a deployment engine called Apollo that has performed more than 50 million deployments in 12 months, which is more than one deployment per second. Deploying in such small increments results in fewer outages and less downtime.
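As a minimal sketch of how deployment frequency could be tracked per environment, the snippet below averages deployments per day over a time window. The log format (an environment name paired with a date), the function name `deployments_per_day`, and the sample data are assumptions made for illustration, not the output of any particular tool.

```python
from collections import Counter
from datetime import date


def deployments_per_day(deployments, start, end):
    """Average deployments per day for each environment within [start, end]."""
    days = (end - start).days + 1  # inclusive window length
    counts = Counter(env for env, day in deployments if start <= day <= end)
    return {env: count / days for env, count in counts.items()}


# Hypothetical deployment log: (environment, date of deployment).
log = [
    ("staging", date(2024, 1, 1)),
    ("staging", date(2024, 1, 1)),
    ("production", date(2024, 1, 2)),
    ("staging", date(2024, 1, 3)),
    ("production", date(2024, 1, 4)),
    ("staging", date(2024, 1, 4)),
]

frequency = deployments_per_day(log, date(2024, 1, 1), date(2024, 1, 4))
for env, per_day in sorted(frequency.items()):
    print(f"{env}: {per_day:.2f} deployments/day")
```

Keeping the environments separate in the result makes it easy to compare, for example, how often code reaches staging against how often it reaches production.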
Any deployment that causes issues or outages for your users is a failed deployment. Tracking the percentage of deployments that result in negative feedback from the user’s end is an important DevOps metric.
DevOps teams are expected to build quality into the product right from the beginning of the project. The responsibility for ensuring the quality of the software is shared across the entire team and not centered on QA alone. While in an ideal scenario there should be no failed deployments, that is often not the case.
Tracking the percentage of deployments that result in negative sentiment helps you ascertain the ground-level realities and leaves you better prepared for such occurrences in the future. Only if you know what is wrong can you formulate a plan to fix it.
While a failure rate of 0 is the magic number, fewer than 5% failed deployments is considered workable. If the metric consistently shows failed deployments spiking above 10%, the existing process needs to be broken down into smaller segments with mini-deployments. Fixing 5 issues in 100 deployments is far easier than fixing 50 in 1,000 within the same time frame.
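The thresholds above can be expressed as a small check. This is only a sketch: the function names are hypothetical, and the "watch closely" label for rates between 5% and 10% is an assumption, since that band is not given a verdict in the text.

```python
def failed_deployment_rate(total_deployments, failed_deployments):
    """Percentage of deployments that caused issues or outages for users."""
    if total_deployments <= 0:
        raise ValueError("total_deployments must be positive")
    return 100 * failed_deployments / total_deployments


def assess(rate_percent):
    """Map a failure rate to the guidance given in the text."""
    if rate_percent < 5:
        return "workable"
    if rate_percent > 10:
        return "split into smaller deployments"
    return "watch closely"  # assumption: the 5-10% band is not defined in the text


rate = failed_deployment_rate(total_deployments=100, failed_deployments=12)
print(f"{rate:.1f}% failed: {assess(rate)}")
```

Because the text talks about a rate that is consistently above 10%, it would make sense to run this check over several reporting periods rather than reacting to a single bad week.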
Code committed is a DevOps metric that tracks the number of commits the team makes to the software before it can be deployed into production. This serves as an indicator of the development velocity as well as the code quality. The number of code commits that a team makes has to be within the optimum range defined by the DevOps team.
Too many commits may indicate low quality or a lack of direction in development. Similarly, too few commits may indicate that the team is overtaxed and unproductive. Uncovering the reason behind the variation in code committed is important for maintaining productivity and project velocity while also keeping team members satisfied.
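One way to make the "optimum range" concrete is a simple bounds check on commits per deployment. The bounds used here (5 and 40) are purely hypothetical placeholders, as is the function name; each team would define its own range.

```python
def commit_signal(commits_per_deployment, low, high):
    """Compare the commit count for a deployment against a team-defined range."""
    if commits_per_deployment < low:
        return "below range"
    if commits_per_deployment > high:
        return "above range"
    return "within range"


# Hypothetical commit counts for three recent deployments.
recent = {"release-1": 3, "release-2": 18, "release-3": 57}
signals = {name: commit_signal(count, low=5, high=40) for name, count in recent.items()}
print(signals)
```

A value outside the range is a prompt to investigate the cause, not a verdict on the team.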
The software development cycle is a continuous process where new code is constantly developed and successfully deployed to production. Lead time for changes in DevOps is the time taken to go from code committed to code successfully running in production. It is an important indicator for determining the efficiency of the existing process and identifying possible areas of improvement.
Tracking lead time and mean time to change (MTTC) gives the DevOps team a better hold on the project. By measuring the time that passes between a change's inception and its deployment to production, you can gauge the team's ability to adapt as project requirements evolve.
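A rough sketch of the lead time calculation, assuming each change is recorded as a pair of timestamps (when the code was committed and when it was running in production). The record layout, function names, and sample values are illustrative assumptions.

```python
from datetime import datetime, timedelta
from statistics import mean


def lead_times(changes):
    """Lead time for each change: production deploy time minus commit time."""
    return [deployed - committed for committed, deployed in changes]


def mean_lead_time(changes):
    """Average lead time across all changes, returned as a timedelta."""
    seconds = mean(lt.total_seconds() for lt in lead_times(changes))
    return timedelta(seconds=seconds)


# Hypothetical changes: (commit timestamp, production deploy timestamp).
changes = [
    (datetime(2024, 3, 1, 9, 0), datetime(2024, 3, 1, 15, 0)),   # 6 hours
    (datetime(2024, 3, 2, 10, 0), datetime(2024, 3, 3, 10, 0)),  # 24 hours
    (datetime(2024, 3, 4, 8, 0), datetime(2024, 3, 4, 14, 0)),   # 6 hours
]

print("Mean lead time:", mean_lead_time(changes))
```

Looking at the individual lead times alongside the mean helps spot the occasional slow change that an average would otherwise hide.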
Errors in any software application are inevitable. A few occasional errors aren’t a red flag but keeping track of the error rates and being on the lookout for any unusual spikes is important for the health of your application.
A significant rise in error rate is an indicator of inherent quality problems and ongoing performance-related issues. The errors that you encounter can be of two types: bugs and production issues. Bugs are the exceptions in the code discovered after deployment. Production issues, on the other hand, are problems such as failed database connections and query timeouts.
The error rate is calculated as the share of transactions that result in an error during a particular time window. For a specified time duration, if 20 out of 1,000 transactions have errors, the error rate is 20/1000, or 2 percent. A few intermittent errors throughout the application life cycle are normal, but any unusual spikes need to be watched for. The process needs to be analysed for bugs and production issues, and the exceptions need to be handled as they occur.
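The calculation above can be sketched in a few lines. The spike check is an added illustration: comparing the current rate against a baseline with a multiplier of 2 is an assumption, not a rule stated in the text, and both function names are hypothetical.

```python
def error_rate(total_transactions, failed_transactions):
    """Percentage of transactions in a time window that resulted in an error."""
    if total_transactions <= 0:
        raise ValueError("total_transactions must be positive")
    return 100 * failed_transactions / total_transactions


def is_spike(current_rate, baseline_rate, factor=2.0):
    """Flag a window whose error rate exceeds the baseline by the given factor.

    The factor is a hypothetical threshold; tune it to your application.
    """
    return current_rate > baseline_rate * factor


window_rate = error_rate(total_transactions=1000, failed_transactions=20)
print(f"Error rate: {window_rate:.1f}%")
print("Spike:", is_spike(window_rate, baseline_rate=0.5))
```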
Issues happen in every project but how fast you discover the issues is what matters. Having robust application monitoring and optimal coverage would help you find out any issues that happen as quickly as possible.
The mean time to detection (MTTD) metric is the average amount of time that passes between the beginning of an issue and the moment it is detected and remedial action begins. The time taken to fix the issue is not covered under MTTD.
DevOps teams should strive to keep MTTD as low as possible (ideally close to zero); that is, they should be able to detect issues as soon as they occur. A proper protocol and clear communication channels need to be in place to help the team discover errors quickly and respond to them rapidly.
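As an illustration, MTTD can be computed from incident records that note when an issue began and when it was detected. The tuple layout, the function name, and the sample incidents are assumptions for this sketch.

```python
from datetime import datetime


def mean_time_to_detection(incidents):
    """Average minutes between the start of an issue and its detection."""
    if not incidents:
        raise ValueError("at least one incident is required")
    delays = [
        (detected - started).total_seconds() / 60
        for started, detected in incidents
    ]
    return sum(delays) / len(delays)


# Hypothetical incidents: (issue started, issue detected).
incidents = [
    (datetime(2024, 5, 1, 12, 0), datetime(2024, 5, 1, 12, 4)),   # 4 minutes
    (datetime(2024, 5, 9, 3, 30), datetime(2024, 5, 9, 3, 40)),   # 10 minutes
]

mttd_minutes = mean_time_to_detection(incidents)
print(f"MTTD: {mttd_minutes:.1f} minutes")
```

In practice, the start time of an issue is often only known after the fact, so these records are usually completed during incident review.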
Time to restore service, or mean time to recovery (MTTR), is a critical part of any project. It is the average time taken by the team to repair a failure in the system. It comprises the time from failure detection until the system is operating normally again. Recovery and resilience are key components that determine the market readiness of a project.
MTTR is an important DevOps metric because it allows for tracking of complex issues and failures while judging the capability of the team to handle change and bounce back again. Recovery time should be as low as possible, thus minimizing the overall system downtime.
System downtimes and outages, though undesirable, are unavoidable. This especially holds true in the current development scenario where companies are making the move to the cloud. Designing for failure is a concept that needs to be ingrained right from the start. Even major applications like Facebook, WhatsApp, Twitter, Cloudflare, and Slack are not free of outages. What matters is that the downtime is kept minimal. Mean time to recovery thus becomes critical for understanding how long the DevOps team would need to bring the system back on track.
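A minimal sketch of the MTTR calculation, following the definition above (from failure detection until normal operation resumes). The record layout, the function name, and the sample outages are hypothetical.

```python
from datetime import datetime, timedelta


def mean_time_to_recovery(incidents):
    """Average time from failure detection to restored service."""
    if not incidents:
        raise ValueError("at least one incident is required")
    durations = [restored - detected for detected, restored in incidents]
    # Summing timedeltas needs an explicit timedelta() start value.
    return sum(durations, timedelta()) / len(durations)


# Hypothetical outages: (failure detected, service restored).
outages = [
    (datetime(2024, 6, 3, 14, 0), datetime(2024, 6, 3, 14, 30)),  # 30 minutes
    (datetime(2024, 6, 20, 8, 0), datetime(2024, 6, 20, 9, 30)),  # 90 minutes
]

mttr = mean_time_to_recovery(outages)
print("MTTR:", mttr)
```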
DevOps isn’t just about tracking metrics; it is primarily about the culture. Organizations that make the transition to DevOps place immense emphasis on one goal: rapid delivery of stable, high-quality software through automation and continuous delivery.
Simply having a bunch of numbers in the form of DevOps metrics isn’t going to get you across the line. You need a long-term vision combined with the valuable insights that the metrics provide. Only by monitoring these metrics over time and tracking your team’s progress toward the goals you have set can you hope to reap the true benefits that DevOps offers.
Vinati Kamani writes about emerging technologies and their applications across various industries for Arkenea, a custom software development and DevOps consulting company. When she's not at her desk penning articles or reading up on recent trends, she can be found traveling to remote places and soaking up different cultural experiences.