DevOps for Databases

A practical guide to applying DevOps best practices to data-persistent technologies

Author: David Jambor
Product type: Paperback
Published in: Dec 2023
Publisher: Packt
ISBN-13: 9781837637300
Length: 446 pages
Edition: 1st Edition

Objectives – SLOs/SLIs

Service-level objectives (SLOs), a concept referenced many times in Google’s SRE handbook, can help you set your direction from the start. Choosing the right objective, however, can be trickier than you might think.

My personal experience aligns with Google’s recommendation that an SLO, which sets the reliability target for the service your customers use, should be under 100%.

There are several reasons for this. Achieving 100% is not just very hard and extremely expensive; it is almost impossible, because almost all services have soft or hard dependencies on other services. If just one of your dependencies offers less than 100% availability, a 100% SLO cannot be met. Also, even with every precaution taken and every redundancy in place, there is a non-zero probability that something (or many things) will fail, resulting in less than 100% availability. More importantly, even if you could achieve 100% reliability for your services, your customers would very likely not experience it, because the path they must take (the systems they have to use) to access your services is likely to have an SLO below 100%.
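The dependency argument can be made concrete with a little arithmetic. The following is a minimal sketch, assuming that failures are independent and that every component on the path is a hard (serial) dependency; both are simplifications, and the availability figures and the function name `composite_availability` are hypothetical.

```python
from math import prod


def composite_availability(availabilities):
    """Availability of a request path in which every component must be up.

    Assumes independent failures and purely serial (hard) dependencies,
    so the result is simply the product of the individual availabilities.
    """
    return prod(availabilities)


# Hypothetical path: a service that never fails on its own (1.0), two
# dependencies, and the customer's internet connection.
path = [1.0, 0.9995, 0.999, 0.99]
print(f"Availability seen by the customer: {composite_availability(path):.4%}")
```

Even with a service that never fails on its own, the customer in this sketch sees less than 99%, because the product can never exceed the weakest component on the path.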

Most commercial internet providers, for example, offer 99% availability. This also means that as you go higher, say from 99% to 99.9% or to IBM’s extreme five nines (99.999%), the cost of achieving and maintaining that availability rises significantly with every “nine” you add, while your customers experience less and less of the benefit, which makes the objective questionable.
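To see what each additional “nine” means in practice, it helps to translate an availability percentage into the downtime it permits. The sketch below assumes a 365-day year as the measurement period; the function name `allowed_downtime` is illustrative only.

```python
from datetime import timedelta


def allowed_downtime(slo_percent, period=timedelta(days=365)):
    """Maximum downtime within `period` that still meets the given SLO."""
    return period * (1 - slo_percent / 100)


# Compare the availability levels mentioned in the text.
for slo in (99.0, 99.9, 99.999):
    print(f"{slo}% -> {allowed_downtime(slo)} of downtime per year")
```

Over a 365-day year this works out to roughly 3.65 days at 99%, 8.76 hours at 99.9%, and 5.26 minutes at 99.999%, which shows how quickly the margin for failure shrinks.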

Above the selected SLO threshold, almost all users should be “happy,” and below this threshold, users are likely to be unhappy, raise concerns, or just stop using the service.

Once you’ve agreed that you should look for an SLO less than 100%, but likely somewhere above or around 99%, how do you define the right baseline?

This is where service-level indicators (SLIs), service-level agreements (SLAs), and error budgets come into play. I will not detail all of these here, but if you are interested, please refer to Google’s SRE book (https://sre.google/books/) for more details on the subject.
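As a rough illustration of how an SLI and an error budget relate to an SLO, the following sketch uses a request-based availability SLI (good requests divided by total requests). The function names, the request counts, and the choice of a request-based SLI are assumptions made for this example, not definitions taken from the book.

```python
def availability_sli(good_events, total_events):
    """SLI: fraction of events that were good (for example, successful requests)."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    return good_events / total_events


def error_budget_remaining(good_events, total_events, slo=0.999):
    """Fraction of the error budget still unspent (negative if the SLO is breached)."""
    if not 0 < slo < 1 or total_events <= 0:
        raise ValueError("slo must be between 0 and 1, and total_events positive")
    allowed_bad = (1 - slo) * total_events  # failures the SLO tolerates
    actual_bad = total_events - good_events
    return 1 - actual_bad / allowed_bad


# Hypothetical window: 1,000,000 requests, 400 of which failed.
print(availability_sli(999_600, 1_000_000))
print(error_budget_remaining(999_600, 1_000_000, slo=0.999))
```

With a 99.9% SLO, one million requests allow about 1,000 failures, so 400 failures leave about 60% of the budget unspent.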

Let’s say you picked an SLO of 99.9% – which is, based on my personal experience, the most common go-to for businesses these days. You now have to consider your core operational metrics. DevOps Research and Assessment (DORA) suggests four key metrics that indicate the performance of a DevOps team, ranking them from “low” to “elite,” where “elite” teams are more likely to meet or even exceed their goals and delight their customers compared to “low” ranking teams.

These four metrics are as follows:

  • Lead time for change, a metric that quantifies the duration from code commit to production deployment, is in my view one of the most crucial indicators. It serves as a measure of your team’s agility and responsiveness. How swiftly can you resolve a bug? Think about it this way:
    • Low-performing: 1 month to 6 months of lead time
    • Medium-performing: 1 week to 1 month of lead time
    • High-performing: 1 day to 1 week of lead time
    • Elite-performing: Less than 1 day of lead time
  • Deployment frequency measures how often releases are successfully deployed to production. The key word here is successful, as a Dev team that constantly pushes broken code through the pipeline is not great:
    • Low-performing: 1 month to 6 months between deployments
    • Medium-performing: 1 week to 1 month between deployments
    • High-performing: 1 day to 1 week between deployments
    • Elite-performing: Multiple deployments per day/less than 1 day between deployments
  • Change failure rate (CFR) measures the percentage of deployments that result in a failure in production that requires a bug fix or rollback. The goal is to release as frequently as possible, but what is the point if your team is constantly rolling back those changes, or causing an incident by releasing a bad update? By tracking it, you can see how often your team is fixing something that could have been avoided:
    • Low-performing: 45% to 60% CFR
    • Medium-performing: 15% to 45% CFR
    • High- and elite-performing: 0% to 15% CFR
  • Mean time to restore (MTTR) measures how long it takes an organization to recover from a failure. This is measured from the initial moment of an outage until the incident team has recovered all services and operations. Another key and related metric is mean time to acknowledge (MTTA), which measures the time it takes to be aware of and confirm an issue in production:
    • Low-performing: 1 week to 1 month of downtime
    • Medium- and high-performing: Less than 24 hours of downtime
    • Elite-performing: Less than 1 hour of downtime
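The four metrics and their bands can be sketched in code. The example below is a minimal illustration, not a DORA reference implementation: the `Deployment` record and all function names are hypothetical, a month is assumed to be 30 days, and any value that falls outside the ranges listed above (for example, an MTTR between 24 hours and 1 week) is assigned to the “low” band.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean

MONTH = timedelta(days=30)  # assumption: the text does not define a month's length


@dataclass
class Deployment:
    committed_at: datetime
    deployed_at: datetime
    failed: bool = False  # True if it needed a bug fix or rollback in production


def mean_lead_time(deployments):
    """Average time from code commit to production deployment."""
    seconds = mean((d.deployed_at - d.committed_at).total_seconds() for d in deployments)
    return timedelta(seconds=seconds)


def change_failure_rate(deployments):
    """Percentage of deployments that failed in production."""
    return 100 * sum(d.failed for d in deployments) / len(deployments)


def interval_tier(duration):
    """Band for a lead time or for the time between deployments."""
    if duration < timedelta(days=1):
        return "elite"
    if duration <= timedelta(weeks=1):
        return "high"
    if duration <= MONTH:
        return "medium"
    return "low"


def cfr_tier(percent):
    """Band for a change failure rate given as a percentage."""
    if percent <= 15:
        return "high/elite"  # the listed ranges for these two bands are identical
    if percent <= 45:
        return "medium"
    return "low"


def mttr_tier(duration):
    """Band for the time taken to restore service."""
    if duration < timedelta(hours=1):
        return "elite"
    if duration < timedelta(hours=24):
        return "medium/high"
    return "low"  # assumption: anything of 24 hours or more counts as low


# Hypothetical deployment history.
history = [
    Deployment(datetime(2024, 1, 1, 9), datetime(2024, 1, 1, 15)),
    Deployment(datetime(2024, 1, 2, 9), datetime(2024, 1, 3, 9), failed=True),
    Deployment(datetime(2024, 1, 4, 9), datetime(2024, 1, 4, 12)),
]
print(interval_tier(mean_lead_time(history)), cfr_tier(change_failure_rate(history)))
```

Keeping the band boundaries in small, separate functions makes it easy to adjust them if your organization uses different thresholds.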

In conclusion, SLOs are crucial in setting reliability targets for a service, with a recommendation for these to be under 100% to account for dependencies and potential service failures. Utilizing tools such as SLIs, SLAs, and error budgets is essential in defining the appropriate SLO baseline, usually around or above 99%. We have also highlighted the importance of core operational metrics, as suggested by DORA, in assessing the performance of a DevOps team. These metrics, including lead time for change, deployment frequency, change failure rate, and MTTR, provide tangible criteria to measure and improve a team’s efficiency and effectiveness in service delivery and incident response.
