Objectives – SLOs/SLIs
Service-level objectives (SLOs), a concept referenced many times in Google's SRE handbook, can be a great help in setting your direction from the start. Choosing the right objective, however, can be trickier than you might think.
My personal experience aligns with Google's recommendation that an SLO – which sets the reliability target for a service as experienced by its customers – should be under 100%.
This is for multiple reasons. Achieving 100% is not just very hard and extremely expensive, but almost impossible, given that nearly all services have soft or hard dependencies on other services. If just one of your hard dependencies offers less than 100% availability, your SLO cannot be met. Also, even with every precaution you can take, and every redundancy in place, there is a non-zero probability that something (or many things) will fail, resulting in less than 100% availability. More importantly, even if you could achieve 100% reliability for your services, your customers would very likely not experience it: the path they must take (the systems they have to use) to reach your services is itself likely to offer less than 100% availability.
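To make the dependency argument concrete, here is a minimal Python sketch, assuming hard dependencies that fail independently and in series (the availability figures are invented for illustration):

```python
# A minimal sketch, assuming serial, independently failing hard
# dependencies; a simplification, but a useful best-case bound.

def composite_availability(dependencies: list[float]) -> float:
    """Best-case availability of a service given the availabilities
    (as fractions) of its hard dependencies."""
    result = 1.0
    for availability in dependencies:
        result *= availability
    return result

# Even with a perfectly reliable service of your own, three dependencies
# at 99.9% each already cap you below 99.9%:
print(f"{composite_availability([0.999, 0.999, 0.999]):.4%}")  # 99.7003%
```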
Most commercial internet providers, for example, offer 99% availability. This also means that as you go higher and higher, say from 99% to 99.9%, or to the extreme five nines (99.999%) that providers such as IBM advertise, the cost of achieving and maintaining that availability grows sharply with each "nine" you add, while your customers experience less and less of your efforts – which makes the objective questionable.
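The following back-of-the-envelope Python sketch shows the downtime each SLO actually permits per year; the allowance shrinks roughly tenfold with every added nine, while the engineering cost moves the other way:

```python
# A rough illustration of what each extra "nine" buys: the downtime an
# SLO permits per year drops roughly tenfold with every nine added.

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

for slo in (0.99, 0.999, 0.9999, 0.99999):
    allowed_downtime = (1 - slo) * MINUTES_PER_YEAR
    print(f"{slo * 100:g}% -> {allowed_downtime:,.1f} minutes of downtime per year")
# 99%     -> 5,256.0 minutes (about 3.7 days)
# 99.9%   ->   525.6 minutes (about 8.8 hours)
# 99.99%  ->    52.6 minutes
# 99.999% ->     5.3 minutes
```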
Above the selected SLO threshold, almost all users should be “happy,” and below this threshold, users are likely to be unhappy, raise concerns, or just stop using the service.
Once you've agreed that you should aim for an SLO of less than 100%, but likely somewhere around or above 99%, how do you define the right baseline?
This is where service-level indicators (SLIs), service-level agreements (SLAs), and error budgets come into play. I will not detail all of these here, but if you are interested, please refer to Google’s SRE book (https://sre.google/books/) for more details on the subject.
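To give a taste of how these pieces fit together, here is an illustrative, request-based SLI and error-budget calculation in Python; the 99.9% SLO anticipates the discussion below, and all request counts are invented:

```python
# A hand-wavy, request-based SLI example: SLI = good events / total events.
# The SLO matches the 99.9% discussed below; the counts are made up.

SLO = 0.999
total_requests = 1_000_000
failed_requests = 850

sli = (total_requests - failed_requests) / total_requests  # fraction of "good" events
error_budget = (1 - SLO) * total_requests                  # failures the SLO tolerates
remaining_budget = error_budget - failed_requests

print(f"SLI: {sli:.4%}")                                     # SLI: 99.9150%
print(f"Error budget: {error_budget:.0f} failed requests")   # 1000
print(f"Budget remaining: {remaining_budget:.0f} requests")  # 150
```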
Let's say you picked an SLO of 99.9% – which is, based on my personal experience, the most common go-to for businesses these days. You now have to consider your core operational metrics. DevOps Research and Assessment (DORA) suggests four key metrics that indicate the performance of a DevOps team, ranking teams from "low" to "elite," where "elite" teams are more likely to meet or even exceed their goals and delight their customers than "low"-ranking teams.
These four metrics are as follows (a small classification sketch in Python follows the list):
- Lead time for change, a metric that quantifies the duration from code commit to production deployment, is in my view one of the most crucial indicators. It serves as a measure of your team’s agility and responsiveness. How swiftly can you resolve a bug? Think about it this way:
- Low-performing: 1 month to 6 months of lead time
- Medium-performing: 1 week to 1 month of lead time
- High-performing: 1 day to 1 week of lead time
- Elite-performing: Less than 1 day of lead time
- Deployment frequency, which measures how often successful releases reach production. The key word here is successful, as a Dev team that constantly pushes broken code through the pipeline is not great:
- Low-performing: 1 month to 6 months between deployments
- Medium-performing: 1 week to 1 month between deployments
- High-performing: 1 day to 1 week between deployments
- Elite-performing: Multiple deployments per day/less than 1 day between deployments
- Change failure rate (CFR), which measures the percentage of deployments that result in a failure in production requiring a bug fix or rollback. The goal is to release as frequently as possible, but what is the point if your team is constantly rolling back those changes or causing incidents by releasing bad updates? By tracking this rate, you can see how often your team is fixing something that could have been avoided:
- Low-performing: 45% to 60% CFR
- Medium-performing: 15% to 45% CFR
- High-performing: 0% to 15% CFR
- Elite-performing: 0% to 15% CFR
- Mean time to restore (MTTR) measures how long it takes an organization to recover from a failure. This is measured from the initial moment of an outage until the incident team has recovered all services and operations. Another key and related metric is mean time to acknowledge (MTTA), which measures the time it takes to be aware of and confirm an issue in production:
- Low-performing: 1 week to 1 month of downtime
- Medium- and high-performing: Less than 24 hours of downtime
- Elite-performing: Less than 1 hour of downtime
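As promised above, here is a small, illustrative Python sketch that maps raw measurements onto these tiers; the thresholds are copied from the bands in this section, while the function names and structure are my own and not part of DORA itself:

```python
# Illustrative tier classification using the bands listed above.
# (Deployment frequency follows the same day/week/month bands as lead time.)

def lead_time_tier(days: float) -> str:
    if days < 1:
        return "elite"
    if days <= 7:
        return "high"
    if days <= 30:
        return "medium"
    return "low"

def change_failure_rate_tier(cfr: float) -> str:
    if cfr <= 0.15:
        return "high or elite"   # the 0-15% band covers both tiers above
    if cfr <= 0.45:
        return "medium"
    return "low"

def time_to_restore_tier(hours: float) -> str:
    if hours < 1:
        return "elite"
    if hours < 24:
        return "medium or high"  # this section groups these together
    return "low"

# Example: a team shipping within a day, with a 10% CFR and 3-hour MTTR:
print(lead_time_tier(days=0.8))            # elite
print(change_failure_rate_tier(cfr=0.10))  # high or elite
print(time_to_restore_tier(hours=3))       # medium or high
```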
In conclusion, SLOs are crucial in setting reliability targets for a service, and the recommendation is to keep them under 100% to account for dependencies and potential failures. Tools such as SLIs, SLAs, and error budgets are essential in defining the appropriate SLO baseline, usually around or above 99%. We have also highlighted the importance of the core operational metrics suggested by DORA in assessing the performance of a DevOps team. These metrics – lead time for change, deployment frequency, change failure rate, and MTTR – provide tangible criteria for measuring and improving a team's efficiency and effectiveness in service delivery and incident response.