You're reading from Becoming a Rockstar SRE: Electrify your site reliability engineering mindset to build reliable, resilient, and efficient systems

Product type Paperback

Published in Apr 2023

Publisher Packt

ISBN-13 9781803239224

Length 420 pages

Edition 1st Edition

Languages

Python

Tools

Argo CD

Concepts

DevOps

Authors (2):

Jeremy Proffitt

Rod Anami

Understanding the mindset and hobbies of an SRE

Site reliability engineers often take a broader, divergent view of their surroundings. We are not saying that SREs are weird; well, they are in a certain sense, as they relentlessly seek to improve reliability in all things. Rather, we are referring to their mindset and how they approach the world.

In this section, we will explore different aspects of their thought process in the work environment and what they like to do on the job and outside it. We have divided this topic into three sections:

SRE affinity game
SRE guiding principles
SRE hobbies

You may have asked yourself whether site reliability engineering is the right profession for you. Let’s examine that next.

SRE affinity game

Let’s play a game! What do you think your affinity or compatibility is with the site reliability engineering profession? We will present a series of scenarios that SREs face. Answer each one with love, like, dislike, or hate, indicating how much you see yourself doing it and how you would feel about it. Try to be as honest as possible.

Disclaimer

This is not an anthropological scientific survey based on a human behavioral model or theory by any means. It’s a simple questionnaire to help you understand your own affinity to the SRE job role.

The scenarios are in the following list. Get a piece of paper, write down each scenario number, and record your answer next to it. Good luck!

1. Your boss asks you to resolve a problem that no one else has ever resolved.
2. You need to spend a few hours looking through logs, metrics, graphs, and events to verify whether there are any new anomalies that were not detected automatically.
3. You need to participate in an on-call rotation or schedule where you might be called late at night to respond to a service disruption that has a business impact.
4. You need to work on a backend system or software that is not visible to external users.
5. You need to devise new ways to increase a large system’s overall reliability.
6. You are asked to work on a large-scale problem in a system that runs on a hybrid multi-cloud environment, affects hundreds of users, and has dozens of components and dependencies.
7. You are diagnosing a system problem that is making users from a certain geography unable to access their services, and there is great pressure on you.
8. You need to approach problems with a selected scientific method or data model to uncover facts instead of guessing.
9. You constantly ask yourself how you could make things around you better and more reliable.
10. You need to classify and categorize systems information and functionalities so you can isolate causes from effects.
11. You must diagnose and fix a system problem without a debugging mode, investigating components that are not usually visible by going deep into each component’s configuration.
12. You need to design a detailed diagram of how the user interacts with a system or software so you can point out where to look for symptoms.

After you complete this exercise, assign points to each of your answers. If you answered a scenario with love, assign 5 points to it. A like answer earns 3 points, dislike is worth 0, and hate is worth -3 (negative!). Sum your points across all 12 scenarios to get your score, and check the result against the following list:

Over 34 points: Your affinity is very high; this is the right career for you
From 21 to 34 points: Your affinity is high; you should consider this profession
From 13 to 20 points: Your affinity is medium; this may be a good job role for you
Below 13 points: SRE may not be your best option
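
If you would rather let a script do the arithmetic, the scoring rules above can be sketched in a few lines of Python. This is only an illustration: the function names and the short level labels (including "low" for the last band) are our own shorthand, not terms from the game itself.

```python
# Point values as stated in the scoring rules: love=5, like=3, dislike=0, hate=-3.
POINTS = {"love": 5, "like": 3, "dislike": 0, "hate": -3}


def affinity_score(answers):
    """Sum the points for a list of 12 answers, one per scenario."""
    if len(answers) != 12:
        raise ValueError("expected 12 answers, one per scenario")
    return sum(POINTS[answer.lower()] for answer in answers)


def affinity_level(score):
    """Map a score to one of the four bands (labels are hypothetical shorthand)."""
    if score > 34:
        return "very high"
    if score >= 21:
        return "high"
    if score >= 13:
        return "medium"
    return "low"  # "SRE may not be your best option"


# Example sheet: 4 love, 5 like, 2 dislike, 1 hate.
answers = ["love"] * 4 + ["like"] * 5 + ["dislike"] * 2 + ["hate"]
score = affinity_score(answers)
print(score, affinity_level(score))  # prints: 32 high
```

The maximum possible score is 60 (love on all 12 scenarios), and a single hate answer costs 8 points relative to a love answer.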

This may be a game, but it will have made you imagine yourself in an SRE’s shoes. We have started to understand the SRE mindset, so let’s check what guides them in the convoluted scenarios listed previously.

SRE guiding principles

Everyone has a set of principles (and values) that acts as their compass. SREs are no different; they embrace guiding principles that inform their technical decisions and act as a reliability compass.

Google® coined most of those principles in its site reliability engineering books (https://sre.google/books/), but others appeared later in conference sessions at SREcon (https://www.usenix.org/srecon) and blog posts on many websites.

Again, we have selected some of them as canonical guiding principles based on our experience in assisting customers and organizations in enabling site reliability engineering in their IT shops. The following is the set of guiding principles that are rooted in the SRE persona:

Scalable operations
Engineering fidelity
Observability to the core
Well-designed service levels
User-perspective notification trigger
Blameless postmortems
Simplicity

Note that these principles are guidelines, not procedures or prescriptive instructions. Don’t worry if you are not familiar with the terminology used here; we dig into each principle in detail throughout the book. Let’s investigate each of them along with their most familiar patterns and anti-patterns.

Scalable operations

The operations team, which includes site reliability engineers, is responsible for managing production systems. They are the first responders to any service disruption. The scalable operations principle states that this team will not grow in proportion to the system as its load increases. In other words, if the number of active users for a given service doubles, the operations team size will not double. A more mathematically accurate way to visualize this is through a logarithmic growth curve. As the operations team gains technical maturity, eliminates repetitive manual tasks, and adopts automation broadly, it needs proportionally fewer resources to manage more system load:

Figure 1.2 – A logarithmic growth curve
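
To make the shape of that curve concrete, here is a toy Python model. It is a sketch under our own assumptions, not a staffing formula: the function name and every parameter (`base_users`, `base_team`, `engineers_per_doubling`) are hypothetical values chosen only to show that each doubling of users adds a fixed increment to the team rather than doubling it.

```python
import math


def team_size(active_users, base_users=10_000, base_team=4, engineers_per_doubling=1):
    """Hypothetical logarithmic model of operations team size.

    The team starts at base_team for base_users and grows by a fixed
    number of engineers every time the user count doubles.
    """
    if active_users < base_users:
        return float(base_team)
    doublings = math.log2(active_users / base_users)
    return base_team + engineers_per_doubling * doublings


for users in (10_000, 20_000, 40_000, 80_000):
    print(users, team_size(users))
```

With these made-up numbers, growing from 10,000 to 80,000 users (an eightfold increase) takes the team from 4 to 7 engineers, whereas proportional growth would require 32.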

It is worth mentioning that SREs employ a proactive approach as they strive to identify the root cause of issues and devise solutions to detect or prevent problems. The patterns for this principle are as follows:

Identify and eliminate toil whenever possible
Document operational procedures as runbooks
Train operations teams to use and refine runbooks
Adopt automation platforms broadly and automate the procedures documented in runbooks