You're reading from Becoming a Rockstar SRE Electrify your site reliability engineering mindset to build reliable, resilient, and efficient systems

Product type Paperback

Published in Apr 2023

Publisher Packt

ISBN-13 9781803239224

Length 420 pages

Edition 1st Edition

Languages

Python

Tools

Argo CD

Concepts

DevOps

Authors (2):

Jeremy Proffitt

Rod Anami

DevOps engineers versus SRE versus others

This is one of the most frequently asked questions we receive from customers and organizations: how does the site reliability engineering profession differ from other existing technical roles? We already talked about how SREs are the connection between the different steps of the solution life cycle. Here, we’ll focus our discussion on the DevOps engineer role, and later, we’ll broaden it. We have split this discussion into two sections:

DevOps and site reliability engineers
Software and site reliability engineers

DevOps and site reliability engineers

Google described the relationship between DevOps and SRE with a famous subtitle in The Site Reliability Workbook:

Class SRE implements interface DevOps

This statement is an elegant way to define the link, and it borrows from Java programming, where a class supplies the concrete behavior for the methods that an interface only declares. It implies that site reliability engineering is a concrete implementation of what DevOps prescribes, and it follows that site reliability engineering has much in common with DevOps. However, what exactly does site reliability engineering implement from DevOps, and what are the differences between a site reliability engineer and a DevOps engineer? We have visualized these similarities and divergences in an infographic as follows:

Figure 1.3 – An infographic on SRE and DevOps

Notice that the two roles share values: both SREs and DevOps engineers need the values listed in the orange box (bottom right of the infographic). The bottom-left table shows how the roles differ. Typically, site reliability engineers resolve operational problems by applying software engineering disciplines. DevOps engineers, on the other hand, resolve development and delivery pipeline issues with systems management techniques, mainly automation and infrastructure-as-code. The two roles also concentrate different levels of effort on distinct phases of the solution life cycle, as depicted in the infographic.

It is common to hear that DevOps is a shift-right transformation while site reliability engineering is a shift-left one. In other words, DevOps starts on the left (the development side of the equation) and moves toward the right (the operations side), whereas site reliability engineering moves in the opposite direction. Another term we hear a lot is DevSecOps, which adds security to the name. Since security has always been implied in these roles, we think including new letters in the middle is confusing and redundant.

SREs and DevOps engineers are, in our opinion, different sides of the same coin. They should be more like best friends forever than opposing roles as they share values. Let’s check how SREs fulfill those values from the five main areas of DevOps:

Reduce organizational silos: SREs use the same tooling as developers or DevOps engineers. They also share objectives and performance metrics with them.
Accept failure as normal: SREs embrace risk by spending an error budget on new features. They quantify failure through service-level indicators (SLIs) and service-level objectives (SLOs), and they run postmortems in a blameless culture.
Implement gradual changes: SREs work to increase reliability, and more reliable systems allow more frequent changes and releases.
Leverage tooling and automation: SREs eliminate toil by automating operational tasks at a constant pace.
Measure everything: SREs measure reliability by collecting metrics, events, logs, and traces (MELT) and building observability on that data. They also have ways to identify and size toil.
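
To make the "class SRE implements interface DevOps" idea concrete, here is a minimal sketch in Python. Python has no interface keyword, so an abstract base class stands in for the Java interface. The method names and the returned descriptions are our own paraphrases of the five areas above, chosen for illustration only; they are not taken from Google's materials.

```python
from abc import ABC, abstractmethod


class DevOps(ABC):
    """Stand-in for the 'interface': it declares the five areas, nothing more."""

    @abstractmethod
    def reduce_organizational_silos(self) -> str: ...

    @abstractmethod
    def accept_failure_as_normal(self) -> str: ...

    @abstractmethod
    def implement_gradual_changes(self) -> str: ...

    @abstractmethod
    def leverage_tooling_and_automation(self) -> str: ...

    @abstractmethod
    def measure_everything(self) -> str: ...


class SRE(DevOps):
    """One concrete implementation; each method summarizes the matching list item."""

    def reduce_organizational_silos(self) -> str:
        return "share tooling, objectives, and performance metrics with developers"

    def accept_failure_as_normal(self) -> str:
        return "use error budgets, SLIs and SLOs, and blameless postmortems"

    def implement_gradual_changes(self) -> str:
        return "raise reliability so that changes and releases can be more frequent"

    def leverage_tooling_and_automation(self) -> str:
        return "eliminate toil by automating operational tasks"

    def measure_everything(self) -> str:
        return "collect MELT data and size toil"


if __name__ == "__main__":
    sre = SRE()
    # Print each area next to the way this SRE class fulfills it.
    for area in (
        sre.reduce_organizational_silos,
        sre.accept_failure_as_normal,
        sre.implement_gradual_changes,
        sre.leverage_tooling_and_automation,
        sre.measure_everything,
    ):
        print(f"{area.__name__}: {area()}")
```

The abstract class cannot be instantiated on its own, which mirrors the point of the metaphor: DevOps states what should be done, and a role such as SRE supplies one way of doing it.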

Software and site reliability engineers

Another frequently asked question is how site reliability engineers differ from software engineers (SWEs). The short answer is simple: they have the same core skills but different work scopes.

What are SWEs? SWEs design, engineer, and architect applications using modeling languages and requirements analysis techniques. They work in an integrated development environment (IDE) and develop code for use cases in one of the many available programming languages. They create test cases and testing suites. They also integrate software and service components and handle their dependencies. SWEs work with many software development life cycle tools and processes.

Site reliability engineers may perform the same activities, but they do so with the intent of improving reliability. For instance, when SREs develop code, they are far more likely to instrument the application so that it generates more logs than to code a new use case. SREs also treat operations as a software problem and see daily systems management tasks as opportunities to write software. Besides that, SREs have other core skills relating to systems thinking, systems management, and data science.
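
As an illustration of that difference, the sketch below contrasts a use case with an instrumented wrapper around it, using only Python's standard `logging` and `time` modules. The function names, the `checkout` logger name, and the log fields are hypothetical and exist only for this example.

```python
import logging
import time

logger = logging.getLogger("checkout")  # hypothetical service name


def charge_card(amount_cents: int) -> bool:
    """The use case an SWE might write: reject non-positive amounts."""
    if amount_cents <= 0:
        raise ValueError("amount must be positive")
    return True


def charge_card_instrumented(amount_cents: int) -> bool:
    """The same use case, wrapped so it logs its outcome and latency."""
    start = time.perf_counter()
    try:
        result = charge_card(amount_cents)
    except Exception:
        # logger.exception logs at ERROR level and attaches the traceback.
        logger.exception("charge failed amount_cents=%d", amount_cents)
        raise
    elapsed_ms = (time.perf_counter() - start) * 1000
    logger.info("charge ok amount_cents=%d elapsed_ms=%.2f", amount_cents, elapsed_ms)
    return result


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    charge_card_instrumented(500)
```

The business logic is untouched; the wrapper only adds the signals that make the function observable in production, which is the kind of change the text attributes to SREs.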

Indeed, an SRE could become an SWE and vice versa, and that leads us to another principle that we find in the Google materials.

Common staffing pool

Another principle is hiring site reliability engineers and SWEs from the same staffing pool. This principle works well for companies where most employees are software developers and engineers, and having a shared pool means that site reliability and software engineering job roles are interchangeable. However, this principle may be much more challenging for enterprises with a mix of systems administrators and developers. Hence, we left it out of our list in the previous section.

We could compare the SRE’s unique profession to many others, but we limited this topic to the most common comparisons. SREs are not architects, developers, systems administrators, or data scientists; they are more than all of these roles combined. Up next, we are going to understand the primary responsibilities of an SRE.
