You're reading from Becoming a Rockstar SRE Electrify your site reliability engineering mindset to build reliable, resilient, and efficient systems

Product type Paperback

Published in Apr 2023

Publisher Packt

ISBN-13 9781803239224

Length 420 pages

Edition 1st Edition

Authors (2):

Jeremy Proffitt

Rod Anami

An overview of the daily activities of an SRE

Now that we have examined SRE responsibilities, it’s time to look at what you, as an SRE, should be doing on a regular basis. There’s no better way to understand a profession than to ask what the people in it actually do. When you go to a job interview, you probably want to know which activities a person in that position carries out. SREs often keep their list of assignments as sticky notes on their displays. We have separated the most notable activities into two sections:

Reactive work activities
Proactive work activities

We’ll start by understanding reactive activities.

Reactive work activities

SREs execute many tasks that don’t lift (or shift) system reliability directly; these are usually operational types of work. Nevertheless, those activities either reduce service downtime or mitigate risks. Examples of jobs that SREs perform daily in this category are as follows:

Repair or restore a system or multiple services to their original state
Follow and execute instructions from a runbook (standard operating procedure) during an incident to diagnose the application
Implement a change request to apply a patch to a software component
Attend a meeting to run a postmortem with system administrators and developers about the recent service or system outage
Install a new Kubernetes cluster for a new application according to the development team’s specifications and enable monitoring of it
Configure a new cloud-based service for a new application following the architecture design and include it in cloud monitoring
Deploy a new software release to VMs and execute the testing scripts

Proactive work activities

SREs also carry out jobs that improve the quality, scalability, observability, manageability, resiliency, or availability of a system or service. Since those tasks increase the reliability levels of specific systems or services, they are considered proactive and are mostly engineering-type work. Such assignments typically reduce toil and technical debt. Examples of this category are as follows:

Maintain a runbook on how to diagnose problems with a specific application
Design and develop automation that executes procedures previously documented in a runbook
Establish, together with the DevOps team, the release strategy, such as a canary release, A/B testing, or blue-green deployment
Work with the software engineers (SWEs) to add management code to the application so SREs can instruct the application to perform self-administration or self-healing operations
Work with the development team to adopt an immutable infrastructure philosophy into the application-building process
Instrument the application code to increase its observability with logs and traces
Design and implement observability to obtain good metrics, events, logs, and traces from a critical application
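
To make two of the activities above more concrete (automating a documented runbook and instrumenting code with logs), here is a minimal sketch in Python. It is an illustration under stated assumptions, not a procedure from the book: the RunbookStep class, the run_runbook function, and the in-memory state dictionary are all hypothetical names invented for this example, and a real automation would call actual health checks and remediation tooling instead.

```python
import logging
from dataclasses import dataclass
from typing import Callable, List

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s %(message)s",
)
log = logging.getLogger("runbook")


@dataclass
class RunbookStep:
    """One documented runbook step: a check and the fix to apply if it fails."""
    name: str
    check: Callable[[], bool]      # returns True when this part is healthy
    remediate: Callable[[], None]  # action to run when the check fails


def run_runbook(steps: List[RunbookStep]) -> List[str]:
    """Run each step in order; return the names of steps that stay unhealthy."""
    unresolved = []
    for step in steps:
        if step.check():
            log.info("step=%s status=healthy", step.name)
            continue
        log.warning("step=%s status=unhealthy action=remediate", step.name)
        step.remediate()
        # Re-run the check to confirm the remediation actually worked.
        if step.check():
            log.info("step=%s status=recovered", step.name)
        else:
            log.error("step=%s status=unresolved action=escalate", step.name)
            unresolved.append(step.name)
    return unresolved


# Hypothetical in-memory "service" state, used only to make the sketch runnable.
state = {"disk_free_pct": 5, "worker_running": False}


def free_disk() -> None:
    state["disk_free_pct"] = 40  # simulated cleanup that succeeds


def restart_worker() -> None:
    pass  # simulated restart that does not fix the problem


steps = [
    RunbookStep("disk-space", lambda: state["disk_free_pct"] > 10, free_disk),
    RunbookStep("worker-process", lambda: state["worker_running"], restart_worker),
]

unresolved = run_runbook(steps)
print(unresolved)  # ['worker-process']
```

In this sketch, the disk-space step recovers after remediation, while the worker-process step remains unhealthy and is returned so that a human can take over. The key=value log lines are one possible convention for making the automation's own behavior observable.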

Note

Site reliability engineers perform many more activities than the ones listed here. This is not a comprehensive list; the only intention is to show you how SREs work across multiple dimensions and aspects of systems and services.

We have listed what an SRE does frequently. We wanted to give you a good sense of an SRE's day-to-day activities and how they differ from those of other roles. Again, this is not a complete or closed list. We want to close this chapter by telling you who our SRE rockstars are.
