Search icon CANCEL
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Conferences
Free Learning
Arrow right icon
Arrow up icon
GO TO TOP
Becoming a Rockstar SRE

You're reading from   Becoming a Rockstar SRE Electrify your site reliability engineering mindset to build reliable, resilient, and efficient systems

Arrow left icon
Product type Paperback
Published in Apr 2023
Publisher Packt
ISBN-13 9781803239224
Length 420 pages
Edition 1st Edition
Languages
Tools
Concepts
Arrow right icon
Authors (2):
Arrow left icon
Jeremy Proffitt Jeremy Proffitt
Author Profile Icon Jeremy Proffitt
Jeremy Proffitt
Rod Anami L. Anami Rod Anami L. Anami
Author Profile Icon Rod Anami L. Anami
Rod Anami L. Anami
Arrow right icon
View More author details
Toc

Table of Contents (27) Chapters Close

Preface 1. Part 1 - Understanding the Basics of Who, What, and Why
2. Chapter 1: SRE Job Role – Activities and Responsibilities FREE CHAPTER 3. Chapter 2: Fundamental Numbers – Reliability Statistics 4. Chapter 3: Imperfect Habits – Duct Tape Architecture and Spaghetti Code 5. Part 2 - Implementing Observability for Site Reliability Engineering
6. Chapter 4: Essential Observability – Metrics, Events, Logs, and Traces (MELT) 7. Chapter 5: Resolution Path – Master Troubleshooting 8. Chapter 6: Operational Framework – Managing Infrastructure and Systems 9. Chapter 7: Data Consumed – Observability Data Science 10. Part 3 - Applying Architecture for Reliability
11. Chapter 8: Reliable Architecture – Systems Strategy and Design 12. Chapter 9: Valued Automation – Toil Discovery and Elimination 13. Chapter 10: Exposing Pipelines – GitOps and Testing Essentials 14. Chapter 11: Worker Bees – Orchestrations of Serverless, Containers, and Kubernetes 15. Chapter 12: Final Exam – Tests and Capacity Planning 16. Part 4 - Mastering the Outage Moments
17. Chapter 13: First Thing – Runbooks and Low Noise Outage Notifications 18. Chapter 14: Rapid Response – Outage Management Techniques 19. Chapter 15: Postmortem Candor – Long-Term Resolution 20. Part 5 - Looking into Future Trends and Preparing for SRE Interviews
21. Chapter 16: Chaos Injector – Advanced Systems Stability 22. Chapter 17: Interview Advice – Hiring and Being Hired 23. Index 24. Other Books You May Enjoy Appendix A – The Site Reliability Engineer Manifesto 1. Appendix B – The 12-Factor App Questionnaire

Understanding the mindset and hobbies of an SRE

It’s not rare for site reliability engineers to have a broader and divergent view of their surroundings. We are not saying that SREs are weird; well, they are in a certain sense, as they employ a relentless search for improving reliability in all things. However, we are referring to their mindset and how they approach the world.

In this section, we will explore different aspects of their thought process in the work environment and what they like to do in the job and outside it. We have divided this topic into three sections:

  • SRE affinity game
  • SRE guiding principles
  • SRE hobbies

You may have asked whether site reliability engineering is the right profession for you. Let’s examine that next.

SRE affinity game

Let’s play a game! What do you think your affinity or compatibility is with the site reliability engineering profession? We will present a series of scenarios that SREs face. You need to answer them with either love, like, dislike, or hate indicating how much you see yourself doing it and how you would feel about it. Try to be as honest as possible.

Disclaimer

This is not an anthropological scientific survey based on a human behavioral model or theory by any means. It’s a simple questionnaire to help you understand your own affinity to the SRE job role.

The scenarios are in the following list. Get a piece of paper, write down the question number, and answer it. Good luck!

  1. Your boss asks you to resolve a problem that no one else has ever resolved.
  2. You need to spend a few hours looking through logs, metrics, graphs, and events to verify whether there are any new anomalies that were not detected automatically.
  3. You need to participate in an on-call rotation or schedule where you might be called late in the night to respond to a service disruption that has a business impact.
  4. You need to work on a backend system or software that is not visible to external users.
  5. You need to devise new ways to increase a large system’s overall reliability.
  6. You are asked to work on a large-scale problem, which affects hundreds of users and has dozens of components and dependencies, that runs on a hybrid multi-cloud environment.
  7. You are diagnosing a system problem that is making users from a certain geography unable to access their services, and there is great pressure on you.
  8. You need to approach problems with a selected scientific method or data model to uncover facts instead of guessing.
  9. You constantly ask yourself how you could make things around you better and more reliable.
  10. You need to classify and categorize systems information and functionalities so you can isolate causes from effects.
  11. You must diagnose and fix a system problem by investigating components that are not usually visible by going deep into each component configuration as debugging mode is not available.
  12. You need to design a detailed diagram of how the user interacts with a system or software so you can point out where to observe for symptoms.

After you complete this exercise, assign points to each of the answers. If you replied to a scheme with a love answer, assign 5 points to it. For a like answer, you get 3 points. Dislike has a value of 0, and hate is -3 (negative!). Sum your points across all 12 scenarios to get your score, and check the result against the following list:

  • Over 34 points: Your affinity is very high; this is the right career for you
  • From 21 to 34 points: Your affinity is high; you should consider this profession
  • From 13 to 20 points: Your affinity is medium; this may be a good job role for you
  • Below 13 points: SRE may not be your best option

This may be a game, but it will have made you imagine yourself in an SRE’s shoes. We have started to understand the SRE mindset, so let’s check what guides them in the convoluted scenarios listed previously.

SRE guiding principles

Everyone has a conjunction of principles (and values) that acts as their compass. SREs also follow a set of values; they embrace guiding principles to advise them on technical decisions and act as a reliability compass.

Google® coined most of those principles in its site reliability engineering books (https://sre.google/books/), but others appeared later in conference sessions at SREcon (https://www.usenix.org/srecon) and blog posts on many websites.

Again, we have selected some of them as canonical guiding principles based on our experience in assisting customers and organizations in enabling site reliability engineering in their IT shops. The following is the set of guiding principles that are rooted in the SRE persona:

  • Scalable operations
  • Engineering fidelity
  • Observability to the core
  • Well-designed service levels
  • User-perspective notification trigger
  • Blameless postmortems
  • Simplicity

We must remark that such principles are not procedures or prescriptive instructions to accomplish something but guidelines. Don’t worry if you are not familiar with the terminology applied here; we dig into them in a detailed manner throughout the book. Let’s investigate each of them along with their most familiar patterns and anti-patterns.

Scalable operations

The operations team, which includes site reliability engineers, is responsible for managing production systems. They are the first responders for any service disruption when something goes wrong. The scalable operations principle states that this team will not grow proportionally to the system as its load increases. Another way to say that is if the number of active users for the determined service doubles, the operations team size will not double. A more mathematically accurate way to visualize this is through a logarithm growth curve. As the operations team gains technical maturity, eliminates repetitive manual tasks, and adopts automation at large, they will need fewer resources to manage more system load:

Figure 1.2 – A logarithm growth curve

Figure 1.2 – A logarithm growth curve

It is worth mentioning that SREs employ a proactive approach as they strive to identify the root cause of issues and devise solutions to detect or prevent problems. The patterns for this principle are as follows:

  • Identify and eliminate toil whenever possible
  • Document operational procedures as runbooks
  • Train operations teams to use and refine runbooks
  • Adopt automation platforms and automated procedures documented in runbooks at large

The anti-patterns are as follows:

  • Have linear (or exponential) growth for operations teams when the system load rises
  • Operational knowledge is tacit or not documented
  • Automation is the end goal and not merely a way to eliminate toil

Engineering fidelity

This tenet asserts the obvious: site reliability engineers do engineering. Yet it’s not uncommon to see SREs only working on incident, problem, and change management processes. We are not telling you that site reliability engineers don’t get their hands dirty; on the contrary, they do operational and engineering work. This principle exists to guarantee that SREs will have time to excel in both.

The patterns are as follows:

  • Cap operational work at 50% of the available SRE time. The other half is dedicated to engineering solutions and increasing reliability.
  • Share some of the operational work with the development team. Sharing 5% of the operational work is usually recommended, so the development team is prepared to take on SRE work.
  • Send operational overflow work to the development team as they share the same goals.

The anti-patterns are as follows:

  • SREs only work on operational work, resolving incidents, implementing changes, and running root cause analysis (RCA)
  • SREs spend most of their time doing firefighting (incident resolution)
  • Development teams don’t share any responsibilities with the operations team

Observability to the core

Observability is the ability to comprehend the internal state of a system by inspecting its outputs. It extends the monitoring concept by adding layers to expand the system visibility and allows a more proactive posture by detecting anomalies before they become disruptions. This guiding principle craves visibility and discernability of what’s happening inside a system or application by measuring certain signals.

The patterns of observability are as follows:

  • Observe the system behavior through the golden signals; this can be either four (LETS) or five (STELA) signals depending on the school of thought you follow. The LETS acronym stands for latency, error rate, traffic, and saturation. STELA stands for saturation, traffic, error rate, latency, and availability.
  • Have monitoring metrics, events, logs, and traces (MELT) at the SRE’s disposal. These are the fundamental data components of any observability platform.
  • Run synthetic user testing from time to time. This is a method where a bot mimics a user to test system functionality and response times.

The anti-patterns are as follows:

  • Observe only the liveness of the system components, but not from the user’s perspective. For example, checking that components are running versus checking that users can use the system as designed.
  • Lack of user experience monitoring. You don’t have visibility of what’s happening in the user interface.

Well-designed service levels

There’s no way to verify whether a service is being delivered to the target user within the expected and agreed-to parameters without established service levels. Part of the undeniable success of site reliability engineering is due to this redefinition of what a good service level is and how we document it. This tenet aims to have not just well-defined service levels but also well-designed ones that measure the system’s reliability.

The patterns are as follows:

  • Define service-level indicators (SLIs) from the system user angle, then delineate service-level objectives (SLOs) as an aggregation of the former
  • Set the SLO target to less than 100%, so there’s some room for errors (error budget) between 100% and the SLO target to launch new features and enhance overall reliability
  • Establish service-level agreements (SLAs), with penalties and fines if they are not met after the measured SLOs
  • Improve SLOs and increase their targets over time through engineering work carried out by the site reliability engineering team

The anti-patterns are as follows:

  • Define the SLAs first, then measure the SLOs to see whether they are feasible with the current workings of the system.
  • Establish a target of 100% for the SLAs or SLOs. This anti-pattern reduces the team’s ability to release new features (or develop system reliability further) as there’s no space for testing them in production. Soon enough, the whole system will become obsolete or non-competitive in its market.

User-perspective notification trigger

Notification is the process of alerting on-call first responders about service or system performance deterioration or downtime. It translates to when a site reliability engineer must be engaged to resolve an incident. This principle states that triggering a notification of an issue should only happen when this issue is affecting the system user. For instance, we never alert an SRE if the CPU load is high, but the user is not feeling any service degradation.

The pattern is as follows:

  • Alerts are triggered if there are any symptoms at the user level, and if such warnings are actionable, SREs can resolve them

The anti-patterns are as follows:

  • Alerting noise. SREs cannot differentiate between alerts that are mere informative events and ones that affect the system user.
  • Lack of alerting. End users engage with the help desk to notify them that there are problems in the system.

Blameless postmortems

Postmortems are in essence root cause analysis (RCA) acts. They receive a peculiar name to avoid running into the same old pitfall: finding someone or something to be blamed (the root cause) rather than improving the system’s quality and learning from mistakes. Postmortems also focus on questions, such as how to detect, respond to, and repair disruption in the service faster than just uncovering the root cause alone. This tenet is one of the hardest to deploy for a new organization if it has been doing traditional RCAs for some time and requires a blameless culture to support it.

The pattern is as follows:

  • Infinite hows. Ask multiple questions, starting with the term how, to determine enhancements to the system (infrastructure and applications), processes, and knowledge base.

The anti-pattern is as follows:

  • Go back to traditional RCAs where no progress is made on reliability

Simplicity

This guiding principle was imported from the Agile Manifesto. We can’t explain it better than the manifesto (https://agilemanifesto.org/principles.html): “the art of maximizing the amount of work not done.” In other words, it dictates that site reliability engineers are always looking for ways of simplifying and avoiding unnecessary work. They are eager to eliminate toil, that is, repetitive, manually intensive, or low or no business-value tasks. However, inherently as humans, we tend to complicate everything, so ensuring runbooks are kept easy to observe and readable is a good example of this principle.

The pattern is as follows:

  • Keep it simple, stupid (KISS) is a proven design principle from the US Navy that says most systems work better if they are simple to use or follow

The anti-pattern is as follows:

  • Too elaborate processes for SRE work

We just explained our preferred seven guiding principles that site reliability engineers follow in their profession. They are an integral chunk of the SRE mindset. Let’s now cover what SREs do in their free time to overcome learning limitations.

SRE hobbies

Jeremy and I couldn’t agree more about what makes a site reliability engineer rockstar: their hobbies. What you do in your free time for leisure or as a second profession leads to greater levels of conceptual and practical knowledge. The trick is finding a hobby that you have a passion for and that helps in the SRE role.

We can’t tell you what the best-fit extra-curricular activity that will pump up your SRE skills is, but here we list some examples that may interest you, grouped by the skills that they enhance.

Analytical thinking

Site reliability engineers have a good analytical processing capacity. They need to analyze big amounts of data and detect patterns, trends, and anomalies by correlating different data sources. Some engaging hobbies that leverage your analytical thinking are as follows:

  • Chess: Without saying too much about it, this game has its own set of theories and algorithms. It is a good way to practice thinking multiple steps ahead while focusing on the present.
  • Board games: There are plenty of board games that make you analyze information to win. And they make it enjoyable and social.
  • Rubik’s cube: This fun toy is also a good example of simplicity in operation and shape. It presents a complex challenge with a plain design.
  • Video games: Strategy and role-playing games will train your mind in thinking analytically.

Creativity

SREs need to forge new algorithms for observability. They also need to construct new ways of measuring system reliability, as applications and infrastructure components have an uncountable number of arrangements, technologies, and architectures. This may sound cliché, but thinking outside the box is where site reliability engineers shine. Here are some hobbies that may help with your creativity:

  • Algorithms development: Although this may be part of your daily work life, you can find fun in it by developing 2D or 3D video games, for instance. Another option is to contribute to open source software projects in the wider community.
  • Drawing or painting: This is a relaxing and artistic example. It also gets you used to finding inspiration.
  • LEGO®: An across-the-globe famous construction toy, LEGO makes you think about new forms, shapes, structures, and ways of assembly. It also has a robotics range that gives you programming skills as well.
  • Internet-of-Things prototyping: How about developing embedded projects with Arduino® or Raspberry Pi® boards? You need to build both the hardware and firmware for the project to come together.
  • Video games: The ones where you need to build something with blocks and basic structures, such as Minecraft®, are especially useful in nurturing creativity.
  • 3D printing: Author Jeremy is a 3D printing master. Like painting, you can express your art in three dimensions.

Troubleshooting

Troubleshooting is not exclusive to site reliability engineers. Systems administrators must also figure out why a service or system is down and how to repair it. However, SREs use systems thinking and scientific approaches to troubleshoot differently. You need to train your mind to resolve problems logically and calmly, and you’re going to need it. There are plenty of hobbies that can stimulate you to excel in this area. Let’s list some of them:

  • Crossword or jigsaw puzzles: People are addicted to this type of entertainment. It’s an excellent choice to keep the mind sharp and trained
  • Sudoku: This was a trend not long ago, but it is still an excellent way to polish up troubleshooting skills.
  • Video games: The ones full of puzzles, such as Portal, as especially good for troubleshooting practice.

We have now covered the aspects of the site reliability engineer persona. Next, we will look at what makes site reliability engineering professionals unique by comparing them to other roles and listing responsibilities and activities.

You have been reading a chapter from
Becoming a Rockstar SRE
Published in: Apr 2023
Publisher: Packt
ISBN-13: 9781803239224
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at £16.99/month. Cancel anytime