Understanding the mindset and hobbies of an SRE
It’s not rare for site reliability engineers to have a broader and divergent view of their surroundings. We are not saying that SREs are weird; well, they are in a certain sense, as they employ a relentless search for improving reliability in all things. However, we are referring to their mindset and how they approach the world.
In this section, we will explore different aspects of their thought process in the work environment and what they like to do in the job and outside it. We have divided this topic into three sections:
- SRE affinity game
- SRE guiding principles
- SRE hobbies
You may have asked whether site reliability engineering is the right profession for you. Let’s examine that next.
SRE affinity game
Let’s play a game! What do you think your affinity or compatibility is with the site reliability engineering profession? We will present a series of scenarios that SREs face. You need to answer them with either love, like, dislike, or hate indicating how much you see yourself doing it and how you would feel about it. Try to be as honest as possible.
Disclaimer
This is not an anthropological scientific survey based on a human behavioral model or theory by any means. It’s a simple questionnaire to help you understand your own affinity to the SRE job role.
The scenarios are in the following list. Get a piece of paper, write down the question number, and answer it. Good luck!
- Your boss asks you to resolve a problem that no one else has ever resolved.
- You need to spend a few hours looking through logs, metrics, graphs, and events to verify whether there are any new anomalies that were not detected automatically.
- You need to participate in an on-call rotation or schedule where you might be called late in the night to respond to a service disruption that has a business impact.
- You need to work on a backend system or software that is not visible to external users.
- You need to devise new ways to increase a large system’s overall reliability.
- You are asked to work on a large-scale problem, which affects hundreds of users and has dozens of components and dependencies, that runs on a hybrid multi-cloud environment.
- You are diagnosing a system problem that is making users from a certain geography unable to access their services, and there is great pressure on you.
- You need to approach problems with a selected scientific method or data model to uncover facts instead of guessing.
- You constantly ask yourself how you could make things around you better and more reliable.
- You need to classify and categorize systems information and functionalities so you can isolate causes from effects.
- You must diagnose and fix a system problem by investigating components that are not usually visible by going deep into each component configuration as debugging mode is not available.
- You need to design a detailed diagram of how the user interacts with a system or software so you can point out where to observe for symptoms.
After you complete this exercise, assign points to each of the answers. If you replied to a scenario with a love answer, assign 5 points to it. For a like answer, you get 3 points. Dislike has a value of 0, and hate is -3 (negative!). Sum your points across all 12 scenarios to get your score, and check the result against the following list:
- Over 34 points: Your affinity is very high; this is the right career for you
- From 21 to 34 points: Your affinity is high; you should consider this profession
- From 13 to 20 points: Your affinity is medium; this may be a good job role for you
- Below 13 points: SRE may not be your best option
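If you prefer to let a script do the arithmetic, the scoring rules above can be sketched as follows. This is only an illustration: the function names and the sample answer sheet are hypothetical, while the point values and score bands are the ones listed above.

```python
# Point values taken from the scoring rules above.
POINTS = {"love": 5, "like": 3, "dislike": 0, "hate": -3}


def affinity_score(answers):
    """Sum the points for a list of answers, one per scenario."""
    return sum(POINTS[answer.lower()] for answer in answers)


def affinity_band(score):
    """Map a total score to the affinity bands listed above."""
    if score > 34:
        return "very high"
    if score >= 21:
        return "high"
    if score >= 13:
        return "medium"
    return "SRE may not be your best option"


# Hypothetical answer sheet for the 12 scenarios.
answers = ["love"] * 4 + ["like"] * 5 + ["dislike"] * 2 + ["hate"]
score = affinity_score(answers)
print(score, affinity_band(score))
```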
This may be a game, but it will have made you imagine yourself in an SRE’s shoes. We have started to understand the SRE mindset, so let’s check what guides them in the convoluted scenarios listed previously.
SRE guiding principles
Everyone has a set of principles (and values) that acts as their compass. SREs are no different: they embrace guiding principles that inform their technical decisions and act as a reliability compass.
Google® coined most of those principles in its site reliability engineering books (https://sre.google/books/), but others appeared later in conference sessions at SREcon (https://www.usenix.org/srecon) and blog posts on many websites.
Again, we have selected some of them as canonical guiding principles based on our experience in assisting customers and organizations in enabling site reliability engineering in their IT shops. The following is the set of guiding principles that are rooted in the SRE persona:
- Scalable operations
- Engineering fidelity
- Observability to the core
- Well-designed service levels
- User-perspective notification trigger
- Blameless postmortems
- Simplicity
We must remark that such principles are not procedures or prescriptive instructions to accomplish something but guidelines. Don’t worry if you are not familiar with the terminology applied here; we dig into them in a detailed manner throughout the book. Let’s investigate each of them along with their most familiar patterns and anti-patterns.
Scalable operations
The operations team, which includes site reliability engineers, is responsible for managing production systems. They are the first responders for any service disruption when something goes wrong. The scalable operations principle states that this team will not grow proportionally to the system as its load increases. Another way to say that is if the number of active users for a given service doubles, the operations team size will not double. A more mathematically accurate way to visualize this is through a logarithmic growth curve. As the operations team gains technical maturity, eliminates repetitive manual tasks, and adopts automation at large, they will need fewer resources to manage more system load:
Figure 1.2 – A logarithmic growth curve
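To make the contrast between linear and logarithmic growth concrete, here is a minimal sketch. The model is an assumption made purely for illustration: a hypothetical base team of four people that adds one person each time the load doubles. Real teams will not follow an exact formula.

```python
import math


def linear_team_size(base_team, load_multiplier):
    """Anti-pattern: the team grows in proportion to the load."""
    return base_team * load_multiplier


def logarithmic_team_size(base_team, load_multiplier, added_per_doubling=1):
    """Assumed model: a fixed number of people is added per doubling of load."""
    return base_team + added_per_doubling * math.log2(load_multiplier)


# Compare both models as the load grows from 1x to 16x.
for multiplier in (1, 2, 4, 8, 16):
    linear = linear_team_size(4, multiplier)
    logarithmic = logarithmic_team_size(4, multiplier)
    print(f"load x{multiplier}: linear={linear}, logarithmic={logarithmic:.0f}")
```

Under these assumed numbers, a 16-fold increase in load takes the linear team from 4 to 64 people, while the logarithmic team only grows from 4 to 8.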
It is worth mentioning that SREs employ a proactive approach as they strive to identify the root cause of issues and devise solutions to detect or prevent problems. The patterns for this principle are as follows:
- Identify and eliminate toil whenever possible
- Document operational procedures as runbooks
- Train operations teams to use and refine runbooks
- Adopt automation platforms and automated procedures documented in runbooks at large
The anti-patterns are as follows:
- Have linear (or exponential) growth for operations teams when the system load rises
- Operational knowledge is tacit or not documented
- Automation is the end goal and not merely a way to eliminate toil
Engineering fidelity
This tenet asserts the obvious: site reliability engineers do engineering. Yet it’s not uncommon to see SREs only working on incident, problem, and change management processes. We are not telling you that site reliability engineers don’t get their hands dirty; on the contrary, they do operational and engineering work. This principle exists to guarantee that SREs will have time to excel in both.
The patterns are as follows:
- Cap operational work at 50% of the available SRE time. The other half is dedicated to engineering solutions and increasing reliability.
- Share some of the operational work with the development team. Sharing 5% of the operational work is usually recommended, so the development team is prepared to take on SRE work.
- Send operational overflow work to the development team as they share the same goals.
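One simple way to check the 50% cap is to compare the hours logged on operational work against the total. The sketch below assumes a hypothetical weekly time log and a hypothetical split of categories into operational and engineering work; only the 50% threshold comes from the pattern above.

```python
OPERATIONAL_CAP = 0.50  # cap on operational work, from the pattern above


def operational_share(hours_by_category, operational_categories):
    """Return the fraction of logged hours spent on operational work."""
    total = sum(hours_by_category.values())
    if total == 0:
        return 0.0
    operational = sum(
        hours
        for category, hours in hours_by_category.items()
        if category in operational_categories
    )
    return operational / total


# Hypothetical weekly time log for one SRE (hours per category).
week = {"incidents": 14, "changes": 6, "on_call": 4, "automation": 10, "design": 6}
operational = {"incidents", "changes", "on_call"}

share = operational_share(week, operational)
if share > OPERATIONAL_CAP:
    print(f"Operational work is at {share:.0%}: consider sending overflow to development")
else:
    print(f"Operational work is at {share:.0%}: within the cap")
```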
The anti-patterns are as follows:
- SREs only work on operational work, resolving incidents, implementing changes, and running root cause analysis (RCA)
- SREs spend most of their time doing firefighting (incident resolution)
- Development teams don’t share any responsibilities with the operations team
Observability to the core
Observability is the ability to comprehend the internal state of a system by inspecting its outputs. It extends the monitoring concept by adding layers to expand the system visibility and allows a more proactive posture by detecting anomalies before they become disruptions. This guiding principle craves visibility and discernability of what’s happening inside a system or application by measuring certain signals.
The patterns of observability are as follows:
- Observe the system behavior through the golden signals; this can be either four (LETS) or five (STELA) signals depending on the school of thought you follow. The LETS acronym stands for latency, error rate, traffic, and saturation. STELA stands for saturation, traffic, error rate, latency, and availability.
- Have monitoring metrics, events, logs, and traces (MELT) at the SRE’s disposal. These are the fundamental data components of any observability platform.
- Run synthetic user testing from time to time. This is a method where a bot mimics a user to test system functionality and response times.
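As an illustration of the LETS signals, the following sketch derives them from a window of request records. Everything here is an assumption for the sake of the example: the `Request` record, the nearest-rank 95th percentile used for latency, and saturation expressed as traffic divided by a known capacity. Observability platforms compute these signals in their own ways.

```python
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: float  # how long the request took
    ok: bool            # whether the request succeeded


def golden_signals(requests, window_seconds, capacity_rps):
    """Compute the four LETS signals for one observation window."""
    count = len(requests)
    traffic = count / window_seconds  # requests per second
    errors = sum(1 for r in requests if not r.ok)
    error_rate = errors / count if count else 0.0
    durations = sorted(r.duration_ms for r in requests)
    # Nearest-rank 95th percentile, using integer arithmetic for the rank.
    rank = (95 * count + 99) // 100
    latency_p95 = durations[rank - 1] if count else 0.0
    # Assumption: saturation is how much of the known capacity is in use.
    saturation = traffic / capacity_rps
    return {
        "latency_p95_ms": latency_p95,
        "error_rate": error_rate,
        "traffic_rps": traffic,
        "saturation": saturation,
    }


# Hypothetical 10-second window: 18 fast requests, 2 slow ones, 1 failure.
sample = [Request(100.0, True)] * 18 + [Request(900.0, True), Request(900.0, False)]
signals = golden_signals(sample, window_seconds=10, capacity_rps=4)
print(signals)
```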
The anti-patterns are as follows:
- Observe only the liveness of the system components, but not from the user’s perspective. For example, checking that components are running versus checking that users can use the system as designed.
- Lack of user experience monitoring. You don’t have visibility of what’s happening in the user interface.
Well-designed service levels
There’s no way to verify whether a service is being delivered to the target user within the expected and agreed-to parameters without established service levels. Part of the undeniable success of site reliability engineering is due to this redefinition of what a good service level is and how we document it. This tenet aims to have not just well-defined service levels but also well-designed ones that measure the system’s reliability.
The patterns are as follows:
- Define service-level indicators (SLIs) from the system user angle, then delineate service-level objectives (SLOs) as an aggregation of the former
- Set the SLO target to less than 100%, so there’s some room for errors (error budget) between 100% and the SLO target to launch new features and enhance overall reliability
- Establish service-level agreements (SLAs), with penalties and fines if they are not met, only after the SLOs have been measured
- Improve SLOs and increase their targets over time through engineering work carried out by the site reliability engineering team
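The relationship between an SLI, an SLO target, and the error budget can be sketched with a few lines of arithmetic. The example assumes a simple availability SLI (good events divided by total events) and uses hypothetical request counts and a hypothetical 99.9% target.

```python
def availability_sli(good_events, total_events):
    """SLI: the fraction of events that were good from the user's angle."""
    return good_events / total_events


def error_budget_remaining(good_events, total_events, slo_target):
    """Return the fraction of the error budget that is still unspent."""
    if not 0 < slo_target < 1:
        # A 100% target leaves no room for errors, so there is no budget.
        raise ValueError("SLO target must be below 100% to leave an error budget")
    allowed_bad = (1 - slo_target) * total_events
    actual_bad = total_events - good_events
    return 1 - actual_bad / allowed_bad


# Hypothetical month: 1,000,000 requests, 600 of them failed.
total, good = 1_000_000, 999_400
slo_target = 0.999  # hypothetical 99.9% objective

sli = availability_sli(good, total)
remaining = error_budget_remaining(good, total, slo_target)
print(f"SLI={sli:.4%}, SLO met={sli >= slo_target}, budget left={remaining:.0%}")
```

With these numbers, the budget allows about 1,000 failed requests, 600 have been spent, and roughly 40% of the budget is left for launching new features.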
The anti-patterns are as follows:
- Define the SLAs first, then measure the SLOs to see whether they are feasible with the current workings of the system.
- Establish a target of 100% for the SLAs or SLOs. This anti-pattern reduces the team’s ability to release new features (or develop system reliability further) as there’s no space for testing them in production. Soon enough, the whole system will become obsolete or non-competitive in its market.
User-perspective notification trigger
Notification is the process of alerting on-call first responders about service or system performance deterioration or downtime. It translates to when a site reliability engineer must be engaged to resolve an incident. This principle states that triggering a notification of an issue should only happen when this issue is affecting the system user. For instance, we never alert an SRE if the CPU load is high but the user is not feeling any service degradation.
The pattern is as follows:
- Alerts are triggered only when there are symptoms at the user level and the alerts are actionable, that is, SREs can do something to resolve them
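The paging decision described by this principle can be expressed as a small rule. The `Signal` record and its fields below are hypothetical; the point is only that a breached threshold is not enough on its own to page someone.

```python
from dataclasses import dataclass


@dataclass
class Signal:
    name: str
    breached: bool     # the threshold for this signal was crossed
    user_facing: bool  # users are feeling the symptom
    actionable: bool   # an SRE can do something about it


def should_page(signal):
    """Page on-call only for user-facing, actionable symptoms."""
    return signal.breached and signal.user_facing and signal.actionable


# Hypothetical signals: high CPU that users do not feel, and failing checkouts.
high_cpu = Signal("cpu_load_high", breached=True, user_facing=False, actionable=True)
checkout_errors = Signal("checkout_error_rate", breached=True, user_facing=True, actionable=True)

for signal in (high_cpu, checkout_errors):
    action = "page on-call" if should_page(signal) else "record as an event only"
    print(f"{signal.name}: {action}")
```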
The anti-patterns are as follows:
- Alerting noise. SREs cannot differentiate between alerts that are mere informative events and ones that affect the system user.
- Lack of alerting. End users engage with the help desk to notify them that there are problems in the system.
Blameless postmortems
Postmortems are, in essence, root cause analysis (RCA) exercises. They receive a peculiar name to avoid running into the same old pitfall: finding someone or something to be blamed (the root cause) rather than improving the system’s quality and learning from mistakes. Postmortems also focus on questions such as how to detect, respond to, and repair a disruption in the service faster, rather than just uncovering the root cause alone. This tenet is one of the hardest to deploy for a new organization if it has been doing traditional RCAs for some time, and it requires a blameless culture to support it.
The pattern is as follows:
- Infinite hows. Ask multiple questions, starting with the term how, to determine enhancements to the system (infrastructure and applications), processes, and knowledge base.
The anti-pattern is as follows:
- Go back to traditional RCAs where no progress is made on reliability
Simplicity
This guiding principle was imported from the Agile Manifesto. We can’t explain it better than the manifesto (https://agilemanifesto.org/principles.html): “the art of maximizing the amount of work not done.” In other words, it dictates that site reliability engineers are always looking for ways of simplifying and avoiding unnecessary work. They are eager to eliminate toil, that is, tasks that are repetitive, manually intensive, or of low or no business value. However, as humans, we inherently tend to complicate everything, so ensuring runbooks are kept readable and easy to follow is a good example of this principle.
The pattern is as follows:
- Keep it simple, stupid (KISS) is a proven design principle from the US Navy that says most systems work better if they are simple to use or follow
The anti-pattern is as follows:
- Overly elaborate processes for SRE work
We just explained our preferred seven guiding principles that site reliability engineers follow in their profession. They are an integral chunk of the SRE mindset. Let’s now cover what SREs do in their free time to overcome learning limitations.
SRE hobbies
Jeremy and I couldn’t agree more about what makes a site reliability engineer rockstar: their hobbies. What you do in your free time for leisure or as a second profession leads to greater levels of conceptual and practical knowledge. The trick is finding a hobby that you have a passion for and that helps in the SRE role.
We can’t tell you what the best-fit extra-curricular activity that will pump up your SRE skills is, but here we list some examples that may interest you, grouped by the skills that they enhance.
Analytical thinking
Site reliability engineers have a good analytical processing capacity. They need to analyze large amounts of data and detect patterns, trends, and anomalies by correlating different data sources. Some engaging hobbies that leverage your analytical thinking are as follows:
- Chess: Without saying too much about it, this game has its own set of theories and algorithms. It is a good way to practice thinking multiple steps ahead while focusing on the present.
- Board games: There are plenty of board games that make you analyze information to win. And they make it enjoyable and social.
- Rubik’s cube: This fun toy is also a good example of simplicity in operation and shape. It presents a complex challenge with a plain design.
- Video games: Strategy and role-playing games will train your mind in thinking analytically.
Creativity
SREs need to forge new algorithms for observability. They also need to construct new ways of measuring system reliability, as applications and infrastructure components have an uncountable number of arrangements, technologies, and architectures. This may sound cliché, but thinking outside the box is where site reliability engineers shine. Here are some hobbies that may help with your creativity:
- Algorithms development: Although this may be part of your daily work life, you can find fun in it by developing 2D or 3D video games, for instance. Another option is to contribute to open source software projects in the wider community.
- Drawing or painting: This is a relaxing and artistic example. It also gets you used to finding inspiration.
- LEGO®: A construction toy famous across the globe, LEGO makes you think about new forms, shapes, structures, and ways of assembly. It also has a robotics range that gives you programming skills.
- Internet-of-Things prototyping: How about developing embedded projects with Arduino® or Raspberry Pi® boards? You need to build both the hardware and firmware for the project to come together.
- Video games: The ones where you need to build something with blocks and basic structures, such as Minecraft®, are especially useful in nurturing creativity.
- 3D printing: Author Jeremy is a 3D printing master. As with painting, you can express your art, this time in three dimensions.
Troubleshooting
Troubleshooting is not exclusive to site reliability engineers. Systems administrators must also figure out why a service or system is down and how to repair it. However, SREs use systems thinking and scientific approaches to troubleshoot differently. You need to train your mind to resolve problems logically and calmly, because you are going to need that skill. There are plenty of hobbies that can stimulate you to excel in this area. Let’s list some of them:
- Crossword or jigsaw puzzles: People are addicted to this type of entertainment. It’s an excellent choice to keep the mind sharp and trained.
- Sudoku: This was a trend not long ago, but it is still an excellent way to polish up troubleshooting skills.
- Video games: The ones full of puzzles, such as Portal, are especially good for troubleshooting practice.
We have now covered the aspects of the site reliability engineer persona. Next, we will look at what makes site reliability engineering professionals unique by comparing them to other roles and listing responsibilities and activities.