Becoming a Rockstar SRE

SRE Job Role – Activities and Responsibilities

A lot has been said about site reliability engineering: what it is, what it is not, and the many practices and techniques that we should apply to adopt the site reliability engineering model. Who site reliability engineers (SREs) are is often put aside, even though it is a crucial aspect. The same goes for how people from various parts of information technology (IT) become SREs and how some of them come to be recognized as thought leaders in this domain.

In particular, little has been said about the site reliability engineer persona, which the following questions outline:

What do they know?
Which skills have they developed?
What do they do daily?
What are their primary responsibilities?

Those characteristics would explain, at a bare minimum, why someone should start the journey to becoming an SRE rockstar. That’s precisely why we decided to start this book by outlining the SRE job role.

In this chapter, we’re going to cover the following main topics:

Making this journey personal
Understanding the mindset and hobbies of an SRE
DevOps engineers versus SRE versus others
Describing an SRE’s main responsibilities
An overview of the daily activities of an SRE
People that inspire

Making this journey personal

Unfortunately, when an enterprise starts to adopt SRE into its IT governance processes, it often fails to use a people-processes-tools (PPT) model, with a clear vision of each pillar, to transform its operations and software development areas. Even more often, it doesn’t emphasize the people element of PPT in such transformations. We want to change that by making this learning journey personal and centered on individuals rather than on the processes or technologies involved.

It’s critical to understand (and learn) what drives typical SREs forward, which fundamental skills they have developed, and how they hone their skills over time to go above and beyond at work. For that purpose, we will divide this subject into three sections:

SRE driving forces
SRE skills
SRE traits

Let’s start this personal journey by understanding why you should become an SRE.

SRE driving forces

We want to explore what motivates or incentivizes site reliability engineers. There’s no journey of any nature if there is no driving force pushing you through. As a word of advice, we should warn you that learning about site reliability engineering is more of an expedition than a tourism trip. In other words, it’s more a marathon than a sprint. Having clarified that, we’ll begin by putting the possible rewards of this journey on the table. Let’s depict each driving force as a mockup code snippet (JavaScript) to make it fun.

Money

If we could represent in the form of an algorithm how money drives people when they don’t earn enough, it would look like the following:

// money
if (money < MyMinimumSalary) {
    motivated = false;
    excitement--;
}
doMyWork();
if (motivated && jobSatisfaction) {
    honeSRESkills();
    doExtraWork();
} else {
    lookForAnotherJob();
}

Site reliability engineers make more money than most other technical professionals. According to a Glassdoor (2022) report, they can earn more than USD 118K per year on average. In similar reports, SREs are even noted to have surpassed DevOps engineers in a salary comparison. Nevertheless, not making enough money can be a key demotivating factor. It is hard for anyone to move forward with their career if they are preoccupied with expenses.

Although SREs earn a high income on average, their salaries vary by country, years of experience, and employer. Companies justify SRE salary levels based on the reliability value that SREs bring to the table. Rest assured, the site reliability engineering career is well rewarded in terms of compensation.

Job satisfaction

What affects our job satisfaction can be depicted as code logic as follows:

// jobSatisfaction
if (interestingJob || purposefulWorkActivities || challengingSkillDevelopment || technicalAppreciation) {
    jobSatisfaction = true;
    excitement++;
}

Job satisfaction is another driving force of site reliability engineers, and it has many factors. We usually translate job satisfaction to employee happiness at work. Site reliability engineering leads to job satisfaction when we look at the following profession characteristics: exciting job content, purposeful work activities, challenging skill development, and technical appreciation.

The job content of site reliability engineering spans multiple domains. You can work with developers one day and help systems administrators the next. You may need to assist in redesigning an app to increase its service reliability. As with any generalist job that demands technical depth in many subject areas, you will never get bored.

As we will see later in this chapter, SRE work activities have clear business value. They improve not just the service quality, availability, and resiliency, but also the system’s reliability. Reliable services might help with customer loyalty, bringing additional revenue to the service provider. There is a direct relationship between SRE work and business metrics improvement, making their efforts purposeful.

Since site reliability engineering is a cross-technology domain engineering discipline, any skills acquisition is challenging. SREs have knowledge and skills that a systems administrator or software developer doesn’t have. They are required to keep those skills updated and hone them over time. This necessity to keep learning brings the always-moving-forward feeling that may not happen if you only need to master a single product or technology.

The last factor on our list is technical appreciation. According to Boston Consulting Group (BCG) research, appreciation is the number one job happiness factor. Being an SRE, you will aid customers, users, and other technical professionals because of your keen holistic view of the systems. Consequently, technical appreciation for the job you do is common, and who doesn’t like that?

Innovative solutions

The following code gives you an idea of how exciting exploring uncharted terrains is:

if (!solutionExists) {
    deviseNewSolution();
    excitement++;
}

Site reliability engineers are natural trailblazers as they explore new technologies and processes to obtain better reliability and eliminate toil (manual and repetitive tasks that are devoid of value). They face many scenarios and situations that are a first of their kind. Moreover, they are responsible for paving the path for others by documenting procedures in runbooks when none exist. There’s nothing more exciting than devising new solutions or improving existing ones. Imagine how you would feel if they named a technical operating procedure after you.

Nevertheless, SREs want to minimize complexity and reduce technical debt. They don’t create a solution just for the sake of doing it unless it adds value and resolves or prevents events that impact customers.

Good relationships

The following code snippet is a representation of how good relationships are a result of an exciting working environment:

if (excitement > HIGH) {
    motivateOthers();
    relationships.healthy = true;
}

Good workplace relationships are one of the top 10 factors contributing to employee happiness, and SREs usually enjoy them. The reason is straightforward: they act as integration hubs among different tribes and have the mission of breaking down company silos. SREs need cooperation from both development and operations teams. They are technical diplomats and have strong communication skills. Since they are usually excited about their work, they tend to socialize more with colleagues and leaders, potentially helping to improve the social environment around them. That doesn’t mean they need to be extroverts with advanced public-speaking skills, but SREs are certainly good teachers because they are excited and compelled to talk about what they do.

SRE skills

Now that you know what’s in it for you, it’s time to check which skillsets SREs must develop throughout their careers. Site reliability engineers have a good mix of knowledge, skills, and experience, some shared with other roles and some unique to them. SREs have technical skills that span the entire solution life cycle, from the design step to the manage step.

Figure 1.1 – SRE skills

The preceding Venn diagram shows how SREs acquire skills common to other professions and how SRE skills connect the various steps of the solution life cycle. In essence, site reliability engineers are senior technical resources that follow a generalist proficiency model with good depth in certain areas of expertise.

There’s no consensus in the market about the canonical set of skills for SREs. It would not make sense for this to be the case because as soon as any technology-based skill becomes obsolete, we would need to remodel the whole profession. Instead, SRE core skills should be as technology-agnostic as possible.

We recommend a blend of distinct expertise from the IT architect, software developer, data scientist, DevOps engineer, and systems administrator roles. The proportion of each skill level varies according to a multitude of factors. You will need to determine which skills are more in demand than others, but an organization should have all of them in its toolkit.

Systems thinking

Site reliability engineers have a holistic view of the system’s reliability by understanding the availability, resiliency, and performance of each solution component at both the application and infrastructure levels.

Software engineering

SREs develop code and software. They know how to utilize algorithms and software development techniques such as agile frameworks. SREs need to be proficient enough in instrumenting the app code to increase its manageability and observability. SREs know how to use software development life cycle tools and technologies, including DevOps (continuous integration/continuous delivery – CI/CD) pipelines. They can provide testers with better test cases that consider service reliability targets.

Systems management

SREs know how to manage, administer, and operate systems. They share most of the skills from the systems administrator role on multiple technologies. Their technology knowledge spans the cloud, containers, storage, networking, operating systems, middleware, and databases. They have the skills to implement monitoring, event management, logging, tracing, service levels, observability, DevOps toolchains, and automation of toil.

Data science

SREs work with huge amounts of structured data. They must acquire the knowledge and skills to make sense of such datasets by using mathematical models. SREs know how to analyze data to uncover trends, anomalies, and insights – always from the user’s perspective.
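As a small illustration of uncovering anomalies with a mathematical model, the following sketch flags samples whose z-score (distance from the mean, measured in standard deviations) exceeds a threshold. The function name, the threshold of 2, and the latency values are assumptions made for this example; real datasets usually call for more robust techniques.

```javascript
// Hypothetical helper: returns the samples that sit far from the series mean.
function findAnomalies(samples, threshold = 2) {
  const mean = samples.reduce((sum, x) => sum + x, 0) / samples.length;
  const variance =
    samples.reduce((sum, x) => sum + (x - mean) ** 2, 0) / samples.length;
  const stdDev = Math.sqrt(variance);
  if (stdDev === 0) return []; // every sample is identical, nothing stands out
  return samples
    .map((value, index) => ({ index, value, zScore: (value - mean) / stdDev }))
    .filter((sample) => Math.abs(sample.zScore) > threshold);
}

// Made-up response times in milliseconds, with one obvious outlier.
const latenciesMs = [120, 118, 125, 122, 119, 121, 480, 123, 117, 120];
const anomalies = findAnomalies(latenciesMs);
console.log(anomalies.map((a) => a.value));
```

With this sample series, only the 480 ms reading is flagged; the remaining values stay well within the threshold.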

We recommend that every SRE has the following selection of core skills:

Systems thinking for focusing on the reliability of the system
The ability to develop and test software
The ability to deploy and release apps
IT service management
Systems monitoring and observability
Working with DevOps tools and automation
The application of data science for reliability of systems

Although we didn’t explicitly mention security in any of the knowledge domains, enforcing security across multiple layers is present in all of them.

Important note

All fundamental SRE skills are covered in this book’s chapters. The chapters have been organized to optimize your learning journey, so they don’t follow the preceding order of skills. We structured this book based on our own experience acquired from a multitude of site reliability engineering coaching and mentoring sessions.

We provide a manifesto model in Appendix A, The Site Reliability Engineer Manifesto, that acts as a more structured guide for site reliability engineering adoption, including the fundamental skills. We hope that it helps your company join the site reliability engineering movement.

SRE traits

Besides what SREs know and which skills they must develop, it’s relevant to know their other good traits.

Software is everywhere

Site reliability engineers have a software engineering mindset. The idea of approaching any issue as a software problem may be disruptive at first; however, there is a good reason for it. Imagine that you need to restart and verify a system by manually issuing a specific set of commands and parameters many times per week. If you handle it as a software development problem, the solution will be developing and scheduling a simple program or automation to execute this task instead. SREs embrace automation over toil as one of their best tools.
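The restart-and-verify scenario could be sketched as follows. This is a minimal example under stated assumptions: restartAndVerify, restart, and isHealthy are hypothetical names, and the stand-in functions below simulate a service rather than issue real commands.

```javascript
// Hypothetical automation: restart a service, then poll until it reports healthy.
// `restart` and `isHealthy` are injected so the sketch stays tool-agnostic.
async function restartAndVerify({ restart, isHealthy, attempts = 5, delayMs = 1000 }) {
  await restart();
  for (let attempt = 1; attempt <= attempts; attempt++) {
    if (await isHealthy()) {
      return { ok: true, attempt };
    }
    if (attempt < attempts) {
      // Wait before the next health check.
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
  return { ok: false, attempt: attempts };
}

// Usage with stand-in functions: the simulated service is healthy on the third check.
let checks = 0;
restartAndVerify({
  restart: async () => console.log("restart issued"),
  isHealthy: async () => ++checks >= 3,
  delayMs: 10,
}).then((result) => console.log(result));
```

In practice, the injected functions would wrap whatever commands the team already runs by hand, and a scheduler would invoke the program so that nobody has to.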

Comfortable to code

SREs are not just able to develop code; they really like doing it. As we will see later in this chapter, they develop code as a frequent activity and main responsibility. It’s not just a question of learning how to code or program when someone asks; SREs always feel confident in constructing good pieces of software.

Change as a constant

Frequent releases of new features, code enhancements, and reliability improvements are vital for any business. SREs are the first to accept calculated risks to provide more value to the system users. They are not risk averse but bring visibility to inherent risks so they can throttle the speed of change. They are always prepared to make progress and go above and beyond for service reliability.

Handle complexity and scale

They are not afraid of complexity or scale. They know that modern workloads are intrinsically complex and must be scalable horizontally and vertically. SREs work with large systems with multiple components running in hybrid multi-cloud environments. They understand the application’s full-stack design, its moving parts, and how they connect to each other.

Problems as opportunities

SREs participate in on-call rotations and schedules to respond to service disruptions. They see incidents and problems as opportunities to learn and advance the reliability of the system. Not just that, they also have the competence to translate technology into business language to measure the impact on users and customers. They advocate for a blameless culture by prioritizing answers to questions, such as how to detect and repair incidents faster next time. They also consider how technical challenges may affect business results.

We have just gone over what makes the SRE persona: their motivations, skills, and traits. Now we are going to understand how site reliability engineers think.

Understanding the mindset and hobbies of an SRE

It’s not rare for site reliability engineers to have a broader and divergent view of their surroundings. We are not saying that SREs are weird; well, they are in a certain sense, as they employ a relentless search for improving reliability in all things. However, we are referring to their mindset and how they approach the world.

In this section, we will explore different aspects of their thought process in the work environment and what they like to do in the job and outside it. We have divided this topic into three sections:

SRE affinity game
SRE guiding principles
SRE hobbies

You may have asked whether site reliability engineering is the right profession for you. Let’s examine that next.

SRE affinity game

Let’s play a game! What do you think your affinity or compatibility is with the site reliability engineering profession? We will present a series of scenarios that SREs face. You need to answer each of them with love, like, dislike, or hate, indicating how much you see yourself doing it and how you would feel about it. Try to be as honest as possible.

Disclaimer

This is not an anthropological scientific survey based on a human behavioral model or theory by any means. It’s a simple questionnaire to help you understand your own affinity to the SRE job role.

The scenarios are in the following list. Get a piece of paper, write down the question number, and answer it. Good luck!

Your boss asks you to resolve a problem that no one else has ever resolved.
You need to spend a few hours looking through logs, metrics, graphs, and events to verify whether there are any new anomalies that were not detected automatically.
You need to participate in an on-call rotation or schedule where you might be called late in the night to respond to a service disruption that has a business impact.
You need to work on a backend system or software that is not visible to external users.
You need to devise new ways to increase a large system’s overall reliability.
You are asked to work on a large-scale problem, which affects hundreds of users and has dozens of components and dependencies, that runs on a hybrid multi-cloud environment.
You are diagnosing a system problem that is making users from a certain geography unable to access their services, and there is great pressure on you.
You need to approach problems with a selected scientific method or data model to uncover facts instead of guessing.
You constantly ask yourself how you could make things around you better and more reliable.
You need to classify and categorize systems information and functionalities so you can isolate causes from effects.
You must diagnose and fix a system problem by investigating components that are not usually visible by going deep into each component configuration as debugging mode is not available.
You need to design a detailed diagram of how the user interacts with a system or software so you can point out where to observe for symptoms.

After you complete this exercise, assign points to each of the answers. If you replied to a scenario with a love answer, assign 5 points to it. For a like answer, you get 3 points. Dislike has a value of 0, and hate is -3 (negative!). Sum your points across all 12 scenarios to get your score, and check the result against the following list:

Over 34 points: Your affinity is very high; this is the right career for you
From 21 to 34 points: Your affinity is high; you should consider this profession
From 13 to 20 points: Your affinity is medium; this may be a good job role for you
Below 13 points: SRE may not be your best option
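If you prefer code over pen and paper, the scoring rules above can be sketched as follows. The function names and the short labels returned (such as "low" for the last band) are our own shorthand for this example, and the sample answers are made up.

```javascript
// Points per answer, as defined in the scoring rules.
const POINTS = new Map([
  ["love", 5],
  ["like", 3],
  ["dislike", 0],
  ["hate", -3],
]);

// Sums the points for a list of answers, one answer per scenario.
function affinityScore(answers) {
  return answers.reduce((total, answer) => {
    if (!POINTS.has(answer)) throw new Error(`Unknown answer: ${answer}`);
    return total + POINTS.get(answer);
  }, 0);
}

// Maps a score to its affinity band.
function affinityLevel(score) {
  if (score > 34) return "very high";
  if (score >= 21) return "high";
  if (score >= 13) return "medium";
  return "low"; // "SRE may not be your best option"
}

// Sample answers for the 12 scenarios: 4 love, 5 like, 2 dislike, 1 hate.
const answers = [
  "love", "like", "dislike", "like", "love", "love",
  "hate", "like", "love", "like", "dislike", "like",
];
const score = affinityScore(answers);
console.log(score, affinityLevel(score));
```

For these sample answers, the total is 4 × 5 + 5 × 3 + 2 × 0 - 3 = 32 points, which falls in the high-affinity band.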

This may be a game, but it will have made you imagine yourself in an SRE’s shoes. We have started to understand the SRE mindset, so let’s check what guides them in the convoluted scenarios listed previously.

SRE guiding principles

Everyone has a set of principles (and values) that acts as their compass. SREs are no different; they embrace guiding principles to advise them on technical decisions and act as a reliability compass.

Google® coined most of those principles in its site reliability engineering books (https://sre.google/books/), but others appeared later in conference sessions at SREcon (https://www.usenix.org/srecon) and blog posts on many websites.

Again, we have selected some of them as canonical guiding principles based on our experience in assisting customers and organizations in enabling site reliability engineering in their IT shops. The following is the set of guiding principles that are rooted in the SRE persona:

Scalable operations
Engineering fidelity
Observability to the core
Well-designed service levels
User-perspective notification trigger
Blameless postmortems
Simplicity

We must remark that such principles are not procedures or prescriptive instructions to accomplish something but guidelines. Don’t worry if you are not familiar with the terminology applied here; we dig into them in a detailed manner throughout the book. Let’s investigate each of them along with their most familiar patterns and anti-patterns.

Scalable operations

The operations team, which includes site reliability engineers, is responsible for managing production systems. They are the first responders when a service disruption occurs. The scalable operations principle states that this team will not grow in proportion to the system as its load increases. In other words, if the number of active users of a given service doubles, the size of the operations team will not double. A more mathematically accurate way to visualize this is through a logarithmic growth curve. As the operations team gains technical maturity, eliminates repetitive manual tasks, and adopts automation at large, it will need fewer resources to manage more system load:

Figure 1.2 – A logarithmic growth curve
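To make the shape of that curve concrete, the following sketch compares a team that grows linearly with load against one that grows logarithmically. It is an illustrative model only: the base team of 4 engineers, the base load of 10,000 active users, and the growth factor of one engineer per doubling of load are assumed values, not figures from any real organization.

```javascript
// Illustrative model only; all three constants are assumed values.
const BASE_TEAM = 4;      // engineers needed at the base load
const BASE_LOAD = 10000;  // active users at the base load
const GROWTH_FACTOR = 1;  // engineers added each time the load doubles

// Team size if headcount grew in direct proportion to the load.
function teamSizeLinear(load) {
  return BASE_TEAM * (load / BASE_LOAD);
}

// Team size if headcount grew with the logarithm of the load.
function teamSizeLog(load) {
  return BASE_TEAM + GROWTH_FACTOR * Math.log2(load / BASE_LOAD);
}

for (const load of [10000, 20000, 40000, 80000]) {
  console.log(load, teamSizeLinear(load), teamSizeLog(load));
}
```

Under these assumptions, each doubling of the load doubles the linear team but adds only one engineer to the logarithmic one, so the gap widens quickly as the service grows.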

It is worth mentioning that SREs employ a proactive approach as they strive to identify the root cause of issues and devise solutions to detect or prevent problems. The patterns for this principle are as follows:

Identify and eliminate toil whenever possible
Document operational procedures as runbooks
Train operations teams to use and refine runbooks
Adopt automation platforms and automate the procedures documented in runbooks at large