DevOps engineers versus SRE versus others
This is one of the most frequently asked questions we receive from customers and organizations: how does the site reliability engineering profession differ from other existing technical roles? We already talked about how SREs are the connection between the different steps of the solution life cycle. Here, we’ll focus our discussion on the DevOps engineer role, and later, we’ll broaden it. We have split this discussion into two sections:
- DevOps and site reliability engineers
- Software and site reliability engineers
DevOps and site reliability engineers
Google described the relationship between DevOps and SRE with a famous subtitle in its publication, The Site Reliability Workbook:
class SRE implements DevOps
This statement is an elegant way to define the link, and it borrows Java syntax, in which a class implements an interface. It implies that site reliability engineering is a concrete, more detailed implementation of what DevOps describes. It follows that site reliability engineering has much in common with DevOps. However, what exactly does site reliability engineering implement from DevOps, and what are the differences between a site reliability engineer and a DevOps engineer? We have visualized these similarities and divergences in the following infographic:
Figure 1.3 – An infographic on SRE and DevOps
Notice that the two roles have shared values. Both SREs and DevOps engineers require the values in the orange box (bottom right in the preceding diagram). In the bottom-left table, you can see the differences between those roles. Typically, site reliability engineers resolve operational problems by applying the right software engineering disciplines. DevOps engineers, on the other hand, resolve development and delivery pipeline issues with systems management techniques, mainly automation and infrastructure as code. The two roles also concentrate different levels of effort on distinct phases of the solution life cycle, as depicted in the infographic.
It’s common to hear that DevOps is a shift-right transformation while site reliability engineering is a shift-left one. In other words, DevOps moves from the left (the development side of the equation) to the right (the operations side of the equation), while site reliability engineering moves from right to left. Another term we hear a lot is DevSecOps, which adds security to the name. Since security has always been implied in these roles, we think including new letters in the middle is confusing and redundant.
SREs and DevOps engineers are, in our opinion, two sides of the same coin. Because they share values, they should be more like best friends forever than opposing roles. Let’s check how SREs fulfill those values across the five main areas of DevOps:
- Reduce organizational silos: SREs use the same tooling as developers or DevOps engineers. They also share objectives and performance metrics with them.
- Accept failure as normal: SREs embrace risk by using an error budget for new features. They quantify failure through service-level indicators (SLIs) and service-level objectives (SLOs), and they run postmortems in a blameless culture.
- Implement gradual changes: SREs work to increase reliability, and more reliable systems allow more frequent changes and releases.
- Leverage tooling and automation: SREs eliminate toil by automating operational tasks at a constant pace.
- Measure everything: SREs measure reliability by implementing observability and collecting metrics, events, logs, and traces (MELT) data. They also have ways to identify and size toil.
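To make the error budget idea from the list above more concrete, here is a minimal sketch of how an SLI, an SLO, and the resulting budget relate. It assumes a request-based availability SLI. The SloWindow class, the can_release policy, and the sample numbers are hypothetical and only serve as an illustration, not as a prescribed implementation.

```python
from dataclasses import dataclass


@dataclass
class SloWindow:
    """Request counts observed over one SLO window (for example, 30 days)."""

    slo_target: float     # e.g. 0.999 means 99.9% of requests should succeed
    total_requests: int
    failed_requests: int

    @property
    def sli(self) -> float:
        """Measured availability: the fraction of requests that succeeded."""
        if self.total_requests == 0:
            return 1.0
        return 1.0 - self.failed_requests / self.total_requests

    @property
    def error_budget(self) -> float:
        """Number of failed requests the SLO tolerates in this window."""
        return (1.0 - self.slo_target) * self.total_requests

    @property
    def budget_remaining(self) -> float:
        """Fraction of the budget still unspent (negative when overspent)."""
        if self.error_budget == 0:
            return 0.0 if self.failed_requests else 1.0
        return 1.0 - self.failed_requests / self.error_budget

    def can_release(self) -> bool:
        """Hypothetical policy: ship new features only while budget remains."""
        return self.budget_remaining > 0


# Sample numbers are invented for illustration.
window = SloWindow(slo_target=0.999, total_requests=1_000_000, failed_requests=400)
print(f"SLI: {window.sli:.4%}")
print(f"Error budget remaining: {window.budget_remaining:.0%}")
print("Release allowed:", window.can_release())
```

In this sketch, a team that has already spent its budget would see can_release() return False, which is one simple way to turn the "accept failure as normal" value into a release decision.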
Software and site reliability engineers
Another frequently asked question is how site reliability engineers differ from software engineers (SWEs). The short answer is simple: they have the same core skills but different work scopes.
What are SWEs? SWEs design, engineer, and architect applications using modeling languages and requirements analysis techniques. They work in an integrated development environment (IDE) and develop code for use cases in one of the many available programming languages. They create test cases and testing suites. They also integrate software and service components and handle their dependencies. SWEs work with many software development life cycle tools and processes.
Site reliability engineers may execute the same activities, but they do so with the intent of improving reliability. For instance, when SREs develop code, they are more likely to instrument the application so that it generates more logs than to code a use case. SREs also treat operations as a software problem and see daily systems management tasks as possible software coding opportunities. In addition, SREs have other core skills related to systems thinking, systems management, and data science.
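As an illustration of what instrumenting application code can look like, the following sketch wraps an existing function so that every call logs its outcome and latency. It uses only the Python standard library. The checkout logger name, the instrumented decorator, and the place_order function are hypothetical names chosen for this example.

```python
import functools
import logging
import time

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s %(message)s",
)
logger = logging.getLogger("checkout")  # hypothetical service name
logger.setLevel(logging.INFO)


def instrumented(func):
    """Log the outcome and latency of every call to the wrapped function."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            result = func(*args, **kwargs)
        except Exception:
            elapsed_ms = (time.perf_counter() - start) * 1000
            # logger.exception also records the stack trace of the failure.
            logger.exception(
                "call=%s outcome=error latency_ms=%.1f", func.__name__, elapsed_ms
            )
            raise
        elapsed_ms = (time.perf_counter() - start) * 1000
        logger.info("call=%s outcome=ok latency_ms=%.1f", func.__name__, elapsed_ms)
        return result

    return wrapper


@instrumented
def place_order(order_id: str, amount: float) -> str:
    """The use case an SWE would write; the business logic is unchanged."""
    if amount <= 0:
        raise ValueError("amount must be positive")
    return f"order {order_id} accepted"


print(place_order("A-1001", 25.0))
```

The use case itself stays as the SWE wrote it. The reliability-oriented work lives in the decorator, which could later feed the same outcome and latency data into metrics or traces.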
Indeed, an SRE could become an SWE and vice versa, and that leads us to another principle that we find in the Google materials.
Common staffing pool
Another principle is hiring site reliability engineers and SWEs from the same staffing pool. This principle works well for companies where most employees are software developers and engineers, and having a shared pool means that site reliability and software engineering job roles are interchangeable. However, this principle may be much more challenging for enterprises with a mix of systems administrators and developers. Hence, we left it out of our list in the previous section.
We could compare the SRE’s unique profession to many others, but we limited this topic to the most common comparisons. SREs are not architects, developers, systems administrators, or data scientists; they are more than all of these roles combined. Next, we will look at the primary responsibilities of an SRE.