Search icon CANCEL
Subscription
0
Cart icon
Your Cart (0 item)
Close icon
You have no products in your basket yet
Save more on your purchases! discount-offer-chevron-icon
Savings automatically calculated. No voucher code required.
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Free Learning
Arrow right icon

Site reliability engineering: Nat Welch on what it is and why we need it [Interview]

Save for later
  • 4 min read
  • 26 Sep 2018

article-image

At a time when software systems are growing in complexity, and when the expectations and demands from users have never been more critical, it's easy to forget that just making things work can be a huge challenge. That's where site reliability engineering (SRE) comes in; it's one of the reasons we're starting to see it grow as a discipline and job role.

The central philosophy behind site reliability engineering can be seen in trends like chaos engineering. As Gremlin CTO Matt Fornaciari said, speaking to us in June, "chaos engineering is simply part of the SRE toolkit."

For site reliability engineers, software resilience isn't an optional extra - it's critical. In crude terms, downtime for a retail site means real monetary losses, but the example extends beyond that. Because people and software systems are so interdependent, SRE is a useful way for thinking about how we build software more broadly.

To get to the heart of what site reliability engineering is, I spoke to Nat Welch, an SRE currently working at First Look Media, whose experience includes time at Google and Hillary Clinton's 2016 presidential campaign.

site-reliability-engineering-nat-welch-on-what-it-is-and-why-we-need-it-interview-img-0

Nat has just published a book with Packt called Real-World SRE. You can find it here.

Follow Nat on Twitter: @icco

What is site reliability engineering?


Nat Welch: The idea [of site reliability engineering] is to write and modify software to improve the reliability of a website or system. As a term and field, it was founded by Google in the early 2000s, and has slowly spread across the rest of the industry. Having engineers dedicated to global system health and reliability, working with every layer of the business to improving reliability for systems.

"By building teams of engineers focused exclusively on reliability, there can be someone arguing for and focusing on reliability in a way to improve the speed and efficiency of product teams."

Why do we need site reliability engineering?


Nat Welch: Customers get mad if your website is down. Engineers often were having trouble weighing system reliability work versus new feature work. Because of this, product feature work often takes priority, and reliability decisions are made by guess work. By building teams of engineers focused exclusively on reliability, there can be someone arguing for and focusing on reliability in a way to improve the speed and efficiency of product teams.

Unlock access to the largest independent learning library in Tech for FREE!
Get unlimited access to 7500+ expert-authored eBooks and video courses covering every tech area you can think of.
Renews at R$50/month. Cancel anytime

Why do we need SRE now, in 2018?


Nat Welch: Part of it is that people are finally starting to build systems more like how Google has been building for years (heavy use of containers, lots of services, heavily distributed). The other part is a marketing effort by Google so that they can make it easier to hire.

What are the core responsibilities of an SRE? How do they sit within a team?


Nat Welch: SRE is just a specialization of a developer. They sit on equal footing with the rest of the developers on the team, because the system is everyone's responsbility. But while some engineers will focus primarily on new features, SRE will primarily focus on system reliability. This does not mean either side does not work on the other (SRE often write features, product devs often write code to make the system more reliable, etc), it just means their primary focus when defining priorities is different.

What are the biggest challenges for site reliability engineers?


Nat Welch: Communication with everyone (product, finance, executive team, etc.), and focus - it's very easy to get lost in fire fighting.

What are the 3 key skills you need to be a good SRE?


Nat Welch: Communication skills, software development skills, system design skills. You need to be able to write code, review code, work with others, break large projects into small pieces and distribute the work among people, but you also need to be able to take a system (working or broken) and figure out how it is designed and how it works.

Thanks Nat!

Site reliability engineering, then, is a response to a broader change in the types of software infrastructure we are building and using today. It's certainly a role that offers a lot of scope for ambitious and curious developers interested in a range of problems in software development, from UX to security.

If you want to learn more, take a look at Nat's book.