Real-World SRE: The Survival Guide for Responding to a System Outage and Maximizing Uptime

Product type: Paperback
Published in: Aug 2018
Publisher: Packt
ISBN-13: 9781788628884
Length: 340 pages
Edition: 1st Edition
Author: Nat Welch

Table of Contents (13 chapters)

Preface
1. Introduction
2. Monitoring
3. Incident Response
4. Postmortems
5. Testing and Releasing
6. Capacity Planning
7. Building Tools
8. User Experience
9. Networking Foundations
10. Linux and Cloud Foundations
Other Books You May Enjoy
Index

SRE as a framework for new projects

One way to use this book is as a framework for working on a new project. As each chapter is about a different level of the hierarchy, you can work through the book to figure out where in the hierarchy your project sits. If it is a new project, then often it will be right at the bottom of the hierarchy, with no, or very little, monitoring implemented.

At each level, if there are others on the team, begin a conversation to figure out what exists and whether it meets the team's needs. Each chapter will provide a rough rubric for that discussion, but remember that every team and project is unique. If you are the only person who is thinking about reliability and infrastructure, then you may end up spending a significant amount of time proposing solutions and pushing the project in a certain direction. Just remember that the point is to improve the reliability of the service, help the business, and improve the user's experience of the service.

You may find yourself distracted by each thing that you could fix. It is highly recommended that you document the problems you see before diving in. Diving in is very satisfying, but documenting first keeps you from skipping over requirements or spending too much time on a solution that doesn't work for your business (for example, integrating your system with a monitoring service you can't afford, or building a distributed job scheduler when you could have just used a piece of open source software).

So, when joining a new project, or evaluating a new service, here is a set of steps to follow:

  • Figure out the team structure. Who owns what? Who is in charge?
  • Find any documentation the team has for their service or the project.
  • Get someone to draw out the system architecture. Have them show you what connects to which service, what depends on the project, how data flows through the service, and how the project is deployed.

    Figure 3: An example system architecture diagram. This is a very simple diagram that someone might draw on a whiteboard. Most companies will have something much more complex or detailed than this, but this is often the level of detail you need. Boxes with names and arrows show what talks to what.

    Figure 4: Second example of an architecture diagram. This system is a classic static site generator model. The admin service creates or modifies things and writes update notifications into a queue. A worker reads data from the queue, does work on the data, and uploads it to a static object store, in this case vendor 2. Then, we put some sort of CDN or serving system, in this case vendor 1, in front of vendor 2.
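Once someone has drawn the architecture, it can help to write it down in a form you can query. The following is a minimal sketch, assuming the system described in Figure 4. The component names and the `affected_by` helper are illustrative labels chosen for this example, not names taken from a real system.

```python
from collections import deque

# Each component maps to the components it depends on.
# Names are illustrative stand-ins for the boxes described in Figure 4.
DEPENDS_ON = {
    "admin": ["queue"],                   # admin writes update notifications
    "worker": ["queue", "object_store"],  # worker reads the queue, uploads results
    "cdn": ["object_store"],              # vendor 1 sits in front of vendor 2
    "queue": [],
    "object_store": [],                   # vendor 2
}


def affected_by(failed, depends_on):
    """Return every component that directly or transitively depends on `failed`."""
    # Invert the graph so we can walk from a component to its dependents.
    dependents = {}
    for name, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(name)

    seen = set()
    todo = deque([failed])
    while todo:
        current = todo.popleft()
        for dependent in dependents.get(current, []):
            if dependent not in seen:
                seen.add(dependent)
                todo.append(dependent)
    return sorted(seen)


if __name__ == "__main__":
    for component in ("queue", "object_store"):
        print(component, "->", affected_by(component, DEPENDS_ON))
```

Asking what is affected when the queue or the object store fails gives a quick answer to "what depends on this?" without needing the whiteboard in front of you.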

    Name      | Role                  | Manager | Things they know/specializations
    ----------|-----------------------|---------|---------------------------------
    Akil      | Junior Full Stack Dev | Jeff    | Seems pretty new and jumps around a lot.
    Catherine | Senior Frontend Dev   | Jeff    | Does a lot of initial design prototyping and built most of the frontend originally.
    Kareem    | Senior Mobile Dev     | Melissa | Wrote both mobile apps.
    Steph     | Senior Backend Dev    | Melissa | TO DO: Set up a one-on-one to understand mobile backend.
    Suzy      | Full Stack Dev        | Jeff    | Animation wizard who knows the database for CMS better than anyone.
    Tom       | Full Stack Dev        | Jeff    | Frontend architecture, made initial protocol buffers and knows sync queue best.

    Table 1: An example table with notes on people in the project. With this, we have a reference on team structure. If we need to know who to talk to about mobile apps, we can look at our handy chart and see that we need to talk to Kareem or the manager, Melissa.
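If you prefer to keep these notes somewhere searchable, a small script is enough. This is a rough sketch that stores the rows of Table 1 in a list and looks people up by topic. The `TeamMember` class and the `who_to_ask` function are hypothetical names made up for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TeamMember:
    name: str
    role: str
    manager: str
    notes: str


# Rows copied from Table 1.
TEAM = [
    TeamMember("Akil", "Junior Full Stack Dev", "Jeff",
               "Seems pretty new and jumps around a lot."),
    TeamMember("Catherine", "Senior Frontend Dev", "Jeff",
               "Does a lot of initial design prototyping and built most of "
               "the frontend originally."),
    TeamMember("Kareem", "Senior Mobile Dev", "Melissa",
               "Wrote both mobile apps."),
    TeamMember("Steph", "Senior Backend Dev", "Melissa",
               "TO DO: Set up a one-on-one to understand mobile backend."),
    TeamMember("Suzy", "Full Stack Dev", "Jeff",
               "Animation wizard who knows the database for CMS better than anyone."),
    TeamMember("Tom", "Full Stack Dev", "Jeff",
               "Frontend architecture, made initial protocol buffers and "
               "knows sync queue best."),
]


def who_to_ask(topic, team):
    """Return (name, manager) pairs whose role or notes mention `topic`."""
    needle = topic.lower()
    return [
        (member.name, member.manager)
        for member in team
        if needle in member.role.lower() or needle in member.notes.lower()
    ]


if __name__ == "__main__":
    print(who_to_ask("mobile apps", TEAM))
```

A plain substring match is crude, but for a team of this size it is usually all you need. A shared document or spreadsheet would work just as well.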

Now that you have context for the project or service, start working through each chapter of the book and ask:

  • Does the service have monitoring?
  • Does the team have plans for incident response?
  • Does the team create postmortems? Are they stored anywhere?
  • How is the service tested? Does the project have a release plan?
  • Has anyone done any capacity planning?
  • What tools could we build to improve the service?
  • Is the current level of reliability providing a positive user experience?

Tip

The trick to note here is that these questions could be asked about a piece of software that has been running for years, as well as one that is just being created.
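One way to keep track of the answers is a simple checklist per service. The sketch below assumes you record each level as either in place or not, and it reports the lowest level that still needs attention. The level names follow the chapter titles. The service names and the `next_gap` helper are hypothetical.

```python
# Hierarchy levels in the order the chapters cover them, lowest first.
LEVELS = [
    "monitoring",
    "incident response",
    "postmortems",
    "testing and releasing",
    "capacity planning",
    "building tools",
    "user experience",
]


def next_gap(answers):
    """Return the lowest level not marked as in place, or None if all are."""
    for level in LEVELS:
        if not answers.get(level, False):
            return level
    return None


# Hypothetical audit notes: True means the team already has this level covered.
audit = {
    "cms": {"monitoring": True, "incident response": True},
    "sync-queue": {},
}

if __name__ == "__main__":
    for service, answers in audit.items():
        print(f"{service}: start with {next_gap(answers)}")
```

A yes-or-no answer per level is a simplification. In practice you will likely want notes next to each answer, but even this much shows where each service sits in the hierarchy.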

The service you are investigating could be a large project with many pieces of software (a service-oriented architecture (SOA), for example) or a single monolithic application. If you are working on a project with many services, then work through each service one at a time. The downside is that if you want to build a framework that fits all of the services you are interacting with, you will not know how best to address their problems and needs until you have done a fair amount of research and work. The upside is that you will not be pulled immediately in many directions and will be able to focus on one specific service's problems.

Your time and energy are limited, and there will always be more people to work with than you have time for, so make sure to take it slow. Going slow means that things do not fall through the cracks. You also do not want to burn out before each service has the bottom few levels of its hierarchy filled in.
