What this book covers
Chapter 1, SRE Job Role – Activities and Responsibilities, introduces the site reliability engineer persona and explains who an SRE is.
Chapter 2, Fundamental Numbers – Reliability Statistics, shows how site reliability engineering work and its business impact are measured.
Chapter 3, Imperfect Habits – Duct Tape Architecture and Spaghetti Code, explains why systems are naturally unreliable.
Chapter 4, Essential Observability – Metrics, Events, Logs, and Traces (MELT), discusses how we go from monitoring to true observability.
Chapter 5, Resolution Path – Master Troubleshooting, presents the SRE way of troubleshooting precisely and concisely.
Chapter 6, Operational Framework – Managing Infrastructure and Systems, describes why and how SREs tackle operational work and not just engineering duties.
Chapter 7, Data Consumed – Observability Data Science, teaches the basic mathematical models and statistical methods for SREs.
Chapter 8, Reliable Architecture – Systems Strategy and Design, describes systems thinking as applied to reliability and presents reliable architectural patterns.
Chapter 9, Valued Automation – Toil Discovery and Elimination, familiarizes readers with a critical pillar of site reliability engineering: making operations scalable.
Chapter 10, Exposing Pipelines – GitOps and Testing Essentials, illustrates how to build reliability into DevOps delivery pipelines.
Chapter 11, Worker Bees – Orchestrations of Serverless, Containers, and Kubernetes, presents how workload management affects the reliability of systems.
Chapter 12, Final Exam – Tests and Capacity Planning, demonstrates how good testing and capacity planning keep system performance ahead of demand.
Chapter 13, First Thing – Runbooks and Low Noise Outage Notifications, discusses how well-designed procedures and notifications prepare SREs for problems.
Chapter 14, Rapid Response – Outage Management Techniques, teaches positive SRE behaviors and how to keep interactions focused on resolution during a significant incident.
Chapter 15, Postmortem Candor – Long-Term Resolution, explains how postmortems should lead to actions that make systems more reliable.
Chapter 16, Chaos Injector – Advanced Systems Stability, explains how SREs inject chaos into systems to learn more about them and use gamification to hone their skills.
Chapter 17, Interview Advice – Hiring and Being Hired, shows how companies should hire SREs and how SREs should demonstrate their knowledge during an interview.
Appendix A, The Site Reliability Engineer Manifesto, outlines the primary responsibilities of any SRE.
Appendix B, The 12-Factor App Questionnaire, consolidates a series of questions to test whether an application design is reliable according to the twelve-factor app manifesto from Heroku.