Section 1: Site Reliability Engineering – A Prescriptive Way to Implement DevOps
The core focus of this section is Site Reliability Engineering (SRE). The section starts with the evolution of DevOps and its life cycle and introduces SRE as a prescriptive way to implement DevOps. The emphasis is on defining the reliability of a service and ways to measure it. In this section, SRE technical practices such as Service-Level Agreement (SLA), Service-Level Objective (SLO), Service-Level Indicator (SLI), error budget, and eliminating toil are introduced and are further elaborated. Fundamentals and critical concepts around monitoring, alerting, and time series are also explored to track SRE technical practices. However, SRE technical practices cannot be implemented within an organization without bringing cultural change. Google prescribes a set of SRE cultural practices such as incident management, being on call, psychological safety, and the need to foster collaboration. This section...