Chapter 20: Introducing Site Reliability Engineering in Multi-Cloud
This book has dealt with designing, implementing, and controlling a multi-cloud platform. We built that platform for a reason: to host applications. Applications cannot live without infrastructure, and infrastructure is useless without applications. Controlling an environment therefore means controlling both applications and infrastructure. Google's answer to this challenge is Site Reliability Engineering (SRE), which incorporates aspects of software engineering and applies them to infrastructure and operations problems.
How would that work in multi-cloud? After completing this chapter, you will have a good understanding of the concept behind SRE. You will learn that SRE is driven by risk analysis, which determines the service-level objective (SLO). Next, we discuss monitoring the SLO: reliability can be measured, and to measure it, teams need observability. In the last section, the implementation of SRE is studied...