Summary
In this chapter, we examined what happens when things go wrong in production. We began by looking at two incidents: the initial release of healthcare.gov in 2013 and the Atlassian cloud outage in 2022. Both incidents showed the importance of prevention and of planning for future incidents.
We then explored how to prepare by looking at key parts of the discipline of site reliability engineering (SRE). SRE begins by defining SLIs and SLOs, which quantify our tolerance for risk through the error budget. SRE also looks at the process of releasing new changes and launching new products.
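As a reminder of how an SLO translates into an error budget, here is a minimal sketch of the arithmetic. The function names and the sample numbers (a 99.9% availability SLO over one million requests) are illustrative assumptions, not values from the chapter.

```python
def error_budget(slo_target: float, total_events: int) -> float:
    """Number of events allowed to fail while still meeting the SLO.

    slo_target is a fraction between 0 and 1, e.g. 0.999 for "99.9%".
    """
    if not 0.0 <= slo_target <= 1.0:
        raise ValueError("slo_target must be between 0 and 1")
    return (1.0 - slo_target) * total_events


def budget_consumed(slo_target: float, total_events: int, failed_events: int) -> float:
    """Fraction of the error budget already used (1.0 means fully spent)."""
    budget = error_budget(slo_target, total_events)
    if budget == 0:
        # A 100% SLO leaves no budget: any failure overspends it.
        return float("inf") if failed_events else 0.0
    return failed_events / budget


if __name__ == "__main__":
    # Hypothetical month: 1,000,000 requests, 250 of which failed the SLI.
    budget = error_budget(0.999, 1_000_000)
    used = budget_consumed(0.999, 1_000_000, 250)
    print(f"Allowed failures: about {budget:.0f}")
    print(f"Budget consumed: about {used:.0%}")
```

With these assumed numbers, roughly 1,000 requests may fail before the SLO is missed, so 250 failures leave most of the budget available for risky changes such as new releases.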
We looked at practicing for disaster through the discipline of chaos engineering. We covered the principles behind the discipline and how to create experiments using the Disasterpiece Theater process.
Ultimately, even with adequate preparation, production failures will still happen. We looked at the key parts of Google’s incident management process for...