Working with risk analysis in SRE
SRE starts from the premise that reliability can be designed into the architecture of applications and systems, and that it can be measured. Because reliability is a measurable quality, design decisions can influence it. Engineers can take measures to reduce detection, response, and repair times, and they can build systems in such a way that changes can be rolled out safely without causing downtime. Architects can design fault-tolerant systems; engineers can build them.
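To make the link between repair time and measured reliability concrete, here is a minimal sketch in Python. It assumes the common steady-state approximation availability = MTBF / (MTBF + time to restore), where the time to restore is the sum of detection, response, and repair time. The function name and all numbers are hypothetical and serve only as an illustration.

```python
def availability(mtbf_hours: float, detect_hours: float,
                 respond_hours: float, repair_hours: float) -> float:
    """Steady-state availability, assuming MTBF / (MTBF + time to restore)."""
    # Time to restore is modeled as detection + response + repair.
    time_to_restore = detect_hours + respond_hours + repair_hours
    return mtbf_hours / (mtbf_hours + time_to_restore)


# Hypothetical service that fails on average once every 720 hours (30 days).
before = availability(720, detect_hours=0.5, respond_hours=0.5, repair_hours=2.0)

# Same failure rate, but with faster detection, response, and repair.
after = availability(720, detect_hours=0.1, respond_hours=0.2, repair_hours=0.7)

print(f"before: {before:.4%}")
print(f"after:  {after:.4%}")
```

In this sketch the failure rate stays the same; only the time to restore shrinks, and the measured availability improves as a result.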
The catch is that all of this comes at a cost. Whether systems really need to be fault-tolerant is a business decision, based on a business case. In Chapter 1, Introduction to Multi-Cloud, we learned that business cases are driven by risks. Let's go over risk management one more time.
The basic rule is that risk = probability × impact. Enterprises use risk management...
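The rule risk = probability × impact can be sketched in a few lines of Python. The `Risk` class, the risk names, and every probability and cost figure below are hypothetical; they only illustrate how the formula can be used to rank risks against each other.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Risk:
    name: str
    probability: float  # assumed chance of occurring in a year, 0.0 to 1.0
    impact: float       # assumed cost if the event occurs

    @property
    def exposure(self) -> float:
        # risk = probability x impact
        return self.probability * self.impact


# Hypothetical risk register.
risks = [
    Risk("expired certificate", probability=0.20, impact=10_000),
    Risk("regional outage", probability=0.05, impact=400_000),
    Risk("failed deployment", probability=0.50, impact=20_000),
]

# Rank risks from highest to lowest exposure.
ranked = sorted(risks, key=lambda r: r.exposure, reverse=True)

for risk in ranked:
    print(f"{risk.name}: {risk.exposure:,.0f}")
```

Note that a rare event with a high impact can outrank a frequent event with a low impact, which is why both factors need to be estimated.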