Lessons from SRE
Google takes system reliability very seriously. In 2003, Google began shifting from existing models of operations and support to a new, developer-focused approach to reliability engineering known as Site Reliability Engineering (SRE). The results have been significant for Google, both for its internal products and for its cloud offerings. In recent years, SRE has gained considerable traction in the larger developer community, building on the momentum of the ongoing DevOps movement.
While the topic of Site Reliability Engineering is broad and extends far beyond the scope of this chapter, many key aspects of SRE are closely related to the topics covered here, as well as to those covered in Chapter 12, Change Management. In fact, many of the tools available in Stackdriver are the same tools used internally by Google SREs. Google defines reliability as a function of mean-time-to-failure (MTTF) and mean-time-to-recovery (MTTR...