Key principles and practices
The SRE team’s day-to-day activities include developing and maintaining large, distributed services. Keeping a service healthy requires a wide range of activities, such as building monitoring systems, planning capacity, responding to incidents, and resolving the root causes of outages.
This section covers the key principles and practices that influence the SRE team’s day-to-day activities. The following diagram depicts the elements necessary to make a service reliable, from the most basic to the most advanced:
Figure 2.3 – Service reliability hierarchy according to Google’s SRE book
Google describes this reliability hierarchy as the set of elements needed to make a system reliable and keep its services healthy, from the most basic requirement up to the capstone step of launching a product or service. Each level of the pyramid is discussed briefly below:
- Monitoring: Monitoring...