Learning site reliability best practices
In this section, we will learn about considerations and best practices followed by the industry site reliability experts that handle technical site availability issues when observed.
Site Reliability Engineering (SRE) is a discipline introduced by the Google engineering team. Google's approach of operating their core services at scale still represents a model for SRE best practices today. You can read more about the foundations and practices on the Google SRE resources site at https://sre.google/resources/. Before we learn about the monitoring and metric visualization tools, let's learn about a few common-sense SRE best practices we should consider:
- Automate everything possible and automate now: SREs should take every opportunity to automate time-consuming infrastructure tasks. As part of a DevOps culture, SREs work with autonomous teams choosing their own services, which makes the unification of tools almost impossible...