Chapter 2: SRE Technical Practices – Deep Dive
Reliability is the most critical feature of a service or a system and should be aligned with business objectives. This alignment should be tracked constantly, meaning that the alignment needs measurement. Site reliability engineering (SRE) prescribes specific technical tools or practices that will help in measuring characteristics that define and track reliability. These tools are service-level agreements (SLAs), service-level objectives (SLOs), service-level indicators (SLIs), and error budgets.
SLAs represent an external agreement with customers about the reliability of a service. SLAs should have consequences if violated (that is, the service doesn't meet the reliability expectations), and the consequences are often monetary in nature. To ensure SLAs are never violated, it is important to set thresholds. Setting these thresholds ensures that an incident is caught and potentially addressed before repeated occurrences of...