Error budgets
As defined by Liz Fong-Jones and Seth Vargo, error budgets represent “a quantitative measure shared between product and SRE teams to balance innovation and stability.”
In simpler terms, an error budget quantifies how much risk a team can take on to introduce new features, conduct service maintenance, perform routine enhancements, absorb network and infrastructure disruptions, and respond to unforeseen situations. Typically, the monitoring system measures the actual uptime of your service, while the SLO sets the target you aim to achieve. The total error budget is the unreliability the SLO permits (100% minus the SLO), and the difference between the measured uptime and the SLO is the portion of that budget still unspent. As long as some budget remains, it can be spent on deploying new releases and other risky changes.
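To make the arithmetic concrete, here is a minimal sketch in Python. It assumes an availability SLO expressed as a percentage and a 30-day rolling window. The function names, the window length, and the sample figures are illustrative assumptions, not values taken from any particular monitoring tool.

```python
# Illustrative sketch: the names, the 30-day window, and the sample numbers
# below are assumptions chosen for the example.

def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Total downtime (in minutes) that the SLO permits over the window."""
    if not 0.0 < slo_percent <= 100.0:
        raise ValueError("slo_percent must be in the range (0, 100]")
    window_minutes = window_days * 24 * 60
    return window_minutes * (100.0 - slo_percent) / 100.0


def remaining_budget_minutes(
    slo_percent: float, measured_uptime_percent: float, window_days: int = 30
) -> float:
    """Unspent budget: the gap between measured uptime and the SLO target.

    A negative result means the budget is exhausted and the SLO was missed.
    """
    window_minutes = window_days * 24 * 60
    return window_minutes * (measured_uptime_percent - slo_percent) / 100.0


# A 99.9% SLO over 30 days allows roughly 43.2 minutes of downtime.
print(f"{error_budget_minutes(99.9):.1f}")             # 43.2
# With 99.95% measured uptime, about half of that budget is still unspent.
print(f"{remaining_budget_minutes(99.9, 99.95):.1f}")  # 21.6
# A 100% SLO leaves no budget at all.
print(f"{error_budget_minutes(100.0):.1f}")            # 0.0
```

Expressing the budget in minutes rather than as a percentage tends to make it easier to compare against the expected duration of a rollout or a maintenance window.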
This is precisely why a 100% SLO is not usually the initial target: it would leave no error budget at all, and therefore no room for releases or maintenance. Error budgets serve the crucial purpose of helping teams strike a balance between innovation and reliability. The rationale behind error budgets lies in the SRE perspective that failures are...