Understanding error budgets
Once SLOs are set based on SLIs specific to user journeys that define system availability and reliability by quantifying users' expectations, it is important to understand how unreliable the service is allowed to be. This acceptable level of unreliability or unavailability is called an error budget.
The unavailability or unreliability of a service can be caused due to several reasons, such as planned maintenance, hardware failure, network failures, bad fixes, and new issues introduced while introducing new features.
Error budgets put a quantifiable target on the amount of unreliability that could be tracked. They create a common incentive between development and operations teams. This target is used to balance the urge to push new features (thereby adding innovation to the service) against ensuring service reliability.
An error budget is basically the inverse of availability, and it tells us how unreliable your service is allowed to be. If...