Points to remember
Here are some important points to remember:
- 100% is an unrealistic reliability target.
- Deriving SLIs from logs and ingesting telemetry both add latency to SLI measurement.
- Application-level metrics are a poor fit for measuring complex user journeys.
- SLOs must be set based on conversations with engineering and product teams.
- If there is no error budget left, the focus should shift from releasing new features to improving reliability.
- TTD (time to detect) is the time taken to identify, or receive a report, that an issue exists.
- TTR (time to resolve) is the time taken to resolve an issue.
- To improve the reliability of a service, reduce TTD, reduce TTR, reduce impact %, and increase TTF/TBF (time to failure/time between failures).
- SLIs should have a predictable relationship with user happiness and should be aggregated over time.
- User expectations are strongly tied to past performance.
- Setting values for SLIs and SLOs should be an iterative process.
- Advanced techniques to manage error budgets are dynamic release cadence, setting up error budget exhaustion rates, rainy-day funds, and the use of silver...
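
The arithmetic behind several of these points can be sketched in a few lines of Python. This is a minimal illustration, not a definitive implementation: the function names are hypothetical, the 30-day window is an assumed default, and the expected-unreliability formula, (TTD + TTR) x impact % / TBF, is a simplified model that assumes failures recur at a steady interval.

```python
def error_budget_fraction(slo_target: float) -> float:
    """Return the fraction of time (or events) allowed to fail under an SLO.

    slo_target is a fraction, e.g. 0.999 for a 99.9% target.
    """
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be in the range (0, 1]")
    return 1.0 - slo_target


def allowed_downtime_minutes(slo_target: float, window_days: int = 30) -> float:
    """Convert the error budget into minutes of full outage per window.

    The 30-day window is an assumption made for this sketch.
    """
    return error_budget_fraction(slo_target) * window_days * 24 * 60


def expected_unreliability(ttd_min: float, ttr_min: float,
                           impact_fraction: float, tbf_min: float) -> float:
    """Estimate the long-run fraction of bad service (simplified model).

    Each incident lasts TTD + TTR minutes, affects impact_fraction of
    users or requests, and recurs once every tbf_min minutes.
    """
    if tbf_min <= 0:
        raise ValueError("tbf_min must be positive")
    return (ttd_min + ttr_min) * impact_fraction / tbf_min


if __name__ == "__main__":
    # A 100% target leaves no budget at all, which is why it is unrealistic.
    print(allowed_downtime_minutes(1.0))
    # A 99.9% target over 30 days leaves roughly 43 minutes of budget.
    print(round(allowed_downtime_minutes(0.999), 1))
    # Hypothetical incident profile: 10 min to detect, 50 min to resolve,
    # 20% of users affected, one incident every 30 days.
    baseline = expected_unreliability(10, 50, 0.2, 30 * 24 * 60)
    # Halving TTD and TTR halves the expected unreliability.
    improved = expected_unreliability(5, 25, 0.2, 30 * 24 * 60)
    print(baseline, improved)
```

In this model, the four levers listed above (lower TTD, lower TTR, lower impact %, higher TBF) each reduce the result independently, so the sketch can be used to compare which improvement buys back the most error budget for a given service.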