Incident management
Incident management is one of the key responsibilities of an SRE. An incident is an event that indicates a possible issue with a service or an application. The issue can be minor in the best case or an outage in the worst case. An incident can be triggered by an alert that was set up as part of monitoring the service or application.
An alert indicates that a service's SLOs are being violated or are on track to be violated. Sometimes, particularly for an external-facing application, an incident is triggered by an end user complaining on social media. Such incidents call for an additional layer of retrospection on how or why the existing alerting system failed to identify the incident.
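To make the distinction between an SLO that is "violated" and one that is "on track to be violated" concrete, the following is a minimal sketch in Python. It assumes an availability-style SLO and uses an error-budget burn rate, which is one common way to express this idea and is not necessarily the approach this text has in mind. The function name alert_state, its parameters, and the threshold of 1.0 are all hypothetical and chosen only for illustration.

```python
def alert_state(slo_target: float, total: int, failed: int,
                budget_spent: float, burn_threshold: float = 1.0) -> str:
    """Classify a service as 'ok', 'at_risk', or 'violated'.

    slo_target     -- hypothetical availability target, e.g. 0.999
    total, failed  -- request counts observed in a recent window
    budget_spent   -- fraction of the period's error budget already used
    burn_threshold -- assumed cutoff; a burn rate above it means the
                      budget is being consumed faster than is sustainable
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")

    # The error budget is the share of requests the SLO allows to fail.
    error_budget = 1.0 - slo_target

    # Burn rate: observed error rate relative to the allowed error rate.
    observed_error_rate = failed / total if total else 0.0
    burn_rate = observed_error_rate / error_budget

    if budget_spent >= 1.0:
        return "violated"   # the budget for the period is exhausted
    if burn_rate > burn_threshold:
        return "at_risk"    # on track to be violated if this continues
    return "ok"


if __name__ == "__main__":
    # 5 failures in 1,000 requests against a 99.9% target burns the
    # budget at roughly five times the sustainable rate.
    print(alert_state(0.999, total=1000, failed=5, budget_spent=0.3))
```

In this sketch, an alert would fire for both the "at_risk" and "violated" states, so that an incident can be opened before the SLO is actually breached.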
Effective incident management is a critical SRE cultural practice that is key to limiting the disruption caused by an incident and is critical...