Problem-solving – enabling recovery
For SREs, a solid incident management process is essential when things go wrong in production. A good process helps you meet three goals, commonly referred to as the three Cs:
- Coordinate the response
- Communicate between the incident participants, others in the organization, and interested parties in the outside world
- Maintain control over the incident response
In the Managing Incidents chapter of Site Reliability Engineering: How Google Runs Production Systems, Andrew Stribblehill describes the elements Google considers necessary for its incident command system. These elements include the following:
- Clearly defined incident management roles
- A (virtual or physical) command post
- A living incident state document
- Clear handoffs to others
Let’s look at these elements in detail.
Incident management roles
Upon recognition that what you are facing is truly...