Managing Incidents Using Alerts
This chapter will explore the concepts of incident management. We will discuss how to build a world-class incident management process that treats the people responding to incidents humanely and helps them avoid burnout. The chapter will set out who is responsible for that process, from senior leadership teams to the engineers responding to a callout. It will introduce the key concepts behind building an organization that can handle incidents and excel at providing customers with a stable experience. With the process established, we’ll explain how to examine a service and pick the critical measures that show its current service level without the signal being drowned out by noise.
This chapter will also explore the three tools available from Grafana for incident management. First, there’s Grafana Alerting, which is used to monitor metrics and logs for failures and trigger notifications to responding teams. Then, there’s Grafana OnCall,...