Monitoring is one part of operational excellence; the other part is handling alerts and acting on them. With alerts, you define system thresholds and the conditions under which the team needs to act. For example, if server CPU utilization stays at or above 70% for 5 minutes, the monitoring tool records high server utilization and sends an alert to the operations team so that they can bring CPU utilization down before the system crashes. In response to this incident, the operations team can add a server manually. When automation is in place, the alert triggers autoscaling to add more servers as per demand. It also sends a notification to the operations team, which can be addressed later.
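The CPU threshold rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class `ThresholdAlert`, the function `respond`, and the action names are hypothetical and do not belong to any particular monitoring tool. The sketch assumes that utilization samples arrive with a timestamp in seconds and that the alert should fire only when utilization has stayed at or above the threshold for the whole window.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ThresholdAlert:
    """Hypothetical rule: fire when CPU stays at or above a threshold for a full window."""
    threshold_pct: float = 70.0   # CPU utilization threshold from the example
    window_seconds: int = 300     # 5 minutes
    _breach_start: Optional[float] = None  # when the current breach began

    def observe(self, timestamp: float, cpu_pct: float) -> bool:
        """Record one sample; return True if the alert condition is met."""
        if cpu_pct < self.threshold_pct:
            # Utilization dropped below the threshold, so the breach is over.
            self._breach_start = None
            return False
        if self._breach_start is None:
            self._breach_start = timestamp
        return timestamp - self._breach_start >= self.window_seconds


def respond(alert_fired: bool, autoscaling_enabled: bool) -> List[str]:
    """Return the (hypothetical) actions to take for an alert."""
    if not alert_fired:
        return []
    if autoscaling_enabled:
        # Automation adds capacity; the team is notified and can follow up later.
        return ["add_server", "notify_operations_team"]
    # Without automation, the team must act on the alert directly.
    return ["page_operations_team"]


if __name__ == "__main__":
    alert = ThresholdAlert()
    # One sample per minute at 75% CPU, from t=0 to t=300 seconds.
    for t in range(0, 301, 60):
        fired = alert.observe(t, 75.0)
        print(t, fired, respond(fired, autoscaling_enabled=True))
    # The alert first fires at t=300, once the breach has lasted 5 minutes.
```

Requiring the breach to persist for the whole window avoids alerting on short spikes; a real monitoring service would typically express the same idea as configuration instead of code.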
Often, you need to define alert categories so that the operations team can prepare its response according to alert severity. The following levels of severity provide an example of how to categorize alert priority:
- Severity 1: Sev1 is a critical priority issue. A Sev1 issue...