Incidents
Unfortunately, at some point in its life, the system won't behave as it should. It will produce an error serious enough that it needs to be dealt with immediately.
An incident is defined as a problem that disrupts the service so much that it requires an emergency response.
This doesn't necessarily mean that the whole service is interrupted – it could be a noticeable degradation of the external service, or even a problem in one internal service that reduces the overall quality of service. For example, if an asynchronous task handler is failing 50% of the time, external customers may only see that their tasks take longer, but that is probably serious enough to warrant corrective action.
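As a rough illustration of the task handler example, the check below sketches how a failure rate taken from monitoring counters could be compared against a threshold to decide whether an emergency response is warranted. The names `TaskWindow`, `needs_emergency_response`, and `INCIDENT_FAILURE_RATE`, as well as the 25% threshold, are assumptions made for this sketch and are not taken from any particular monitoring tool.

```python
from dataclasses import dataclass


@dataclass
class TaskWindow:
    """Task outcomes counted over one monitoring window (hypothetical shape)."""

    succeeded: int
    failed: int

    @property
    def failure_rate(self) -> float:
        total = self.succeeded + self.failed
        # With no tasks processed there is nothing to judge yet.
        return self.failed / total if total else 0.0


# Assumed threshold; a sensible value depends on the service's normal error rate.
INCIDENT_FAILURE_RATE = 0.25


def needs_emergency_response(
    window: TaskWindow, threshold: float = INCIDENT_FAILURE_RATE
) -> bool:
    """Return True when the window's failure rate reaches the threshold."""
    return window.failure_rate >= threshold
```

With these assumed values, a window where half of the tasks fail, such as `TaskWindow(succeeded=50, failed=50)`, is flagged, while an occasional failure like `TaskWindow(succeeded=99, failed=1)` is not.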
During incidents, using every available monitoring tool is critical to finding the problem as soon as possible and being able to correct it. Reaction times should be as short as possible while keeping the risk of corrective actions as low as possible...