Alerting basics
No microservice operates without incidents. Even a stable, well-tested, and well-maintained service can still experience issues such as the following:
- Resource constraints: A host running the service may experience high CPU utilization or insufficient RAM or disk space.
- Network congestion: The service may experience a sudden increase in load, or one of its dependencies may slow down. Either can limit its ability to process incoming requests or to operate at the expected performance level.
- Dependency failures: Other services or libraries that your service depends on may experience issues that affect your service's execution.
Such issues can be self-resolving. For example, reduced network throughput could be a transient issue caused by temporary maintenance or a network device restarting. Many other types of issues, which we call incidents, require some action from engineers to be mitigated...
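One way to tell a transient issue from a possible incident is to flag a metric only after it has stayed beyond a limit for several consecutive checks. The following is a minimal sketch of that idea, not a prescribed implementation: the `SustainedThreshold` class, the 90% CPU limit, the three-sample window, and the sample values are all hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class SustainedThreshold:
    """Flags a metric only after it exceeds `limit` for `min_samples` consecutive readings.

    Hypothetical helper for illustration; the limit and window are assumptions.
    """
    limit: float
    min_samples: int
    _breaches: int = 0  # length of the current streak of readings above the limit

    def observe(self, value: float) -> bool:
        # A reading at or below the limit resets the streak: the issue resolved itself.
        if value > self.limit:
            self._breaches += 1
        else:
            self._breaches = 0
        return self._breaches >= self.min_samples


# Hypothetical CPU utilization samples (percent), one per check interval.
cpu_check = SustainedThreshold(limit=90.0, min_samples=3)

# Short spikes that recover on their own never reach three breaches in a row.
transient = [cpu_check.observe(v) for v in [95.0, 50.0, 96.0, 40.0]]

# Three consecutive breaches suggest a problem that may need an engineer's attention.
sustained = [cpu_check.observe(v) for v in [95.0, 97.0, 99.0]]

print(transient)
print(sustained)
```

In this sketch, the isolated spikes in the first series are never flagged, while the third consecutive high reading in the second series is. Requiring a condition to persist before flagging it trades a short detection delay for fewer false alarms on self-resolving issues.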