The notifications that alert you that something is wrong can be as vague as "Something is wrong with the website." Well, that's not very useful for troubleshooting, finding the root cause, and fixing it. This is especially true for microservice-based architectures, where a single user request can be handled by a large number of microservices and each component might fail in interesting ways. There are several ways to try to narrow down the scope:
- Look at recent deployments and configuration changes.
- Check whether any of your third-party dependencies suffered an outage.
- Consider similar past issues, especially ones whose root cause hasn't been fixed yet.
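
The first check in the list above, looking at recent deployments and configuration changes, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `ChangeEvent` record, the `recent_changes` helper, the two-hour lookback window, and the sample data are all hypothetical, and a real system would pull these events from its own deployment history or audit log.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class ChangeEvent:
    """A deployment or configuration change (hypothetical record shape)."""
    service: str
    kind: str  # for example "deploy" or "config"
    applied_at: datetime


def recent_changes(events, alert_time, lookback=timedelta(hours=2)):
    """Return the changes applied within `lookback` before the alert, newest first."""
    window_start = alert_time - lookback
    candidates = [e for e in events if window_start <= e.applied_at <= alert_time]
    return sorted(candidates, key=lambda e: e.applied_at, reverse=True)


# Made-up sample data, for illustration only.
alert = datetime(2024, 1, 1, 12, 0)
events = [
    ChangeEvent("checkout", "deploy", datetime(2024, 1, 1, 11, 30)),
    ChangeEvent("search", "config", datetime(2024, 1, 1, 8, 0)),
    ChangeEvent("auth", "config", datetime(2024, 1, 1, 11, 55)),
]
suspects = recent_changes(events, alert)
for change in suspects:
    print(change.applied_at.isoformat(), change.service, change.kind)
```

Sorting newest first puts the changes closest to the alert at the top of the list, which is usually where the investigation starts. The lookback window is a judgment call: too short and a slow-burning change is missed, too long and the list stops narrowing anything down.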
If you're lucky, you can just diagnose the problem right away. However, when debugging large-scale distributed systems, you don't really want to rely on luck. It's much better to have a methodical approach...