In real-life operations, the ability to quickly detect and debug a problem is critical. In this chapter, we will discuss the two most important tools we can use to discover what's happening in a production cluster that is processing a high number of requests. The first tool is logs, which help us understand what's happening within a single request. The second is metrics, which characterize the aggregated performance of the system.
The following topics will be covered in this chapter:
- Observability of a live system
- Setting up logs
- Detecting problems through logs
- Setting up metrics
- Being proactive
By the end of this chapter, you'll know how to add logs so that they are available for detecting problems, how to add and plot metrics, and how logs and metrics differ from each other.
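As a rough preview of that difference, the following minimal sketch uses only the Python standard library. It is not the setup used later in the chapter: the `handle_request` function, the logger name, the paths, and the in-memory `status_counts` counter are all hypothetical stand-ins for a real service and a real metrics system. Each request produces one log line describing that request, while the counter only keeps aggregated totals.

```python
import io
import logging
from collections import Counter

# Hypothetical setup for illustration: logs go to an in-memory stream
# so that they can be inspected at the end of the example.
log_stream = io.StringIO()
logger = logging.getLogger("example.requests")
logger.setLevel(logging.INFO)
logger.propagate = False
handler = logging.StreamHandler(log_stream)
handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))
logger.addHandler(handler)

# A stand-in for a metrics system: aggregated counts per status code.
status_counts = Counter()


def handle_request(request_id, path):
    """Pretend to serve a request, emitting one log line and one metric update."""
    status = 200 if path == "/ok" else 404
    # Log: describes what happened within this single request.
    logger.info("request_id=%s path=%s status=%d", request_id, path, status)
    # Metric: only the aggregate is kept, not the individual request.
    status_counts[status] += 1
    return status


for request_id, path in enumerate(["/ok", "/ok", "/missing"], start=1):
    handle_request(request_id, path)

log_lines = log_stream.getvalue().splitlines()
print("\n".join(log_lines))
print(dict(status_counts))
```

The log lines make it possible to trace what happened to request 3 in particular, while the counter answers questions such as how many requests returned a 404 overall.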