Performance metrics
When a microservice eats 100% of server memory, bad things will happen. On Linux, the kernel may just kill the greedy process using the infamous out-of-memory killer (OOM killer).
A service can end up using too much RAM for several reasons:
- The microservice has a memory leak and steadily grows, sometimes at a very fast pace. It's very common in Python C extensions to forget to decrement an object's reference count and leak it on every call.
- The code uses memory without care. For example, a dictionary that's used as an ad hoc memory cache can grow indefinitely over the days unless there's an upper limit by design.
- There's simply not enough memory allocated to the service: the server is getting too many requests or is too weak for the job.
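The second cause in the list, an ad hoc cache that grows without limit, can be avoided by giving the cache an upper bound by design. The following is a minimal sketch, not a definitive implementation: the `BoundedCache` class and its `max_entries` parameter are hypothetical names chosen for illustration, and the eviction policy (least recently used) is one assumption among several reasonable ones.

```python
from collections import OrderedDict


class BoundedCache:
    """A tiny least-recently-used cache with a hard cap on entries.

    Hypothetical helper for illustration; not part of any library.
    """

    def __init__(self, max_entries=1024):
        if max_entries < 1:
            raise ValueError("max_entries must be at least 1")
        self.max_entries = max_entries
        self._data = OrderedDict()

    def get(self, key, default=None):
        try:
            value = self._data[key]
        except KeyError:
            return default
        # Mark the key as most recently used.
        self._data.move_to_end(key)
        return value

    def set(self, key, value):
        self._data[key] = value
        self._data.move_to_end(key)
        # Evict the least recently used entries once the cap is exceeded.
        while len(self._data) > self.max_entries:
            self._data.popitem(last=False)

    def __len__(self):
        return len(self._data)
```

Note that this caps the number of entries, not the number of bytes, so it only bounds memory if the cached values are themselves of limited size. For caching function results, the standard library's `functools.lru_cache(maxsize=...)` offers a similar bound without custom code.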
It's important to be able to track memory usage over time to find out about these issues before they impact users.
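As a rough illustration of tracking memory over time, the sketch below takes periodic readings with the standard library's `tracemalloc` module and checks whether they keep growing. The helper names `record_reading` and `is_growing` are hypothetical, and the "every reading is larger than the last" rule is a deliberately naive assumption; a real service would more likely ship the readings to a monitoring system and alert on a trend.

```python
import time
import tracemalloc


def record_reading(history):
    """Append (timestamp, current_bytes, peak_bytes) to history.

    Hypothetical helper; tracemalloc.start() must have been called,
    otherwise both values are reported as 0.
    """
    current, peak = tracemalloc.get_traced_memory()
    history.append((time.time(), current, peak))
    return current


def is_growing(history, min_samples=3):
    """Return True if every reading is larger than the one before it."""
    if len(history) < min_samples:
        return False
    sizes = [current for _, current, _ in history]
    return all(later > earlier for earlier, later in zip(sizes, sizes[1:]))
```

A service could call `tracemalloc.start()` at startup and `record_reading(history)` from a periodic task. Keep in mind that `tracemalloc` only sees memory allocated through Python's own allocators, so a leak inside a C extension that calls `malloc` directly would not show up here; the process's resident memory as reported by the operating system is the more complete signal.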
Reaching 100% of the CPU in production is also problematic. While it's desirable to maximize the CPU usage, if the server is too busy when new requests...