Profiling
If we analyze individual traces corresponding to thread pool starvation or memory leaks, we will not see anything special. The traced operations are fast under a small load and get slower or fail when the load increases.
However, some performance issues only affect certain scenarios, at least under typical load. Lock contention and inefficient code are examples of such issues.
We rarely instrument local operations with distributed tracing, under the assumption that local calls are fast and that exceptions carry enough information for us to investigate failures.
But what happens when we have compute-heavy or just inefficient code in the service? If we look at distributed traces, we’ll see high latency and gaps between spans, but we won’t know what causes them.
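To make the idea of a gap between spans concrete, here is a minimal sketch in Python. It does not use a real tracing library: `span`, `busy_work`, `handle_request`, and `unaccounted_time` are hypothetical helpers invented for this illustration, and the sleeps stand in for instrumented I/O calls. The request handler runs a compute-heavy local step that has no span of its own, so its time shows up only as the difference between the parent span and its children.

```python
import time
from contextlib import contextmanager

# Finished spans as (name, start, end) tuples. A stand-in for a tracer, not a real API.
spans = []


@contextmanager
def span(name):
    """Record the wall-clock start and end of the wrapped block."""
    start = time.perf_counter()
    try:
        yield
    finally:
        spans.append((name, start, time.perf_counter()))


def busy_work(seconds):
    """Simulate compute-heavy local code by spinning on the CPU."""
    deadline = time.perf_counter() + seconds
    while time.perf_counter() < deadline:
        pass


def handle_request():
    with span("handle_request"):
        with span("read_from_db"):
            time.sleep(0.01)   # stands in for an instrumented I/O call
        busy_work(0.02)        # local code with no span: invisible in the trace
        with span("call_downstream"):
            time.sleep(0.01)   # stands in for another instrumented call


def unaccounted_time(recorded, parent_name):
    """Parent duration minus the time covered by all other spans.

    Assumes every other span is a direct, non-overlapping child of the parent.
    """
    parent = next(s for s in recorded if s[0] == parent_name)
    covered = sum(end - start for name, start, end in recorded if name != parent_name)
    return (parent[2] - parent[1]) - covered


if __name__ == "__main__":
    handle_request()
    gap = unaccounted_time(spans, "handle_request")
    print(f"time not covered by child spans: {gap * 1000:.1f} ms")
```

The trace tells us that some time inside `handle_request` is not attributed to any child operation, but nothing in the recorded spans says which code consumed it.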
We know ahead of time that some operations, such as complex algorithms or I/O, can take a long time to complete or fail, so we can deliberately instrument them with tracing or just write a log record. But we rarely introduce...