Performance analysis overview
Now that you know the core concepts around distributed tracing, let’s see how we can use the observability stack to investigate common distributed system problems.
The baseline
Before we talk about problems, let’s establish a baseline representing the behavior of a healthy system. We also need a baseline to make data-driven decisions for common design and development tasks such as the following:
- Risk estimation: Any feature work on the hot path is a good candidate for additional performance testing prior to release and for guarding new code with feature flags.
- Capacity planning: Knowing the current load is necessary to understand whether a system can handle planned growth and new features.
- Understanding improvement potential: It makes more sense to optimize frequently executed code, as even small optimizations bring significant performance gains or cost reductions. Similarly, improving reliability brings the most benefit to components that have a higher error rate and that other services depend on.
- Learning usage patterns: Depending on how users interact with your system, you might change your scaling or caching strategy, extract specific functionality to a new service, or merge services.
Generic indicators that describe the performance of each service include the following:
- Latency: How fast a service responds
- Throughput: How many requests, events, or bytes the service is handling per second
- Error rate: How many errors a service returns
Your system might need other indicators to measure durability or data correctness.
Each of these signals is useful when it includes an API route, a status code, and other context properties. For example, the error rate could be low overall but high for specific users or API routes.
Measuring signals on the server and client sides, whenever possible, gives you a better picture. For example, you can detect network failures and avoid “it works on my machine” situations when clients see issues and servers don’t.
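As an illustration, here is a minimal sketch of recording request latency with route and status code tags using System.Diagnostics.Metrics; the meter, instrument, and tag names are examples rather than something prescribed by a specific instrumentation library:

```csharp
using System.Collections.Generic;
using System.Diagnostics.Metrics;

static class FrontendMetrics
{
    // The meter and instrument names are examples.
    private static readonly Meter Meter = new("MemeService.Frontend");

    private static readonly Histogram<double> RequestDuration =
        Meter.CreateHistogram<double>("http.server.request.duration", unit: "ms");

    public static void RecordRequest(string route, int statusCode, double elapsedMs)
    {
        // Tags make it possible to slice latency, throughput, and error rate
        // by API route and status code.
        RequestDuration.Record(elapsedMs,
            new KeyValuePair<string, object?>("http.route", route),
            new KeyValuePair<string, object?>("http.response.status_code", statusCode));
    }
}
```

Percentile latency per route, throughput, and error rate can then all be derived from this single instrument on the backend.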
Investigating performance issues
Let’s divide performance issues into two overlapping categories:
- Widespread issues that affect a whole instance, server, or even the system, and move the distribution median.
- An individual request or job that takes too much time to complete. If we visualize the latency distribution, as shown in Figure 1.12, we’ll see such issues in the long tail of the distribution – they are rare, but part of normal behavior.
Figure 1.12 – Azure Monitor latency distribution visualization, with a median request (the 50th percentile) taking around 80 ms and the 95th percentile around 250 ms
Long tails
Individual issues can be caused by an unfortunate chain of events – transient network issues, high contention in optimistic concurrency algorithms, hardware failures, and so on.
Distributed tracing is an excellent tool to investigate such issues. If you have a bug report, you might have a trace context for the problematic operation. To make this possible, show the traceparent value on the web page, return traceresponse or a document that users need to record, or log traceresponse when sending requests to your service.
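For example, assuming an ASP.NET Core frontend, a rough sketch of returning the current trace context to callers could look like the following; traceresponse is still a draft W3C header, and how exactly you surface the value (header, page element, or support ID) is up to you:

```csharp
using System.Diagnostics;

var builder = WebApplication.CreateBuilder(args);
var app = builder.Build();

// Return the current trace context so callers can quote it in bug reports.
// Activity.Current.Id holds the W3C traceparent value with the default ID format.
app.Use(async (context, next) =>
{
    if (Activity.Current is { } activity)
    {
        context.Response.Headers["traceresponse"] = activity.Id;
    }

    await next();
});

app.MapGet("/memes/{name}", (string name) => $"meme: {name}");
app.Run();
```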
So, if you know the trace context, you can start by checking the trace view. For example, in Figure 1.13, you can see an example of a long request caused by transient network issues.
Figure 1.13 – A request with high latency caused by transient network issues and retries
The frontend request took about 2.6 seconds, and the time was spent in the storage service downloading meme content. We see three tries of Azure.Core.Http.Request, each of which was fast, and the time between them corresponds to the back-off interval. The last try was successful.
If you don’t have the trace-id, or perhaps if the trace was sampled out, you might be able to filter similar operations based on the context and high latency.
For example, in Jaeger, you can filter spans based on the service, span name, attributes, and duration, which helps you to find a needle in a haystack.
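Filtering like this only works if spans carry the right context, so enrich them with application-specific attributes ahead of time. A minimal sketch, with made-up attribute names:

```csharp
using System.Diagnostics;

static class SpanEnrichment
{
    public static void EnrichCurrentSpan(string userId, string memeName)
    {
        // Attribute names are made up; record the dimensions you expect to
        // filter or group by later (user, tenant, meme, partition, and so on).
        Activity.Current?.SetTag("app.user_id", userId);
        Activity.Current?.SetTag("app.meme_name", memeName);
    }
}
```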
In some cases, you will end up with mysterious gaps – the service was up and running but spent significant time doing nothing, as shown in Figure 1.14:
Figure 1.14 – A request with high latency and gaps in spans
If you don’t get enough data from traces, check whether there are any logs available in the scope of this span.
You might also check resource utilization metrics – was there a CPU spike, or maybe a garbage collection pause at this moment? You might find some correlation using timestamps and context, but it’s impossible to tell whether this was a root cause or a coincidence.
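If you use Microsoft.Extensions.Logging, one way to make the log-to-trace correlation easier is to stamp trace and span IDs onto log scopes. A minimal sketch; the logger category and message are placeholders, and the console provider requires the corresponding logging package:

```csharp
using Microsoft.Extensions.Logging;

// Stamp trace and span IDs onto log scopes so that formatters and exporters
// that include scopes can correlate each log record with the span it was written in.
using var loggerFactory = LoggerFactory.Create(logging =>
{
    logging.Configure(options =>
        options.ActivityTrackingOptions =
            ActivityTrackingOptions.TraceId | ActivityTrackingOptions.SpanId);
    logging.AddConsole();
});

var logger = loggerFactory.CreateLogger("StorageService");
logger.LogInformation("Downloading meme content");
```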
If you have a continuous profiler that correlates profiles to traces (yes, they can do it with Activity.Current), you can check whether there are profiles available for this or similar operations.
We’ll see how to investigate this further with .NET diagnostics tools in Chapter 4, Low-Level Performance Analysis with Diagnostic Tools, but if you’re curious about what happened in Figure 1.14, the service read a network stream that was not instrumented.
Even though we talk about individual performance issues, in many cases we don’t know how widespread they are, especially when we’re at the beginning of an incident. Metrics and rich queries across traces can be used to find out how common a problem is. If you’re on call, checking whether an issue is widespread or becoming more frequent is usually more urgent than finding the root cause.
Note
Long-tail latency requests are inevitable in distributed systems, but there are always opportunities for optimization: caching, colocation, adjusting timeouts and retry policies, and so on. Monitoring P95 latency and analyzing traces for long-tail issues helps you find such areas for improvement.
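As a rough illustration of the retry part, a hand-rolled policy with exponential back-off (similar in spirit to what Figure 1.13 shows) could look like the following sketch; the attempt count, delays, and exception filter are arbitrary and should fit within the caller’s overall timeout:

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

static class TransientFaultHandling
{
    // A hand-rolled retry with exponential back-off; attempt count and delays
    // are arbitrary and should stay within the caller's overall timeout.
    public static async Task<HttpResponseMessage> GetWithRetriesAsync(HttpClient client, string uri)
    {
        const int maxAttempts = 3;
        for (var attempt = 1; ; attempt++)
        {
            try
            {
                var response = await client.GetAsync(uri);
                if ((int)response.StatusCode < 500 || attempt == maxAttempts)
                    return response;

                response.Dispose(); // transient server error; back off and retry
            }
            catch (HttpRequestException) when (attempt < maxAttempts)
            {
                // Transient network failure; fall through to back off and retry.
            }

            await Task.Delay(TimeSpan.FromMilliseconds(100 * Math.Pow(2, attempt)));
        }
    }
}
```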
Performance issues
Performance problems manifest as latency or throughput degradation beyond usual variations. Assuming you fail fast or rate-limit incoming calls, you might also see an increase in the error rate for the 408, 429, or 503 HTTP status codes.
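As an illustration, assuming an ASP.NET Core service on .NET 7 or later, rate limiting that rejects excess calls with 429 could be configured roughly like this; the policy name and limits are placeholders:

```csharp
using System;
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Http;
using Microsoft.AspNetCore.RateLimiting;
using Microsoft.Extensions.DependencyInjection;

var builder = WebApplication.CreateBuilder(args);

// Reject excess calls with 429 instead of letting queues and latency grow unbounded.
builder.Services.AddRateLimiter(options =>
{
    options.RejectionStatusCode = StatusCodes.Status429TooManyRequests;
    options.AddFixedWindowLimiter("per-instance", limiterOptions =>
    {
        limiterOptions.PermitLimit = 100;                 // placeholder limit
        limiterOptions.Window = TimeSpan.FromSeconds(1);  // placeholder window
    });
});

var app = builder.Build();
app.UseRateLimiter();
app.MapGet("/memes", () => "ok").RequireRateLimiting("per-instance");
app.Run();
```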
Such performance issues can start as a slight decrease in dependency availability, causing a service to retry. With outgoing requests taking more resources than usual, other operations slow down, and the time to process client requests grows, along with the number of active requests and connections.
It could be challenging to understand what happened first; you might see high CPU usage and a relatively high GC rate – all symptoms you would usually see on an overloaded system, but nothing that stands out. Assuming you measure the dependency throughput and error rate, you could see the anomaly there, but it might be difficult to tell whether it’s a cause or effect.
Individual distributed traces are rarely useful in such cases – each operation takes longer, and there are more transient errors, but traces may look normal otherwise.
Here’s a list of trivial things to check first, and they serve as a foundation for more advanced analysis:
- Is there an active deployment or a recent feature rollout? You can find out whether the problem is specific to instances running a new version of the code using the service.version resource attribute (see the sketch after this list). If you include feature flags on your traces or events, you can query them to check whether the degradation is limited to (or started from) requests with a new feature enabled.
- Are issues specific to a certain API, code path, or combination of attributes? Some backends, such as Honeycomb, automate this analysis, finding attributes that correspond to higher latency or error rates.
- Are all instances affected? How many instances are alive? Attribute-based analysis is helpful here too.
- Are your dependencies healthy? If you can, check their server-side telemetry and see whether they experience problems with other services, not just yours. Attribute analysis can help here as well – assuming just one of your cloud storage accounts or database partitions is misbehaving, you will see it.
- Did the load increase sharply prior to the incident? Or, if your service is auto-scaled, is the auto-scaler functioning properly, and are you able to catch up with the load?
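For the first check to work, the service.version resource attribute has to be set on your telemetry. A minimal sketch with the OpenTelemetry .NET SDK follows; the service name, version, source, and exporter are placeholders and require the corresponding NuGet packages:

```csharp
using OpenTelemetry;
using OpenTelemetry.Resources;
using OpenTelemetry.Trace;

// Stamp service.name and service.version on all spans so you can compare
// instances running the old and new builds. Names and version are placeholders.
using var tracerProvider = Sdk.CreateTracerProviderBuilder()
    .ConfigureResource(resource => resource.AddService(
        serviceName: "frontend",
        serviceVersion: "1.4.2"))
    .AddSource("MemeService.Frontend")
    .AddOtlpExporter()
    .Build();
```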
There are more questions to ask about infrastructure, the cloud provider, and other aspects. The point of this exercise is to narrow down and understand the problem as much as possible. If the problem is not in your code, investigation helps to find a better way to handle problems like these in the future and gives you an opportunity to fill the gaps in your telemetry, so next time something similar happens, you can identify it faster.
If you suspect a problem in your code, .NET provides a set of signals and tools to help you investigate high CPU usage, memory leaks, deadlocks, and thread pool starvation, and to profile code, as we’ll see in Chapter 4, Low-Level Performance Analysis with Diagnostic Tools.