Modern Distributed Tracing in .NET

Observability Needs of Modern Applications

With the increasing complexity of distributed systems, we need better tools to build and operate our applications. Distributed tracing is one such technique that allows you to collect structured and correlated telemetry with minimum effort and enables observability vendors to build powerful analytics and automation.

In this chapter, we’ll explore common observability challenges and see how distributed tracing brings observability to our systems where logs and counters can’t. We’ll see how correlation and causation along with structured and consistent telemetry help answer arbitrary questions about the system and mitigate issues faster.

Here’s what you will learn:

An overview of monitoring techniques using counters, logs, and events
Core concepts of distributed tracing – the span and its structure
Context propagation standards
How to generate meaningful and consistent telemetry
How to use distributed tracing along with metrics and logs for performance analysis and debugging

By the end of this chapter, you will become familiar with the core concepts and building blocks of distributed tracing, which you will be able to use along with other telemetry signals to debug functional issues and investigate performance issues in distributed applications.

Understanding why logs and counters are not enough

Monitoring and observability cultures vary across the industry; some teams use ad hoc debugging with printf while others employ sophisticated observability solutions and automation. Still, almost every system uses a combination of common telemetry signals: logs, events, metrics or counters, and profiles. Telemetry collection alone is not enough. A system is observable if we can detect and investigate issues, and to achieve this, we need tools to store, index, visualize, and query the telemetry, navigate across different signals, and automate repetitive analysis.

Before we begin exploring tracing and discovering how it helps, let’s talk about other telemetry signals and their limitations.

Logs

A log is a record of some event. Logs typically have a timestamp, level, class name, and formatted message, and may also have a property bag with additional context.

Logs are a low-ceremony tool, with plenty of logging libraries and tools for any ecosystem.

Common problems with logging include the following:

Verbosity: Initially, we won’t have enough logs, but eventually, as we fill gaps, we will have too many. They become hard to read and expensive to store.
Performance: Logging is a common performance issue even when used wisely. It’s also very common to serialize objects or allocate strings for logging even when the logging level is disabled.

One new log statement can take your production down; I did it once. The log I added was written every millisecond. Multiplied by a number of service instances, it created an I/O bottleneck big enough to significantly increase latency and the error rate for users.

Not queryable: Logs coming from applications are intended for humans. We can add context and unify the format within our application and still only be able to filter logs by context properties. Logs change with every refactoring, disappear, or become out of date. New people joining a team need to learn logging semantics specific to a system, and the learning curve can be steep.
No correlation: Logs for different operations are interleaved. The process of finding logs describing certain operations is called correlation. In general, log correlation, especially across services, must be implemented manually (spoiler: not in ASP.NET Core).

Note

Logs are easy to produce but are verbose, and then can significantly impact performance. They are also difficult to filter, query, or visualize.

To be accessible and useful, logs are sent to some central place, a log management system, which stores, parses, and indexes them so they can be queried. This implies that your logs need to have at least some structure.

ILogger in .NET supports structured logging, as we’ll see in Chapter 8, Writing Structured and Correlated Logs, so you get the human-readable message, along with the context. Structured logging, combined with structured storage and indexing, converts your logs into rich events that you can use for almost anything.

Events

An event is a structured record of something. It has a timestamp and a property bag. It may have a name, or that could just be one of the properties.

The difference between logs and events is semantical – an event is structured and usually follows a specific schema.

For example, an event that describes adding an item to a shopping bag should have a well-known name, such as shopping_bag_add_item with user-id and item-id properties. Then, you can query them by name, item, and user. For example, you can find the top 10 popular items across all users.

If you write it as a log message, you’d probably write something like this:

logger.LogInformation("Added '{item-id}' to shopping bag
  for '{user-id}'", itemId, userId)

If your logging provider captures individual properties, you would get the same context as with events. So, now we can find every log for this user and item, which probably includes other logs not related to adding an item.

Note

Events with consistent schema can be queried efficiently but have the same verbosity and performance problems as logs.

Metrics and counters

Logs and events share the same problem – verbosity and performance overhead. One way to solve them is aggregation.

A metric is a value of something aggregated by dimensions and over a period of time. For example, a request latency metric can have an HTTP route, status code, method, service name, and instance dimensions.

Common problems with metrics include the following:

Cardinality: Each combination of dimensions is a time series, and aggregation happens within one time series. Adding a new dimension causes a combinatorial explosion, so metrics must have low cardinality – that is, they cannot have too many dimensions, and each one must have a small number of distinct values. As a result, you can’t measure granular things such as per-user experience with metrics.
No causation: Metrics only show correlation and no cause and effect, so they are not a great tool to investigate issues.

As an expert on your system, you might use your intuition to come up with possible reasons for certain types of behavior and then use metrics to confirm your hypothesis.

Verbosity: Metrics have problems with verbosity too. It’s common to add metrics that measure just one thing, such as queue_is_full or queue_is_empty. Something such as queue_utilization would be more generic. Over time, the number of metrics grows along with the number of alerts, dashboards, and team processes relying on them.

Note

Metrics have low impact on performance, low volume that doesn’t grow much with scale, low storage costs, and low query time. They are great for dashboards and alerts but not for issue investigation or granular analytics.

A counter is a single time series – it’s a metric without dimensions, typically used to collect resource utilization such as CPU load or memory usage. Counters don’t work well for application performance or usage, as you need a dedicated counter per each combination of attributes, such as HTTP route, status code, and method. It is difficult to collect and even harder to use. Luckily, .NET supports metrics with dimensions, and we will discuss them in Chapter 7, Adding Custom Metrics.

What’s missing?

Now you know all you need to monitor a monolith or small distributed system – use metrics for system health analysis and alerts, events for usage, and logs for debugging. This approach has taken the tech industry far, and there is nothing essentially wrong with it.

With up-to-date documentation, a few key performance and usage metrics, concise, structured, correlated, and consistent events, common conventions, and tools across all services, anyone operating your system can do performance analysis and debug issues.

Note

So, the ultimate goal is to efficiently operate a system, and the problem is not a specific telemetry signal or its limitations but a lack of standard solutions and practices, correlation, and structure for existing signals.

Before we jump into distributed tracing and see how its ecosystem addresses these gaps, let’s summarize the new requirements we have for the perfect observability solution we intend to solve with tracing and the new capabilities it brings. Also, we should keep in mind the old capabilities – low-performance overhead and manageable costs.

Systematic debugging

We need to be able to investigate issues in a generic way. From an error report to an alert on a metric, we should be able to drill down into the issue, follow specific requests end to end, or bubble up from an error deep in the stack to understand its effect on users.

All this should be reasonably easy to do when you’re on call and paged at 2AM to resolve an incident in production.

Answering ad hoc questions

I might want to understand whether users from Redmond, WA, who purchased a product from my website are experiencing longer delivery times than usual and why – because of the shipment company, rain, cloud provider issues in this region, or anything else.

It should not be required to add more telemetry to answer most of the usage or performance questions. Occasionally, you’d need to add a new context property or an event, but it should be rare on a stable code path.

Self-documenting systems

Modern systems are dynamic – with continuous deployments, feature flag changes in runtime, and dozens of external dependencies with their own instabilities, nobody can know everything.

Telemetry becomes your single source of truth. Assuming it has enough context and common semantics, an observability vendor should be able to visualize it reasonably well.

Auto-instrumentation

It’s difficult to instrument everything in your system – it’s repetitive, error-prone, and hard to keep up to date, test, and enforce common schema and semantics. We need shared instrumentations for common libraries, while we would only add application-specific telemetry and context.

With an understanding of these requirements, we will move on to distributed tracing.

Introducing distributed tracing

Distributed tracing is a technique that brings structure, correlation and causation to collected telemetry. It defines a special event called span and specifies causal relationships between spans. Spans follow common conventions that are used to visualize and analyze traces.

Span

A span describes an operation such as an incoming or outgoing HTTP request, a database call, an expensive I/O call, or any other interesting call. It has just enough structure to represent anything and still be useful. Here are the most important span properties:

The span’s name should describe the operation type in human-readable format, have low cardinality, and be human-readable.
The span’s start time and duration.
The status indicates success, failure, or no status.
The span kind distinguishes the client, server, and internal calls, or the producer and consumer for async scenarios.
Attributes (also known as tags or annotations) describe specific operations.
Span context identifies spans and is propagated everywhere, enabling correlation. A parent span identifier is also included on child spans for causation.
Events provide additional information about operations within a span.
Links connect traces and spans when parent-child relationships don’t work – for example, for batching scenarios.

Note

In .NET, the tracing span is represented by System.Diagnostics.Activity. The System.Span class is not related to distributed tracing.

Relationships between spans

A span is a unit of tracing, and to trace more complex operations, we need multiple spans.

For example, a user may attempt to get an image and send a request to the service. The image is not cached, and the service requests it from the cold storage (as shown in Figure 1.1):

Figure 1.1 – A GET image request flow

To make this operation debuggable, we should report multiple spans:

The incoming request
The attempt to get the image from the cache
Image retrieval from the cold storage
Caching the image

These spans form a trace – a set of related spans fully describing a logical end-to-end operation sharing the same trace-id. Within the trace, each span is identified by span-id. Spans include a pointer to a parent span – it’s just their parent’s span-id.

trace-id, span-id, and parent-span-id allow us to not only correlate spans but also record relationships between them. For example, in Figure 1.2, we can see that Redis GET, SETEX, and HTTP GET spans are siblings and the incoming request is their parent:

Figure 1.2 – Trace visualization showing relationships between spans

Spans can have more complicated relationships, which we’ll talk about later in Chapter 6, Tracing Your Code.

Span context (aka trace-id and span-id) enables even more interesting cross-signal scenarios. For example, you can stamp parent span context on logs (spoiler: just configure ILogger to do it) and you can correlate logs to traces. For example, if you use ConsoleProvider, you will see something like this:

Figure 1.3 – Logs include span context and can be correlated to other signals

You could also link metrics to traces using exemplars – metric metadata containing the trace context of operations that contributed to a recorded measurement. For instance, you can check examples of spans that correspond to the long tail of your latency distribution.

Attributes

Span attributes are a property bag that contains details about the operation.

Span attributes should describe this specific operation well enough to understand what happened. OpenTelemetry semantic conventions specify attributes for popular technologies to help with this, which we’ll talk about in the Ensuring consistency and structure section later in this chapter.

For example, an incoming HTTP request is identified with at least the following attributes: the HTTP method, path, query, API route, and status code:

Figure 1.4 – The HTTP server span attributes

Instrumentation points

So, we have defined a span and its properties, but when should we create spans? Which attributes should we put on them? While there is no strict standard to follow, here’s the rule of thumb:

Create a new span for every incoming and outgoing network call and use standard attributes for the protocol or technology whenever available.

This is what we’ve done previously with the memes example, and it allows us to see what happened on the service boundaries and detect common problems: dependency issues, status, latency, and errors on each service. This also allows us to correlate logs, events, and anything else we collect. Plus, observability backends are aware of HTTP semantics and will know how to interpret and visualize your spans.

There are exceptions to this rule, such as socket calls, where requests could be too small to be instrumented. In other cases, you might still be rightfully concerned with verbosity and the volume of generated data – we’ll see how to control it with sampling in Chapter 5, Configuration and Control Plane.

Tracing – building blocks

Now that you are familiar with the core concepts of tracing and its methodology, let’s talk about implementation. We need a set of convenient APIs to create and enrich spans and pass context around. Historically, every Application Performance Monitoring (APM) tool had its own SDKs to collect telemetry with their own APIs. Changing the APM vendor meant rewriting all your instrumentation code.

OpenTelemetry solves this problem – it’s a cross-language telemetry platform for tracing, metrics, events, and logs that unifies telemetry collection. Most of the APM tools, log management, and observability backends support OpenTelemetry, so you can change vendors without rewriting any instrumentation code.

.NET tracing implementation conforms to the OpenTelemetry API specification, and in this book, .NET tracing APIs and OpenTelemetry APIs are used interchangeably. We’ll talk about the difference between them in Chapter 6, Tracing Your Code.

Even though OpenTelemetry primitives are baked into .NET and the instrumentation code does not depend on them, to collect telemetry from the application, we still need to add the OpenTelemetry SDK, which has everything we need to configure a collection and an exporter. You might as well write your own solution compatible with .NET tracing APIs.

OpenTelemetry became an industry standard for tracing and beyond; it’s available in multiple languages, and in addition to a unified collection of APIs it provides configurable SDKs and a standard wire format for the telemetry – OpenTelemetry protocol (OTLP). You can send telemetry to any compatible vendor, either by adding a specific exporter or, if the backend supports OTLP, by configuring the vendor’s endpoint.

As shown in Figure 1.5, the application configures the OpenTelemetry SDK to export telemetry to the observability backend. Application code, .NET libraries, and various instrumentations use .NET tracing APIs to create spans, which the OpenTelemetry SDK listens to, processes, and forwards to an exporter.

Figure 1.5 – Tracing building blocks

So, OpenTelemetry decouples instrumentation code from the observability vendor, but it does much more than that. Now, different applications can share instrumentation libraries and observability vendors have unified and structured telemetry on top of which they can build rich experiences.

Instrumentation

Historically, all APM vendors had to instrument popular libraries: HTTP clients, web frameworks, Entity Framework, SQL clients, Redis client libraries, RabbitMQ, cloud providers’ SDKs, and so on. That did not scale well. But with .NET tracing APIs and OpenTelemetry semantics, instrumentation became common for all vendors. You can find a growing list of shared community instrumentations in the OpenTelemetry Contrib repo: https://github.com/open-telemetry/opentelemetry-dotnet-contrib.

Moreover, since OpenTelemetry is a vendor-neutral standard and baked into .NET, it’s now possible for libraries to implement native instrumentation – HTTP and gRPC clients, ASP.NET Core, and several other libraries support it.

Even with native tracing support, it’s off by default – you need to install and register specific instrumentation (which we’ll cover in Chapter 2, Native Monitoring in .NET). Otherwise, tracing code does nothing and, thus, does not add any performance overhead.

Backends

The observability backend (aka monitoring, APM tool, and log management system) is a set of tools responsible for ingestion, storage, indexing, visualization, querying, and probably other things that help you monitor your system, investigate issues, and analyze performance.

Observability vendors build these tools and provide rich user experiences to help you use traces along with other signals.

Collecting traces for common libraries became easy with the OpenTelemetry ecosystem. As you’ll see in Chapter 2, Native Monitoring in .NET, most of it can be done automatically with just a few lines of code at startup. But how do we use them?

While you can send spans to stdout and store them on the filesystem, this would not leverage all tracing benefits. Traces can be huge, but even when they are small, grepping them is not convenient.

Tracing visualizations (such as a Gantt chart, trace viewer, or trace timeline) is one of the common features tracing providers have. Figure 1.6 shows a trace timeline in Jaeger – an open source distributed tracing platform:

Figure 1.6 – Trace visualization in Jaeger with errors marked with exclamation point

While it may take a while to find an error log, the visualization shows what’s important – where failures are, latency, and a sequence of steps. As we can see in Figure 1.6, the frontend call failed because of failure on the storage side, which we can further drill into.

However, we can also see that the frontend made four consecutive calls into storage, which potentially could be done in parallel to speed things up.

Another common feature is filtering or querying by any of the span properties such as name, trace-id, span-id, parent-id, name, attribute name, status, timestamp, duration, or anything else. An example of such a query is shown in Figure 1.7:

Figure 1.7 – A custom Azure Monitor query that calculates the Redis hit rate

For example, we don’t report a metric for the cache hit rate, but we can estimate it from traces. While they’re not precise because of sampling and might be more expensive to query than metrics, we can still do it ad hoc, especially when we investigate specific failures.

Since traces, metrics, and logs are correlated, you will fully leverage observability capabilities if your vendor supports multiple signals or integrates well with other tools.

Reviewing context propagation

Correlation and causation are the foundation of distributed tracing. We’ve just covered how related spans share the same trace-id and have a pointer to the parent recorded in parent-span-id, forming a casual chain of operations. Now, let’s explore how it works in practice.

In-process propagation

Even within a single service, we usually have nested spans. For example, if we trace a request to a REST service that just reads an item from a database, we’d want to see at least two spans – one for an incoming HTTP request and another for a database query. To correlate them properly, we need to pass span context from ASP.NET Core to the database driver.

One option is to pass context explicitly as a function argument. It’s a viable solution in Go, where explicit context propagation is a standard, but in .NET, it would make onboarding onto distributed tracing difficult and would ruin the auto-instrumentation magic.

.NET Activity (aka the span) is propagated implicitly. Current activity can always be accessed via the Activity.Current property, backed up by System.Threading.AsyncLocal<T>.

Using our previous example of a service reading from the database, ASP.NET Core creates an Activity for the incoming request, and it becomes current for anything that happens within the scope of this request. Instrumentation for the database driver creates another one that uses Activity.Current as its parent, without knowing anything about ASP.NET Core and without the user application passing the Activity around. The logging framework would stamp trace-id and span-id from Activity.Current, if configured to do so.

It works for sync or async code, but if you process items in the background using in-memory queues or manipulate with threads explicitly, you would have to help runtime and propagate activities explicitly. We’ll talk more about it in Chapter 6, Tracing Your Code.

Out-of-process propagation

In-process correlation is awesome, and for monolith applications, it would be almost sufficient. But in the microservice world, we need to trace requests end to end and, therefore, propagate context over the wire, and here’s where standards come into play.

You can find multiple practices in this space – every complex system used to support something custom, such as x-correlation-id or x-request-id. You can find x-cloud-trace-context or grpc-trace-bin in old Google systems, X-Amzn-Trace-Id on AWS, and Request-Id variations and ms-cv in the Microsoft ecosystem. Assuming your system is heterogeneous and uses a variety of cloud providers and tracing tools, correlation is difficult.

Trace context (which you can explore in more detail at https://www.w3.org/TR/trace-context) is a relatively new standard, converting context propagation over HTTP, but it’s widely adopted and used by default in OpenTelemetry and .NET.

W3C Trace Context

The trace context standard defines traceparent and tracestate HTTP headers and the format to populate context on them.

The traceparent header

The traceparent is an HTTP request header that carries the protocol version, trace-id, parent-id, and trace-flags in the following format:

traceparent: {version}-{trace-id}-{parent-id}-{trace-flags}

version: The protocol version – only 00 is defined at the moment.
trace-id: The logical end-to-end operation ID.
parent-id: Identifies the client span and serves as a parent for the corresponding server span.
trace-flags: Represents the sampling decision (which we’ll talk about in Chapter 5, Configuration and Control Plane). For now, we can determine that 00 indicates that the parent span was sampled out and 01 means it was sampled in.

All identifiers must be present – that is, traceparent has a fixed length and is easy to parse. Figure 1.8 shows an example of context propagation with the traceparent header:

Figure 1.8 – traceparent is populated from the outgoing span context and becomes a parent for the incoming span

Note

The protocol does not require creating spans and does not specify instrumentation points. Common practice is to create spans per outgoing and incoming requests, and put client span context into request headers.

The tracestate header

The tracestate is another request header, which carries additional context for the tracing tool to use. It’s designed for OpenTelemetry or an APM tool to carry additional control information and not for application-specific context (covered in detail later in the Baggage section).

The tracestate consists of a list of key-value pairs, serialized to a string with the following format: "vendor1=value1,vendor2=value2".

The tracestate can be used to propagate incompatible legacy correlation IDs, or some additional identifiers vendor needs.

OpenTelemetry, for example, uses it to carry a sampling probability and score. For example, tracestate: "ot=r:3;p:2" represents a key-value pair, where the key is ot (OpenTelemetry tag) and the value is r:3;p:2.

The tracestate header has a soft limitation on size (512 characters) and can be truncated.

The traceresponse (draft) header

Unlike traceparent and tracestate, traceresponse is a response header. At the time of writing, it’s defined in W3C Trace-Context Level 2 (https://www.w3.org/TR/trace-context-2/) and has reached W3C Editor’s Draft status. There is no support for it in .NET or OpenTelemetry.

traceresponse is very similar to traceparent. It has the same format, but instead of client-side identifiers, it returns the trace-id and span-id values of the server span:

traceresponse: 00-{trace-id}-{child-id}-{trace-flags}

traceresponse is optional in the sense that the server does not need to return it, even if it supports W3C Trace-Context Level 2. It’s useful to return traceresponse when the client did not pass a valid traceparent, but can log traceresponse.

External-facing services may decide to start a new trace, because they don’t trust the caller’s trace-id generation algorithm. Uniform random distribution is one concern; another reason could be a special trace-id format. If the service restarts a trace, it’s a good idea to return the traceresponse header to caller.

B3

The B3 specification (https://github.com/openzipkin/b3-propagation) was adopted by Zipkin – one of the first distributed tracing systems.

B3 identifiers can be propagated as a single b3 header in the following format:

b3: {trace-id}-{span-id}-{sampling-state}-{parent-span-id}

Another way is to pass individual components, using X-B3-TraceId, X-B3-SpanId, X-B3-ParentSpanId, and X-B3-Sampled.

The sampling state suggests whether a service should trace the corresponding request. In addition to 0 (don’t record) and 1 (do record), it allows us to force tracing with a flag set to d. It’s usually done for debugging purposes. The sampling state can be passed without other identifiers to specify the desired sampling decision to the service.

Note

The key difference with W3C Trace-Context, beyond header names, is the presence of both span-id and parent-span-id. B3 systems can use the same span-id on the client and server sides, creating a single span for both.

Zipkin reuses span-id from the incoming request, also specifying parent-span-id on it. The Zipkin span represents the client and server at the same time, as shown in Figure 1.9, recording different durations and statuses for them:

Figure 1.9 – Zipkin creates one span to represent the client and server

OpenTelemetry and .NET support b3 headers but ignore parent-span-id – they generate a new span-id for every span, as it’s not possible to reuse span-id (see Figure 1.10).

Figure 1.10 – OpenTelemetry does not use parent-span-id from B3 headers and creates different spans for the client and server

Baggage

So far, we have talked about span context and correlation. But in many cases, distributed systems have application-specific context. For example, you authorize users on your frontend service, and after that, user-id is not needed for application logic, but you still want to add it as an attribute on spans from all services to query and aggregate it on a per-user basis.

You can stamp user-id once on the frontend. Then, spans recorded on the backend will not have user-id, but they will share the same trace-id as the frontend. So, with some joins in your queries, you can still do per-user analysis. It works to some extent but may be expensive or slow, so you might decide to propagate user-id and stamp it on the backend spans too.

The baggage (https://www.w3.org/TR/baggage/) defines a generic propagation format for distributed context, and you can use it for business logic or anything else by adding, reading, removing, and modifying baggage members. For example, you can route requests to the test environment and pass feature flags or extra telemetry context.

Baggage consists of a list of semicolon-separated members. Each member has a key, value, and optional properties in the following formats – key=value;property1;key2=property2 or key=value;property1;key2=property2,anotherKey=anotherValue.

OpenTelemetry and .NET only propagate baggage, but don’t stamp it on any telemetry. You can configure ILogger to stamp baggage and need to enrich traces explicitly. We’ll see how it works in Chapter 5, Configuration and Control Plane.

Tip

You should not put any sensitive information in baggage, as it’s almost impossible to guarantee where it would flow – your application or sidecar infrastructure can forward it to your cloud provider or anywhere else.

Maintain a list of well-known baggage keys across your system and only use known ones, as you might receive baggage from another system otherwise.

Baggage specification has a working draft status and may still change.

Note

While the W3C Trace Context standard is HTTP-specific and B3 applies to any RPC calls, they are commonly used for any context propagation needs – for example, they are passed as the event payload in messaging scenarios. This may change once protocol-specific standards are introduced.

Ensuring consistency and structure

As we already defined, spans are structured events describing interesting operations.

A span’s start time, duration, status, kind, and context are strongly typed – they enable correlation and causation, allowing us to visualize traces and detect failures or latency issues.

The span’s name and attributes describe an operation but are not strongly typed or strictly defined. If we don’t populate them in a meaningful way, we can detect an issue but have no knowledge of what actually happened.

For example, for client HTTP calls, beyond generic properties, we want to capture at least the URL, method, and response code (or exception) – if we don’t know any of these, we’re blind. Once we populate them, we can start doing some powerful analysis with queries over such spans to answer the following common questions:

Which dependency calls were made in the scope of this request? Which of them failed? What was the latency of each of them?
Does my application make independent dependency calls in parallel or sequentially? Does it make any unnecessary requests when they can be done lazily?
Are dependency endpoints configured correctly?
What are the success or error rates and latency per dependency API?

Note

This analysis relies on an application using the same attributes for all HTTP dependencies. Otherwise, the operator that performs the queries will have a hard time writing and maintaining them.

With unified and community-driven telemetry collection taken off the observability vendor’s plate, they can now fully focus on (semi-)automating analysis and giving us powerful performance and fault analysis tools.

OpenTelemetry defines a set of semantic conventions for spans, traces, and resources, which we’ll talk more about in Chapter 9, Best Practices.

Building application topology

Distributed tracing, combined with semantic conventions, allows us to build visualizations such as an application map (aka service map), as shown in Figure 1.11 – you could see your whole system along with key health metrics. It’s an entry point to any investigation.

Figure 1.11 – An Azure Monitor application map for a meme service is an up-to-date system diagram with all the basic health metrics

Observability vendors depend on trace and metrics semantics to build service maps. For example, the presence of HTTP attributes on the client span represents an outgoing HTTP call, and we need to show the outgoing arrow to a new dependency node. We should name this node based on the span’s host attribute.

If we see the corresponding server span, we can now merge the server node with the dependency node, based on span context and causation. There are other visualizations or automation tools that you might find useful – for example, critical path analysis, or finding common attributes that correspond to higher latency or error rates. Each of these relies on span properties and attributes following common semantics or at least being consistent across services.

Resource attributes

Resource attributes describe the process, host, service, and environment, and are the same for all spans reported by the service instance – for example, the service name, version, unique service instance ID, cloud provider account ID, region, availability zone, and K8s metadata.

These attributes allow us to detect anomalies specific to certain environments or instances – for example, an error rate increase only on instances that have a new version of code, an instance that goes into a restart loop, or a cloud service in a region and availability zone that experiences issues.

Based on standard attributes, observability vendors can write generic queries to perform this analysis or build common dashboards. It also enables the community to create vendor-agnostic tools and solutions for popular technologies.

Such attributes describe a service instance and don’t have to appear on every span – OTLP, for example, passes resource attributes once per batch of spans.

Performance analysis overview

Now that you know the core concepts around distributed tracing, let’s see how we can use the observability stack to investigate common distributed system problems.

The baseline

Before we talk about problems, let’s establish a baseline representing the behavior of a healthy system. We also need it to make data-driven decisions to help with common design and development tasks such as the following:

Risk estimation: Any feature work on the hot path is a good candidate for additional performance testing prior to release and guarding new code with feature flags.
Capacity planning: Knowing the current load is necessary to understand whether a system can handle planned growth and new features.
Understand improvement potential: It makes more sense to optimize frequently executed code, as even small optimizations bring significant performance gains or cost reductions. Similarly, improving reliability brings the most benefits for components that have a higher error rate and that are used by other services.
Learning usage patterns: Depending on how users interact with your system, you might change your scaling or caching strategy, extract specific functionality to a new service, or merge services.

Generic indicators that describe the performance of each service include the following:

Latency: How fast a service responds
Throughput: How many requests, events, or bytes the service is handling per second
Error rate: How many errors a service returns

Your system might need other indicators to measure durability or data correctness.

Each of these signals is useful when it includes an API route, a status code, and other context properties. For example, the error rate could be low overall but high for specific users or API routes.

Measuring signals on the server and client sides, whenever possible, gives you a better picture. For example, you can detect network failures and avoid “it works on my machine” situations when clients see issues and servers don’t.

Investigating performance issues

Let’s divide performance issues into two overlapping categories:

Widespread issues that affect a whole instance, server, or even the system, and move the distribution median.
An individual request or job that takes too much time to complete. If we visualize the latency distribution, as shown in Figure 1.12, we’ll see such issues in the long tail of distribution – they are rare, but part of normal behavior.

Figure 1.12 – Azure Monitor latency distribution visualization, with a median request (the 50th percentile) taking around 80 ms and the 95th percentile around 250 ms

Long tails

Individual issues can be caused by an unfortunate chain of events – transient network issues, high contention in optimistic concurrency algorithms, hardware failures, and so on.

Distributed tracing is an excellent tool to investigate such issues. If you have a bug report, you might have a trace context for a problematic operation. To achieve it, make sure you show the traceparent value on the web page and return traceresponse or a document that users need to record, or log traceresponse when sending requests to your service.

So, if you know the trace context, you can start by checking the trace view. For example, in Figure 1.13, you can see an example of a long request caused by transient network issues.

Figure 1.13 – A request with high latency caused by transient network issues and retries

The frontend request took about 2.6 seconds and the time was spent on the storage service downloading meme content. We see three tries of Azure.Core.Http.Request, each of which was fast, and the time between them corresponds to the back-off interval. The last try was successful.

If you don’t have trace-id, or perhaps if the trace was sampled out, you might be able to filter similar operations based on the context and high latency.

For example, in Jaeger, you can filter spans based on the service, span name, attributes, and duration, which helps you to find a needle in a haystack.

In some cases, you will end up with mysterious gaps – the service was up and running but spent significant time doing nothing, as shown in Figure 1.14:

Figure 1.14 – A request with high latency and gaps in spans

If you don’t get enough data from traces, check whether there are any logs available in the scope of this span.

You might also check resource utilization metrics – was there a CPU spike, or maybe a garbage collection pause at this moment? You might find some correlation using timestamps and context, but it’s impossible to tell whether this was a root cause or a coincidence.

If you have a continuous profiler that correlates profiles to traces (yes, they can do it with Activity.Current), you can check whether there are profiles available for this or similar operations.

We’ll see how to investigate this further with .NET diagnostics tools in Chapter 4, Low-Level Performance Analysis with Diagnostic Tools, but if you’re curious about what happened in Figure 1.14, the service read a network stream that was not instrumented.

Even though we talk about individual performance issues, in many cases we don’t know how widespread they are, especially when we’re at the beginning of an incident. Metrics and rich queries across traces can be used to find out how common a problem is. If you’re on call, checking whether an issue is widespread or becoming more frequent is usually more urgent than finding the root cause.

Note

Long-tail latency requests are inevitable in distributed systems, but there are always opportunities for optimization, with caching, collocation, adjusting timeouts and the retry policy, and so on. Monitoring P95 latency and analyzing traces for long-tail issues helps you find such areas for improvement.

Performance issues

Performance problems manifest as latency or throughput degradation beyond usual variations. Assuming you fail fast or rate-limit incoming calls, you might also see an increase in the error rate for 408, 429, or 503 HTTP status codes.

Such issues can start as a slight decrease in dependency availability, causing a service to retry. With outgoing requests taking more resources than usual, other operations slow down, and the time to process client requests grows, along with number of active requests and connections.

It could be challenging to understand what happened first; you might see high CPU usage and a relatively high GC rate – all symptoms you would usually see on an overloaded system, but nothing that stands out. Assuming you measure the dependency throughput and error rate, you could see the anomaly there, but it might be difficult to tell whether it’s a cause or effect.

Individual distributed traces are rarely useful in such cases – each operation takes longer, and there are more transient errors, but traces may look normal otherwise.

Here’s a list of trivial things to check first, and they serve as a foundation for more advanced analysis:

Is there an active deployment or a recent feature rollout? You can find out whether a problem is specific to instances running a new version of code using a service.version resource attribute. If you include feature flags on your traces or events, you can query them to check whether degradation is limited to (or started from) the requests with a new feature enabled.
Are issues specific to a certain API, code path, or combination of attributes? Some backends, such as Honeycomb, automate this analysis, finding attributes corresponding to a higher latency or error rate.
Are all instances affected? How many instances are alive? Attribute-based analysis is helpful here too.
Are your dependencies healthy? If you can, check their server-side telemetry and see whether they experience problems with other services, not just yours.

Attribute analysis can help here as well – assuming just one of your cloud storage accounts or database partitions is misbehaving, you will see it.

Did the load increase sharply prior to the incident? Or, if your service is auto-scaled, is the auto-scaler functioning properly, and are you able to catch up with the load?

There are more questions to ask about infrastructure, the cloud provider, and other aspects. The point of this exercise is to narrow down and understand the problem as much as possible. If the problem is not in your code, investigation helps to find a better way to handle problems like these in the future and gives you an opportunity to fill the gaps in your telemetry, so next time something similar happens, you can identify it faster.

If you suspect a problem in your code, .NET provides a set of signals and tools to help investigate high CPU, memory leaks, deadlocks, thread pool starvation, and profile code, as we’ll see in Chapter 4, Low-Level Performance Analysis with Diagnostic Tools.

Summary

Distributed systems need a new approach to observability that simplifies investigating incidents and minimizes the time to resolve issues. This approach should focus on human experience such as data visualization, the correlation across telemetry signals, and analysis automation. It requires structured, correlated telemetry signals that work together and new tools that leverage them to build a rich experience.

Distributed tracing is one such signal – it follows requests through any system and describes service operations with spans, the events representing operations in the system. .NET supports distributed tracing and integrates natively with OpenTelemetry, which is a cross-language platform to collect, process, and export traces, metrics, and logs in a vendor-agnostic way. Most modern vendors are compatible with OpenTelemetry and leverage distributed tracing capabilities. The OpenTelemetry ecosystem includes a diverse set of shared instrumentation libraries that automate common telemetry collection needs.

Distributed tracing enables correlation and causation by propagating context within the process and between services. OpenTelemetry defines standard semantics for common technologies so that vendors can build trace visualizations, application maps, shared dashboards, alerts, or queries that rely on consistent and standard attributes. Trace context and consistent attributes enable correlation between spans, logs, metrics, and any other signals coming from your system.

Individual issues can be efficiently analyzed with distributed tracing and investigations into widespread performance issues rely on attributes and timestamp correlation on metrics and across traces. Observability vendors may automate this analysis.

A combination of metrics, traces, and events gives the right number of details. Metrics allow us to receive unbiased data in a cost-effective way. By querying traces and events over high-cardinality attributes, we can answer ad hoc questions about the system.

In the next chapter, we’ll get hands-on experience with distributed tracing. We’ll build a demo application and explore native tracing capabilities in .NET.

Filter reviews by

All

Amazon verified reviews

Raj Sep 14, 2023

This book serves as a hands-on guide for software developers, architects, and operators eager to harness contemporary observability tools and standards in .NET environments. Liudmila lends her seasoned expertise on distributed tracing across a spectrum of programming languages, offering valuable insights into the effective use of OpenTelemetry—a vendor-neutral telemetry platform—for streamlined telemetry collection.The book is thorough in its treatment of telemetry within .NET applications, concentrating specifically on distributed tracing and performance analysis. It kicks off with an incisive overview of the challenges and solutions in the realm of observability. From there, it explores the native monitoring features available in modern .NET applications, and demonstrates how to employ OpenTelemetry for instrumenting common cloud architecture elements like network calls, messaging, and database interactions. The book confidently addresses both the managerial and technological dimensions essential for integrating and expanding observability in pre-existing infrastructures.The writing is clear and well-structured, complemented by a range of educational materials: code snippets, diagrams, screenshots, as well as practical tips, insightful questions, and recommendations for further reading. Liudmila offers pragmatic advice on selecting the appropriate telemetry signals, controlling operational costs, adhering to best practices, crafting custom conventions, leading organizational change, and thoroughly testing your instrumentation. Her deep-rooted understanding of both .NET and OpenTelemetry shines through, elucidating how these technologies synergize to enable effective observability.I wholeheartedly recommend this book to anyone interested in adopting modern tools and standards for monitoring and debugging .NET applications. It's a well-crafted, comprehensive resource that deftly balances theory and practice. My rating is a solid 5 out of 5 stars. Whether you're a novice or a seasoned developer looking to refresh your understanding of .NET observability, this book has something valuable to offer.

Amazon Verified review

Rohit Jul 31, 2023

As a software engineer navigating the constantly evolving world of distributed systems, I found "Modern Distributed Tracing in .NET" to be a valuable companion. This guide enhances your understanding of observability and distributed tracing. The author's hands-on experience in implementing observability across various SDKs and active contributions to the OpenTelemetry Community are evident, making the book even more credible.The book begins by addressing the challenges presented by modern microservices and distributed systems. It emphasizes the significance of robust observability tools while discussing the limitations of older tracing solutions. Notably, the book offers practical insights into leveraging observability in common cloud scenarios and provides valuable guidance on instrumenting messaging and database calls."If you are working on improving the observability of your applications in distributed environments, "Modern Distributed Tracing in .NET" is undoubtedly a must-have resource to have.

Miguel Angel Teheran Oct 02, 2023

I love this book and there are the reasons:1. This books explain briefly how to work with Logging with .NET and Observability but not taking too much. it moves quickly to practical things and advanced topics.2. The book contains information about OpenTelemetry this is a new standard for logging across different technologies or services in Microservices architecture.3. It explain how to use external tools to read and analyze the logging result in our applications.4. It contains some scenarios where we can use cloud services for monitoring and get metrics for our logs.

i Jul 30, 2023

I got the book to get a primer on the state of the art in tracing, monitoring, and debugging the distributed systems.The book is well written: understandable someone who's not domain expert while not compromising on the depth of the discussions.The examples are fun to follow, I ran a few of them with no issues.Overall, I enjoyed reading it end-to-end, highly recommend.

Johannes Tax Jul 14, 2023

Against what the title suggests, this book isn’t just about Distributed Tracing. Instead, it puts Distributed Tracing into the context of the overall observability ecosystem in .NET, including logging, metrics, and profiling. This holistic view is helpful for pragmatic and result-oriented engineers who want to gain the optimal value of observability in general and Distributed Tracing in particular.There’s a lot in this book: tracing, logging, metrics, profiling; instrumenting network calls, messaging scenarios, databases; OpenTelemetry and semantic conventions. While it is very dense, the connection to real-world scenarios is never lost, and every section offers insights that are useful for .NET programmers facing observability challenges.The first part of the book offers an introduction to Distributed Tracing and its concepts, giving a theoretical foundation and explaining how Distributed Tracing is embedded into the .NET ecosystem. The second part gives lots of valuable practical insights on instrumenting .NET applications and puts traces into a wider context with logs, metrics, and profiling. The third part goes into much detail about particular observability use cases, while the fourth part highlights challenges and strategies for implementing good observability practices in an organization.I consider this book a reference when it comes to the implementation of state-of-the-art observability practices in .NET applications, and I highly recommend it to anybody working with .NET microservices.The author is a thought leader in this space and doesn’t hesitate to reference cutting-edge and still experimental developments in this area. With this in mind, I encourage readers to take this book as a starting point but to look out for developments and improvements in specific areas, as observability practices and standards are constantly evolving. Nevertheless, for .NET developers and enthusiasts, there’s no better start and guide to your observability journey than this book.