Reviewing context propagation
Correlation and causation are the foundation of distributed tracing. We’ve just covered how related spans share the same trace-id and have a pointer to the parent recorded in parent-span-id, forming a causal chain of operations. Now, let’s explore how it works in practice.
In-process propagation
Even within a single service, we usually have nested spans. For example, if we trace a request to a REST service that just reads an item from a database, we’d want to see at least two spans – one for an incoming HTTP request and another for a database query. To correlate them properly, we need to pass span context from ASP.NET Core to the database driver.
One option is to pass context explicitly as a function argument. It’s a viable solution in Go, where explicit context propagation is a standard, but in .NET, it would make onboarding onto distributed tracing difficult and would ruin the auto-instrumentation magic.
.NET Activity (aka the span) is propagated implicitly. The current activity can always be accessed via the Activity.Current property, which is backed by System.Threading.AsyncLocal<T>.
Using our previous example of a service reading from the database, ASP.NET Core creates an Activity for the incoming request, and it becomes current for anything that happens within the scope of this request. Instrumentation for the database driver creates another one that uses Activity.Current as its parent, without knowing anything about ASP.NET Core and without the user application passing the Activity around. The logging framework would stamp trace-id and span-id from Activity.Current, if configured to do so.
It works for sync or async code, but if you process items in the background using in-memory queues or manipulate threads explicitly, you would have to help the runtime and propagate activities explicitly. We’ll talk more about it in Chapter 6, Tracing Your Code.
Out-of-process propagation
In-process correlation is awesome, and for monolith applications, it would be almost sufficient. But in the microservice world, we need to trace requests end to end and, therefore, propagate context over the wire, and here’s where standards come into play.
You can find multiple practices in this space – every complex system used to support something custom, such as x-correlation-id or x-request-id. You can find x-cloud-trace-context or grpc-trace-bin in old Google systems, X-Amzn-Trace-Id on AWS, and Request-Id variations and ms-cv in the Microsoft ecosystem. Assuming your system is heterogeneous and uses a variety of cloud providers and tracing tools, correlation is difficult.
Trace context (which you can explore in more detail at https://www.w3.org/TR/trace-context) is a relatively new standard covering context propagation over HTTP, but it’s widely adopted and used by default in OpenTelemetry and .NET.
W3C Trace Context
The trace context standard defines the traceparent and tracestate HTTP headers and the format used to populate context in them.
The traceparent header
The traceparent is an HTTP request header that carries the protocol version, trace-id, parent-id, and trace-flags in the following format:
traceparent: {version}-{trace-id}-{parent-id}-{trace-flags}
version: The protocol version – only 00 is defined at the moment.
trace-id: The logical end-to-end operation ID.
parent-id: Identifies the client span and serves as a parent for the corresponding server span.
trace-flags: Represents the sampling decision (which we’ll talk about in Chapter 5, Configuration and Control Plane). For now, we can determine that 00 indicates that the parent span was sampled out and 01 means it was sampled in.
All identifiers must be present – that is, traceparent has a fixed length and is easy to parse. Figure 1.8 shows an example of context propagation with the traceparent header:
Figure 1.8 – traceparent is populated from the outgoing span context and becomes a parent for the incoming span
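Because the traceparent layout is fixed, parsing it takes only a few lines. The following is a minimal sketch in Python (chosen only because the wire format is language-neutral), not the parser used by .NET or OpenTelemetry. The names parse_traceparent and SpanContext are hypothetical, the sketch assumes only version 00 needs to be understood, and it rejects all-zero identifiers, which the W3C specification treats as invalid.

```python
import re
from typing import NamedTuple, Optional

# Version 00 layout: 2 hex chars, 32 hex chars, 16 hex chars, 2 hex chars.
_TRACEPARENT_RE = re.compile(
    r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})"
)


class SpanContext(NamedTuple):
    """Hypothetical container for the parsed header fields."""
    version: str
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(value: str) -> Optional[SpanContext]:
    """Return the parsed context, or None if the header is not valid."""
    match = _TRACEPARENT_RE.fullmatch(value.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Assumption: this sketch only understands version 00.
    if version != "00":
        return None
    # All-zero identifiers are invalid.
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    # The lowest bit of trace-flags carries the sampling decision.
    sampled = bool(int(flags, 16) & 0x01)
    return SpanContext(version, trace_id, parent_id, sampled)
```

A server-side instrumentation would typically call such a function on the incoming header and, when it returns None, start a new trace instead of failing the request.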
Note
The protocol does not require creating spans and does not specify instrumentation points. Common practice is to create spans per outgoing and incoming requests, and put client span context into request headers.
The tracestate header
The tracestate is another request header, which carries additional context for the tracing tool to use. It’s designed for OpenTelemetry or an APM tool to carry additional control information and not for application-specific context (covered in detail later in the Baggage section).
The tracestate consists of a list of key-value pairs, serialized to a string with the following format: "vendor1=value1,vendor2=value2".
The tracestate can be used to propagate incompatible legacy correlation IDs, or any additional identifiers that a vendor needs.
OpenTelemetry, for example, uses it to carry a sampling probability and score. For example, tracestate: "ot=r:3;p:2" represents a key-value pair, where the key is ot (OpenTelemetry tag) and the value is r:3;p:2.
The tracestate header has a soft limitation on size (512 characters) and can be truncated.
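To make the tracestate format concrete, here is a small Python sketch that splits the header into key-value pairs, unpacks the OpenTelemetry value shown above, and serializes the list back within the 512-character limit. The function names are hypothetical, and the truncation strategy (dropping entries from the end of the list) is a simplifying assumption rather than the exact procedure from the specification.

```python
from typing import Dict, List, Tuple


def parse_tracestate(header: str) -> List[Tuple[str, str]]:
    """Split 'vendor1=value1,vendor2=value2' into ordered (key, value) pairs."""
    entries = []
    for member in header.split(","):
        key, sep, value = member.strip().partition("=")
        if sep and key:  # skip empty or malformed members
            entries.append((key, value))
    return entries


def parse_ot_value(value: str) -> Dict[str, str]:
    """Unpack an OpenTelemetry value such as 'r:3;p:2' into {'r': '3', 'p': '2'}."""
    return dict(item.split(":", 1) for item in value.split(";") if ":" in item)


def format_tracestate(entries: List[Tuple[str, str]], max_len: int = 512) -> str:
    """Serialize entries, dropping trailing ones until the header fits max_len."""
    kept = list(entries)
    while kept:
        header = ",".join(f"{key}={value}" for key, value in kept)
        if len(header) <= max_len:
            return header
        kept.pop()  # assumption: entries at the end are the least important
    return ""
```

The order of entries is preserved on purpose, since a tracing tool usually cares which vendor touched the header most recently.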
The traceresponse (draft) header
Unlike traceparent and tracestate, traceresponse is a response header. At the time of writing, it’s defined in W3C Trace-Context Level 2 (https://www.w3.org/TR/trace-context-2/) and has reached W3C Editor’s Draft status. There is no support for it in .NET or OpenTelemetry.
traceresponse is very similar to traceparent. It has the same format, but instead of client-side identifiers, it returns the trace-id and span-id values of the server span:
traceresponse: 00-{trace-id}-{child-id}-{trace-flags}
traceresponse is optional in the sense that the server does not need to return it, even if it supports W3C Trace-Context Level 2. Returning traceresponse is useful when the client did not pass a valid traceparent but is able to log the traceresponse value it receives.
External-facing services may decide to start a new trace, because they don’t trust the caller’s trace-id generation algorithm. Uniform random distribution is one concern; another reason could be a special trace-id format. If the service restarts a trace, it’s a good idea to return the traceresponse header to the caller.
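The restart-and-respond behavior described above could look roughly like the following Python sketch. Everything in it is an assumption made for illustration: the function names are hypothetical, the validation of the incoming header is deliberately shallow, and because traceresponse is still a draft, the header layout simply follows the format shown in this section.

```python
import secrets
from typing import Optional, Tuple


def start_server_span(traceparent: Optional[str],
                      trust_caller: bool) -> Tuple[str, str, bool]:
    """Pick identifiers for the server span.

    Returns (trace_id, server_span_id, restarted), where restarted is True
    when the caller's trace-id was missing, malformed, or not trusted.
    """
    parts = traceparent.split("-") if traceparent else []
    # Shallow check only: four fields with the expected id lengths.
    valid = len(parts) == 4 and len(parts[1]) == 32 and len(parts[2]) == 16
    if valid and trust_caller:
        trace_id, restarted = parts[1], False
    else:
        trace_id, restarted = secrets.token_hex(16), True  # 32 hex chars
    server_span_id = secrets.token_hex(8)                   # 16 hex chars
    return trace_id, server_span_id, restarted


def build_traceresponse(trace_id: str, server_span_id: str, sampled: bool) -> str:
    """Format the response header with the server span's identifiers."""
    return f"00-{trace_id}-{server_span_id}-{'01' if sampled else '00'}"
```

A service that restarted the trace would add the result of build_traceresponse to its response headers, so that a client able to log it can still find the server-side trace.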
B3
The B3 specification (https://github.com/openzipkin/b3-propagation) was adopted by Zipkin – one of the first distributed tracing systems.
B3 identifiers can be propagated as a single b3 header in the following format:
b3: {trace-id}-{span-id}-{sampling-state}-{parent-span-id}
Another way is to pass individual components, using X-B3-TraceId, X-B3-SpanId, X-B3-ParentSpanId, and X-B3-Sampled.
The sampling state suggests whether a service should trace the corresponding request. In addition to 0 (don’t record) and 1 (do record), it allows us to force tracing with a flag set to d. It’s usually done for debugging purposes. The sampling state can be passed without other identifiers to specify the desired sampling decision to the service.
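Both B3 encodings can be read with a few lines of code. The Python sketch below is illustrative only: B3Context, parse_b3_single, and parse_b3_multi are hypothetical names, identifiers are not validated as hex, and the multi-header variant only looks at the four headers listed above.

```python
from typing import Dict, NamedTuple, Optional


class B3Context(NamedTuple):
    """Hypothetical container; any field may be absent."""
    trace_id: Optional[str]
    span_id: Optional[str]
    sampling_state: Optional[str]  # "0", "1", "d", or None if not specified
    parent_span_id: Optional[str]


def parse_b3_single(value: str) -> Optional[B3Context]:
    """Parse '{trace-id}-{span-id}-{sampling-state}-{parent-span-id}'."""
    parts = value.strip().split("-")
    # The sampling state may be sent on its own, without identifiers.
    if len(parts) == 1 and parts[0] in ("0", "1", "d"):
        return B3Context(None, None, parts[0], None)
    if not 2 <= len(parts) <= 4:
        return None
    sampling = parts[2] if len(parts) > 2 else None
    parent = parts[3] if len(parts) > 3 else None
    return B3Context(parts[0], parts[1], sampling, parent)


def parse_b3_multi(headers: Dict[str, str]) -> Optional[B3Context]:
    """Read the individual X-B3-* headers (header names are case-insensitive)."""
    lowered = {name.lower(): value for name, value in headers.items()}
    trace_id = lowered.get("x-b3-traceid")
    span_id = lowered.get("x-b3-spanid")
    sampled = lowered.get("x-b3-sampled")
    if trace_id is None and span_id is None and sampled is None:
        return None
    return B3Context(trace_id, span_id, sampled, lowered.get("x-b3-parentspanid"))
```

Returning the same B3Context shape from both functions lets the rest of an instrumentation ignore which encoding the caller used.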
Note
The key difference with W3C Trace-Context, beyond header names, is the presence of both span-id and parent-span-id. B3 systems can use the same span-id on the client and server sides, creating a single span for both.
Zipkin reuses span-id from the incoming request, also specifying parent-span-id on it. The Zipkin span represents the client and server at the same time, as shown in Figure 1.9, recording different durations and statuses for them:
Figure 1.9 – Zipkin creates one span to represent the client and server
OpenTelemetry and .NET support b3 headers but ignore parent-span-id – they generate a new span-id for every span, as it’s not possible to reuse span-id (see Figure 1.10).
Figure 1.10 – OpenTelemetry does not use parent-span-id from B3 headers and creates different spans for the client and server
Baggage
So far, we have talked about span context and correlation. But in many cases, distributed systems have application-specific context. For example, you authorize users on your frontend service, and after that, user-id is not needed for application logic, but you still want to add it as an attribute on spans from all services to query and aggregate it on a per-user basis.
You can stamp user-id once on the frontend. Then, spans recorded on the backend will not have user-id, but they will share the same trace-id as the frontend. So, with some joins in your queries, you can still do per-user analysis. It works to some extent but may be expensive or slow, so you might decide to propagate user-id and stamp it on the backend spans too.
The Baggage specification (https://www.w3.org/TR/baggage/) defines a generic propagation format for distributed context, and you can use it for business logic or anything else by adding, reading, removing, and modifying baggage members. For example, you can route requests to the test environment and pass feature flags or extra telemetry context.
Baggage consists of a list of comma-separated members. Each member has a key, a value, and optional semicolon-separated properties, in the following formats – key=value;property1;key2=property2 for a single member or key=value;property1;key2=property2,anotherKey=anotherValue for several.
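The member and property layout described above can be parsed as in the following Python sketch. The function name parse_baggage and the shape of its result are hypothetical, and error handling is reduced to skipping malformed members. Values are percent-decoded, since the Baggage specification expects them to be percent-encoded on the wire.

```python
from typing import Dict, List, Tuple
from urllib.parse import unquote


def parse_baggage(header: str) -> Dict[str, Tuple[str, List[str]]]:
    """Map each baggage key to (value, [properties])."""
    members: Dict[str, Tuple[str, List[str]]] = {}
    for member in header.split(","):         # members are comma-separated
        key_value, *props = member.split(";")  # properties follow after ';'
        key, sep, value = key_value.partition("=")
        key = key.strip()
        if not sep or not key:
            continue  # skip empty or malformed members
        properties = [prop.strip() for prop in props if prop.strip()]
        members[key] = (unquote(value.strip()), properties)
    return members
```

For the second example format above, this returns two entries: key with the value "value" and the properties property1 and key2=property2, and anotherKey with the value "anotherValue" and no properties.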
OpenTelemetry and .NET only propagate baggage, but don’t stamp it on any telemetry. You can configure ILogger to stamp baggage, but you need to enrich traces explicitly. We’ll see how it works in Chapter 5, Configuration and Control Plane.
Tip
You should not put any sensitive information in baggage, as it’s almost impossible to guarantee where it would flow – your application or sidecar infrastructure can forward it to your cloud provider or anywhere else.
Maintain a list of well-known baggage keys across your system and only use known ones, as you might receive baggage from another system otherwise.
The Baggage specification has working draft status and may still change.
Note
While the W3C Trace Context standard is HTTP-specific and B3 applies to any RPC calls, they are commonly used for any context propagation needs – for example, they are passed as the event payload in messaging scenarios. This may change once protocol-specific standards are introduced.