You're reading from Modern Distributed Tracing in .NET A practical guide to observability and performance analysis for microservices

Product type Paperback

Published in Jun 2023

Publisher Packt

ISBN-13 9781837636136

Length 336 pages

Edition 1st Edition

Tools

.NET

Concepts

Application Development

Author (1):

Liudmila Molkova



Understanding why logs and counters are not enough

Monitoring and observability cultures vary across the industry; some teams use ad hoc debugging with printf while others employ sophisticated observability solutions and automation. Still, almost every system uses a combination of common telemetry signals: logs, events, metrics or counters, and profiles. Telemetry collection alone is not enough. A system is observable if we can detect and investigate issues, and to achieve this, we need tools to store, index, visualize, and query the telemetry, navigate across different signals, and automate repetitive analysis.

Before we begin exploring tracing and discovering how it helps, let’s talk about other telemetry signals and their limitations.

Logs

A log is a record of some event. Logs typically have a timestamp, level, class name, and formatted message, and may also have a property bag with additional context.

Logs are a low-ceremony tool, with plenty of logging libraries and tools for any ecosystem.

Common problems with logging include the following:

Verbosity: Initially, we won’t have enough logs, but eventually, as we fill gaps, we will have too many. They become hard to read and expensive to store.
Performance: Logging is a common performance issue even when used wisely. It’s also very common to serialize objects or allocate strings for logging even when the logging level is disabled.

One new log statement can take your production down; I did it once. The log I added was written every millisecond. Multiplied by a number of service instances, it created an I/O bottleneck big enough to significantly increase latency and the error rate for users.

Not queryable: Logs coming from applications are intended for humans. We can add context and unify the format within our application and still only be able to filter logs by context properties. Logs change with every refactoring, disappear, or become out of date. New people joining a team need to learn logging semantics specific to a system, and the learning curve can be steep.
No correlation: Logs for different operations are interleaved. The process of finding logs describing certain operations is called correlation. In general, log correlation, especially across services, must be implemented manually (spoiler: not in ASP.NET Core).
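The performance point above can be illustrated with a minimal sketch. It assumes the Microsoft.Extensions.Logging.Console package is referenced; the Checkout category, the order object, and the Serialize helper are hypothetical names used only for illustration.

```csharp
using System;
using System.Text.Json;
using Microsoft.Extensions.Logging;

// Assumption: console provider with Information as the minimum level,
// so Debug records are disabled.
using ILoggerFactory factory = LoggerFactory.Create(builder =>
    builder.AddConsole().SetMinimumLevel(LogLevel.Information));
ILogger logger = factory.CreateLogger("Checkout"); // hypothetical category

int serializeCalls = 0;

// Hypothetical helper that counts how often we pay for serialization.
string Serialize(object value)
{
    serializeCalls++;
    return JsonSerializer.Serialize(value);
}

var order = new { Id = 42, Items = new[] { "apple", "pear" } };

// Wasteful: the argument is serialized before LogDebug is called,
// even though the Debug record is then dropped.
logger.LogDebug("Order payload: {payload}", Serialize(order));

// Cheaper: check the level first and skip the work when it is disabled.
if (logger.IsEnabled(LogLevel.Debug))
{
    logger.LogDebug("Order payload: {payload}", Serialize(order));
}

Console.WriteLine($"Serialize was called {serializeCalls} time(s)");
```

Only the unguarded call pays for serialization here. The guard is one possible mitigation, not the only one; it matters most on hot paths where arguments are expensive to compute.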

Note

Logs are easy to produce but verbose, and they can significantly impact performance. They are also difficult to filter, query, or visualize.

To be accessible and useful, logs are sent to some central place, a log management system, which stores, parses, and indexes them so they can be queried. This implies that your logs need to have at least some structure.

ILogger in .NET supports structured logging, as we’ll see in Chapter 8, Writing Structured and Correlated Logs, so you get the human-readable message, along with the context. Structured logging, combined with structured storage and indexing, converts your logs into rich events that you can use for almost anything.

Events

An event is a structured record of something. It has a timestamp and a property bag. It may have a name, or that could just be one of the properties.

The difference between logs and events is semantic: an event is structured and usually follows a specific schema.

For example, an event that describes adding an item to a shopping bag should have a well-known name, such as shopping_bag_add_item, with user-id and item-id properties. Then, you can query such events by name, item, and user. For example, you can find the top 10 popular items across all users.

If you write it as a log message, you’d probably write something like this:

logger.LogInformation("Added '{item-id}' to shopping bag for '{user-id}'",
  itemId, userId);

If your logging provider captures individual properties, you would get the same context as with events. So, now we can find every log for this user and item, which probably includes other logs not related to adding an item.
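To make the difference concrete, here is a rough sketch of the "top 10 popular items" query over events that share a schema. The in-memory list stands in for a real telemetry backend, and the sample data and the shopping_bag_remove_item event name are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical events: each is a property bag that follows the same schema.
var events = new List<Dictionary<string, string>>
{
    new() { ["name"] = "shopping_bag_add_item", ["user-id"] = "u1", ["item-id"] = "apple" },
    new() { ["name"] = "shopping_bag_add_item", ["user-id"] = "u2", ["item-id"] = "apple" },
    new() { ["name"] = "shopping_bag_add_item", ["user-id"] = "u2", ["item-id"] = "pear" },
    new() { ["name"] = "shopping_bag_remove_item", ["user-id"] = "u1", ["item-id"] = "pear" },
    new() { ["name"] = "shopping_bag_add_item", ["user-id"] = "u3", ["item-id"] = "apple" },
};

// Because the name and properties are well known, the query needs no text parsing:
// filter by event name, group by item, and keep the ten most frequent items.
var topItems = events
    .Where(e => e["name"] == "shopping_bag_add_item")
    .GroupBy(e => e["item-id"])
    .OrderByDescending(g => g.Count())
    .ThenBy(g => g.Key)
    .Take(10)
    .Select(g => (Item: g.Key, Count: g.Count()))
    .ToList();

foreach (var (item, count) in topItems)
{
    Console.WriteLine($"{item}: {count}");
}
```

Filtering by the event name is what keeps unrelated records, such as the removal event, out of the result. With free-form log messages, the same filter would have to rely on matching message text.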

Note

Events with consistent schema can be queried efficiently but have the same verbosity and performance problems as logs.

Metrics and counters

Logs and events share the same problems: verbosity and performance overhead. One way to solve them is aggregation.

A metric is a value of something aggregated by dimensions and over a period of time. For example, a request latency metric can have an HTTP route, status code, method, service name, and instance dimensions.
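As a minimal sketch of this definition, the following code records request latency with two dimensions using the System.Diagnostics.Metrics API available in .NET 6 and later. The meter name Demo.Http, the instrument name, and the in-process aggregation are assumptions for illustration; a real setup would typically export measurements to a metrics backend instead.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics.Metrics;

var meter = new Meter("Demo.Http"); // hypothetical meter name
var latency = meter.CreateHistogram<double>("request_latency", unit: "ms");

// One entry per combination of dimension values, that is, per time series.
var series = new Dictionary<string, (long Count, double Sum)>();

using var listener = new MeterListener();
listener.InstrumentPublished = (instrument, l) =>
{
    if (instrument.Meter.Name == "Demo.Http") l.EnableMeasurementEvents(instrument);
};
listener.SetMeasurementEventCallback<double>((instrument, value, tags, state) =>
{
    // Build the series key from the dimensions attached to this measurement.
    var parts = new List<string>();
    foreach (var tag in tags) parts.Add($"{tag.Key}={tag.Value}");
    var key = string.Join(",", parts);

    // Aggregate within the series instead of keeping every measurement.
    series.TryGetValue(key, out var agg);
    series[key] = (agg.Count + 1, agg.Sum + value);
});
listener.Start();

// Hypothetical helper that records one request with its dimensions.
void Record(double ms, string route, int status) =>
    latency.Record(ms,
        new KeyValuePair<string, object?>("route", route),
        new KeyValuePair<string, object?>("status", status));

Record(12, "/items", 200);
Record(18, "/items", 200);
Record(250, "/items", 500);

foreach (var (combo, stats) in series)
{
    Console.WriteLine($"{combo}: count={stats.Count}, avg={stats.Sum / stats.Count} ms");
}
```

Three measurements collapse into two aggregated series, which is why the stored volume of a metric depends on the number of dimension combinations rather than on the number of requests.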

Common problems with metrics include the following:

Cardinality: Each combination of dimensions is a time series, and aggregation happens within one time series. Adding a new dimension causes a combinatorial explosion, so metrics must have low cardinality – that is, they cannot have too many dimensions, and each one must have a small number of distinct values. As a result, you can’t measure granular things such as per-user experience with metrics.
No causation: Metrics show only correlation, not cause and effect, so they are not a great tool for investigating issues.

As an expert on your system, you might use your intuition to come up with possible reasons for certain types of behavior and then use metrics to confirm your hypothesis.

Verbosity: Metrics have problems with verbosity too. It’s common to add metrics that measure just one thing, such as queue_is_full or queue_is_empty. Something such as queue_utilization would be more generic. Over time, the number of metrics grows along with the number of alerts, dashboards, and team processes relying on them.
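The cardinality problem described above comes down to simple arithmetic: the upper bound on the number of time series is the product of the distinct values of each dimension. The sketch below uses made-up numbers of distinct values; in practice, not every combination occurs, so the real count can be lower.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical numbers of distinct values per dimension.
var dimensions = new Dictionary<string, long>
{
    ["route"] = 20,
    ["status_code"] = 5,
    ["method"] = 4,
    ["instance"] = 10,
};

// Upper bound on time series: the product of distinct values across dimensions.
long SeriesCount(IEnumerable<long> distinctValues) =>
    distinctValues.Aggregate(1L, (acc, n) => acc * n);

long before = SeriesCount(dimensions.Values); // 20 * 5 * 4 * 10

// Adding a per-user dimension multiplies every existing series by the number of users.
dimensions["user_id"] = 100_000;
long after = SeriesCount(dimensions.Values);

Console.WriteLine($"Without user_id: up to {before} time series");
Console.WriteLine($"With user_id: up to {after} time series");
```

With these assumed numbers, one extra dimension takes the bound from 4,000 to 400,000,000 series, which is why per-user measurements do not fit well into metrics.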

Note

Metrics have low impact on performance, low volume that doesn’t grow much with scale, low storage costs, and low query time. They are great for dashboards and alerts but not for issue investigation or granular analytics.

A counter is a single time series: it's a metric without dimensions, typically used to collect resource utilization such as CPU load or memory usage. Counters don’t work well for application performance or usage, as you need a dedicated counter for each combination of attributes, such as HTTP route, status code, and method. Such counters are difficult to collect and even harder to use. Luckily, .NET supports metrics with dimensions, and we will discuss them in Chapter 7, Adding Custom Metrics.

What’s missing?

Now you know all you need to monitor a monolith or small distributed system – use metrics for system health analysis and alerts, events for usage, and logs for debugging. This approach has taken the tech industry far, and there is nothing essentially wrong with it.

With up-to-date documentation; a few key performance and usage metrics; concise, structured, correlated, and consistent events; and common conventions and tools across all services, anyone operating your system can do performance analysis and debug issues.

Note

So, the ultimate goal is to efficiently operate a system, and the problem is not a specific telemetry signal or its limitations but a lack of standard solutions and practices, correlation, and structure for existing signals.

Before we jump into distributed tracing and see how its ecosystem addresses these gaps, let’s summarize the new requirements for the perfect observability solution that we intend to meet with tracing and the new capabilities it brings. We should also keep in mind the old capabilities: low performance overhead and manageable costs.

Systematic debugging

We need to be able to investigate issues in a generic way. From an error report to an alert on a metric, we should be able to drill down into the issue, follow specific requests end to end, or bubble up from an error deep in the stack to understand its effect on users.

All this should be reasonably easy to do when you’re on call and paged at 2 a.m. to resolve an incident in production.

Answering ad hoc questions

I might want to understand whether users from Redmond, WA, who purchased a product from my website are experiencing longer delivery times than usual and why – because of the shipment company, rain, cloud provider issues in this region, or anything else.

Answering most usage or performance questions should not require adding more telemetry. Occasionally, you’d need to add a new context property or an event, but this should be rare on a stable code path.

Self-documenting systems

Modern systems are dynamic: with continuous deployments, feature flag changes at runtime, and dozens of external dependencies with their own instabilities, nobody can know everything.

Telemetry becomes your single source of truth. Assuming it has enough context and common semantics, an observability vendor should be able to visualize it reasonably well.

Auto-instrumentation

It’s difficult to instrument everything in your system: the work is repetitive and error-prone, and it is hard to keep instrumentation up to date, test it, and enforce a common schema and semantics. We need shared instrumentations for common libraries so that we only have to add application-specific telemetry and context.

With an understanding of these requirements, we will move on to distributed tracing.
