Observability — Distributed Request Tracing with OpenTelemetry
When I started learning about Observability, especially distributed request tracing, I encountered repeated theoretical phrases and definitions almost everywhere, which didn’t make sense to me at first. I had to scour the entire internet to piece together these bits and pieces until they started to make sense. So, in this article, I’m not going to reiterate the same theoretical concepts over and over. Instead, I am going to build the discussion around the sources of truth that I found most helpful in understanding the fundamental concepts. My primary goal is to lay a foundation for understanding the basics of request tracing in distributed architectures. Finally, I will conclude by discussing how we can leverage OpenTelemetry, one of the most widely adopted standards in this context.
Let’s start the discussion with the most common definition of Observability.
Observability lets us understand a system from the outside, by letting us ask questions about that system without knowing its inner workings. Furthermore, it allows us to easily troubleshoot and handle novel problems (i.e. “unknown unknowns”), and helps us answer the question, “Why is this happening?”
In order to be able to ask those questions of a system, the application must be properly instrumented.
Let’s think this through with an analogy. When you visit a doctor, let’s say for the flu, she won’t cut you open to see what’s wrong. Instead, she will use tools like a stethoscope and a thermometer to observe the signals emitted by your body. Based on these signals, she can gain insights into what is happening. A similar strategy is used when observing software: we monitor the signals emitted by software systems to understand their internals.
Q: What are these signals?
Most commonly logs, metrics, and traces (or any other output you can think of that helps you observe the system).
Q: What exactly is instrumentation?
Instrumentation means adding code to make the software emit the signals mentioned above. Logs are the most familiar example of instrumentation.
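As a minimal sketch of what “adding code to emit signals” can look like (my own illustration, not from the article; the class, logger, and method names are assumptions), the method below emits two signals on every call: a structured log line and a crude duration measurement.

```java
// A minimal sketch of manual instrumentation (illustrative names).
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class PaymentService {

    private static final Logger log = LoggerFactory.getLogger(PaymentService.class);

    public void charge(String orderId) {
        long start = System.nanoTime();
        log.info("charge started orderId={}", orderId);            // signal: log

        // ... the actual business logic would go here ...

        long durationMs = (System.nanoTime() - start) / 1_000_000;
        log.info("charge finished orderId={} durationMs={}", orderId, durationMs); // signal: timing
    }
}
```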
Q: How can these signals be used efficiently to observe a system? In which context will they be most useful?
This is where things get interesting…
Signals can emit a wealth of details about the internals of your application. You can use this information in various contexts. For instance, emitted signals can help debug the application, check its health, or ensure performance requirements are met.
- To debug your app, you could use logs or traces.
- To check whether your application is healthy, you could introduce a health endpoint, or you could simply check the logs.
- Latency measurements for a critical business function could be derived from logs, metrics, or traces.
You can leverage these signals in any practical way that meets your needs. The point here is that signals can be used to achieve multiple objectives, each signal type has its pros and cons, and utilizing them effectively is up to you (as long as you know what you’re doing).
Q: When it comes to end-to-end (e2e) request tracing, which signals can we utilize?
Enabling end-to-end (e2e) tracing means we are enhancing the debuggability of our solution. If your solution is single-threaded and monolithic, you can simply use logs for e2e request tracing. There’s nothing wrong with that as long as you have the necessary information to trace the request path.
For solutions with multiple process or service interactions, you could pass a unique ID along with each request and log it at every hop to track the request path; a minimal sketch of this idea appears after the note below. You could also use tracing, which originated to solve this very problem.
(Tracing can be used regardless of whether your solution is monolithic or consists of multiple services.)
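Here is a rough sketch of the unique-ID approach in Java (my own illustration; the `X-Correlation-Id` header name, the downstream URL, and the log format are assumptions): the first service generates a correlation ID, logs it, and forwards it in a header so every downstream service can log the same value and the request path can be reconstructed from the logs.

```java
// A sketch of correlation-ID-based request tracking (illustrative only).
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.UUID;

public class CorrelationIdExample {

    public static void main(String[] args) throws Exception {
        // Generate one unique ID per incoming request and log it.
        String correlationId = UUID.randomUUID().toString();
        System.out.println("correlationId=" + correlationId + " msg=\"checkout request received\"");

        // Forward the same ID to the next service so it can log it too.
        HttpRequest request = HttpRequest.newBuilder(URI.create("http://inventory-service/reserve"))
                .header("X-Correlation-Id", correlationId)
                .GET()
                .build();
        HttpResponse<String> response =
                HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());

        System.out.println("correlationId=" + correlationId
                + " msg=\"inventory responded\" status=" + response.statusCode());
    }
}
```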
Q: What is the exact difference between logs and traces? When should we use one over the other?
Using traces to track a request from one end to the other is more straightforward, as they were specifically designed to solve this problem, and they really shine in distributed software architectures. A distributed trace is defined as a collection of spans.
A span can also be considered a structured log, encapsulating information about correlation — who is calling whom — along with other metadata that we can use to form a single trace. Traditional logs, however, aren’t always useful for tracking code execution, as they typically lack contextual information, such as where they were called from.
A Span looks like a structured log. That’s because it kind of is!
One way to think of Traces is that they’re a collection of structured logs with context, correlation, hierarchy, and more baked in. However, these “structured logs” can come from different processes, services, VMs, data centers, and so on. This is what allows tracing to represent an end-to-end view of any system.
Let’s try to understand the depth of the problem we are trying to solve here a bit.
- Large organizations like Netflix, which handle millions of requests, often need to answer questions about a single user’s request, for example why it failed or why it was slow. To answer these types of questions productively, you can’t rely solely on a solution based on logs. Logs and metrics still have their place, but to see the full picture, you need to trace the request from the user’s device to whatever storage Netflix uses to store its content.
- Companies like Uber, which provide multiple products such as ride-hailing and food delivery, have a direct impact on people’s daily lives. At this scale, being able to trace a request end-to-end to troubleshoot issues is a must. (Uber has documented its journey in Observability, particularly in Distributed Tracing, explaining how they started with an in-house solution and later adopted open-source standards. Pretty interesting stuff! I have referenced them at the end of this post.)
Even if your operations are not as complex as those of Uber or Netflix, imagine the complexity of having to maintain even a few services. Often, you may have to use external services from cloud vendors or third parties, over which you have no control or visibility. With all the network calls, delays, retries, and unknown parameters with varying loads, it is always a wise idea to enable request tracing to effectively diagnose and resolve issues.
Q: How exactly does distributed request tracing work?
We have decoupled services from each other for obvious reasons, but by doing so, we sacrifice the debuggability of our solution. Now, we want to know who is calling whom, how long it takes to complete a single task, what the slowest operation is, which services are contributing to the overall slowdown, the order of service interactions, why a particular service X is not meeting its SLO, which requests are taking much more time, and the paths they have taken.
To answer these questions more productively in a distributed environment, we use distributed request tracing. Simply put, we are creating an “object” with a unique ID and a collection of key-value pairs, including details about who is calling whom and other contextual information (which we call a span). Services export these pieces of information to an external observability platform, where they can be aggregated, analyzed, and visualized to gain insights into the overall system behavior.
Another piece of information, the context, is passed along with the request itself: it carries what the sending and receiving services need to correlate one signal with another (the trace identifier, span identifier, and so on). This is known as context propagation, and it ensures that as the request travels through various services, each service has access to the context it needs to continue the trace.
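To make this concrete, here is a minimal sketch using the OpenTelemetry Java API (the service, operation, and attribute names are my own; the SDK and exporter wiring is shown later in the OTLP example): a span is started for the incoming operation, enriched with key-value attributes, and the current context is injected into the headers of an outgoing call so the next service can continue the same trace.

```java
// A minimal sketch of span creation and context propagation with the
// OpenTelemetry Java API (names like "checkout-service" are illustrative).
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.SpanKind;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Context;
import io.opentelemetry.context.Scope;
import io.opentelemetry.context.propagation.TextMapSetter;

import java.util.HashMap;
import java.util.Map;

public class CheckoutHandler {

    private static final Tracer tracer = GlobalOpenTelemetry.getTracer("checkout-service");

    // Writes context key/value pairs into an outgoing "carrier" (here, a header map).
    private static final TextMapSetter<Map<String, String>> setter = Map::put;

    public void handleCheckout(String orderId) {
        // The "object with a unique ID and key-value pairs": a span.
        Span span = tracer.spanBuilder("checkout").setSpanKind(SpanKind.SERVER).startSpan();
        try (Scope ignored = span.makeCurrent()) {
            span.setAttribute("order.id", orderId);

            // Context propagation: put the trace/span identifiers into the headers
            // of the outgoing request so the next service can continue the trace.
            Map<String, String> outgoingHeaders = new HashMap<>();
            GlobalOpenTelemetry.get().getPropagators().getTextMapPropagator()
                    .inject(Context.current(), outgoingHeaders, setter);
            // ... make the downstream call with outgoingHeaders attached ...
        } finally {
            span.end(); // the configured SDK exports the finished span to a backend
        }
    }
}
```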
Q: How can we instrument the codebase to enable distributed request tracing? Are there any mainstream libraries like the ones we’ve been using for logs?
Various vendors have put a tremendous engineering effort into solving this very problem. You have probably encountered the sheer number of tools provided by different organizations: Jaeger, Zipkin, Grafana Labs, AWS, Google Cloud, Azure, Splunk. Oh boy, the list goes on. Each vendor provides a comprehensive toolkit for addressing your observability needs (not only tracing), and the scope of the observability solution they offer differs from vendor to vendor.
But the downside is that once you instrument with one vendor-specific library, it is not straightforward to combine those spans with spans produced by another library. I found a great article on Medium that explains how e2e request tracing can be achieved in practice.
But in an ideal world we should avoid vendor lock-in to any particular instrumentation library.
Q: Don’t we have a standard approach?
Well, the thing about standards (not just in software but in every aspect of life) is that there are usually many competing ones. But if you follow a widely adopted standard, you get to avoid vendor lock-in.
OpenTracing and OpenCensus were two such competing standards; they merged to form a single standard called OpenTelemetry. OpenTelemetry is an observability framework that supports not only traces but also logs and metrics.
Q: How could OpenTelemetry be useful in distributed tracing?
Once you instrument your code to enable tracing using OpenTelemetry, you have multiple choices from there:
- You could leverage the OpenTelemetry Protocol (OTLP). OTLP is a telemetry data delivery protocol, and most observability backends can consume OpenTelemetry natively via OTLP. That means once we instrument our code using OpenTelemetry, we can export our traces to a wide array of vendors.
- You could also leverage an OpenTelemetry Collector. It is a vendor-agnostic implementation for receiving, processing, and exporting telemetry data.
- Or you could directly export your traces to an observability backend. For instance, in Java you could use an OTLP span exporter to send OTel traces directly to Jaeger (a wiring sketch follows below).
(“Back in May of 2022, the Jaeger project announced native support for the OpenTelemetry Protocol (OTLP). This followed a deprecation of the Jaeger client libraries across many languages. With these changes, OpenTelemetry users are now able to send traces into Jaeger with industry-standard OTLP, and the Jaeger client library repositories have been finally archived” — source)
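As a rough illustration of the first and third options (my own sketch, not a prescribed setup; the service name and endpoint are assumptions), the snippet below wires the OpenTelemetry Java SDK to batch spans and export them over OTLP to whatever is listening on the default OTLP/gRPC port 4317, which could be an OpenTelemetry Collector or an OTLP-capable backend such as Jaeger.

```java
// A sketch of configuring the OpenTelemetry Java SDK with an OTLP exporter
// (illustrative values; "checkout-service" and the endpoint are assumptions).
import io.opentelemetry.api.OpenTelemetry;
import io.opentelemetry.api.common.AttributeKey;
import io.opentelemetry.api.common.Attributes;
import io.opentelemetry.api.trace.propagation.W3CTraceContextPropagator;
import io.opentelemetry.context.propagation.ContextPropagators;
import io.opentelemetry.exporter.otlp.trace.OtlpGrpcSpanExporter;
import io.opentelemetry.sdk.OpenTelemetrySdk;
import io.opentelemetry.sdk.resources.Resource;
import io.opentelemetry.sdk.trace.SdkTracerProvider;
import io.opentelemetry.sdk.trace.export.BatchSpanProcessor;

public class TracingSetup {

    public static OpenTelemetry init() {
        // Identify this service in every exported span.
        Resource resource = Resource.getDefault().merge(Resource.create(
                Attributes.of(AttributeKey.stringKey("service.name"), "checkout-service")));

        // Export batched spans over OTLP/gRPC to a Collector or an OTLP-capable
        // backend (e.g. Jaeger) listening on the default port 4317.
        OtlpGrpcSpanExporter exporter = OtlpGrpcSpanExporter.builder()
                .setEndpoint("http://localhost:4317")
                .build();

        SdkTracerProvider tracerProvider = SdkTracerProvider.builder()
                .setResource(resource)
                .addSpanProcessor(BatchSpanProcessor.builder(exporter).build())
                .build();

        // Register globally so GlobalOpenTelemetry.getTracer(...) picks this up,
        // and propagate context using the W3C Trace Context headers.
        return OpenTelemetrySdk.builder()
                .setTracerProvider(tracerProvider)
                .setPropagators(ContextPropagators.create(W3CTraceContextPropagator.getInstance()))
                .buildAndRegisterGlobal();
    }
}
```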