Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow to specify ActivityContext of new Activity to StartActivity(..) to unblock interop with real-world vendors #42786

Closed
macrogreg opened this issue Sep 27, 2020 · 19 comments
Labels
area-System.Diagnostics.Activity enhancement Product code improvement that does NOT require public API changes/additions
Milestone

Comments

@macrogreg
Copy link

macrogreg commented Sep 27, 2020

Another one :)
(On a personal note: I really wished we had started this work stream at the OTel group earlier, so that these requests had more chance to flow into .NET 5, but I believe that it still has value to point out potential improvement opportunities. :) )

I propose that from the perspective of Open Telemetry, it is a critical scenario that a distributed operation can be traced across microservices that are monitored by a heterogeneous vendor ecosystem, using manual and/or auto-instrumentation. Such vendor systems may use proprietary, non-W3C-based trace and span IDs. As long as each local vendor system interoperates with the W3C trace context in a manner compliant to the W3C specification, the trace context will flow correctly across service boundaries. For that, the .NET Activity API MUST allow explicitly specifying the W3C trace context details (trace and span IDs) for new spans in a manner compliant with the W3C recommendation [L1, L2]. Currently, that is not supported.

(I'd be very happy to be corrected in I am missing something. :) )

W3C specifies how to interoperate between non-W3C-based trace and span IDs potentially used by vendors and W3C-compliant trace contexts:
On a high level, that involves:

  • taking the vendor-specific trace/span ID,
  • using it to generate the W3C trace context in a well-defined, compliant manner (see the referenced spec section, and other sections in the W3C doc)
  • setting the trace context generated in this manner on the current/new span

At the very least, if should be possible to restrict the number of significant bytes in the IDs, as described by W3C [L1, L2].

However, that is currently not possible with the ActivitySource based API. It is possible to supply the parent context, but the user has no control over the generation of the trace context for the new span.

We should consider adding ActivitySource.StartActivity(..) overloads that allow specifying an ActivityContext for the span being created. If this is too late for this in .NET5, we should make and document an explicit workaround recommendation.

@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added area-System.Diagnostics.Tracing untriaged New issue has not been triaged by the area owner labels Sep 27, 2020
@ghost
Copy link

ghost commented Sep 27, 2020

Tagging subscribers to this area: @tarekgh, @tommcdon, @pjanotti
See info in area-owners.md if you want to be subscribed.

@macrogreg
Copy link
Author

@macrogreg macrogreg changed the title Allow to specify ActivityContext of new activity to StartActivity(..) to unblock interop with real-world vendors Allow to specify ActivityContext of new Activity to StartActivity(..) to unblock interop with real-world vendors Sep 27, 2020
@tarekgh tarekgh added enhancement Product code improvement that does NOT require public API changes/additions and removed untriaged New issue has not been triaged by the area owner labels Sep 27, 2020
@tarekgh tarekgh added this to the Future milestone Sep 27, 2020
@tarekgh
Copy link
Member

tarekgh commented Sep 27, 2020

This scenario is currently not supported. It is something we can consider looking at in the future releases.

@macrogreg
Copy link
Author

Thanks, @tarekgh /
I realize the current sport in the release cycle. :) Please let me know if you guys consider these submissions as a valuable input for forward looking planning / for the next cycle. I do not want to create a distraction if it is not productive. :)

Thanks!

@tarekgh
Copy link
Member

tarekgh commented Sep 28, 2020

@macrogreg your feedback is valuable please don't stop sending it. I was just trying to clarify that we are done with 5.0. Thank you too.

@tarekgh tarekgh modified the milestones: Future, 6.0.0 Sep 28, 2020
@macrogreg
Copy link
Author

Thanks for confirming, @tarekgh .
Yup agreed, at some point .NET 5 needed to be locked down.
Looking forward to a discussion in the context of future releases. :)

@noahfalk
Copy link
Member

noahfalk commented Sep 28, 2020

@macrogreg - I don't think I fully understand how the scenario is ideally intended to work. Could you walk me through a simple example? The setup I am imagining is we have some non-W3C system sending a message to a monitored .NET service. The message still has some type of correlation identifier attached to it but it doesn't follow trace-context conventions. It sounds like the goal was inside the .NET process we want to generate a W3C compliant id and I am guessing a monitoring tools like datadog wants to record some telemetry establishing the parent-child relationship at the boundary between these two services. Assuming you had a free hand to implement the behavior of the .NET process however you wanted to get the best result...

  • what example id info arrives in the message?
  • what example id info is created within the .NET process?
  • what telemetry is logged that records receiving this request and what identifiers does it contain?

Thanks!

@macrogreg
Copy link
Author

Hi @noahfalk ,

When I submitted this last weekend I operated in the mode "it's unlikely that we can change anything for .NET 5, but let's still raise the point". For a subsequent release we may want to consider going one step further:

We may face an architectural weakness here:
Is a Activity a data exchange type of an Algorithm implementation type? It seems to be both, which tends to create challenges.
OTel addresses similar challenges by separating API from SDK. Activity feels like a data exchange type (API style). It represents a sub-operation with some properties. However, the mechanism of creating IDs and propagating them belongs into the area of OTel SDK. It depends on the application and the monitoring environment. Perhaps in the time frame of a future release, Trace Context should be more abstract and constructed via a dependency injection mechanism?

Let me switch to more concrete case. Let me use Datadog to describe how I would implement context propagation in a potentially heterogenous monitoring environment. (I think other vendors would be in a very similar spot.) Together, we can decide if this makes sense or if something is missing. :)

W3C defined the length of the traceId and the spanId. It recommends it to be random.
A specific vendor might:

  • Use IDs of a different length.
  • Give some bits in the IDs a semantic meaning.

This case is particularly easy, because Datadog uses:

  • Shorter Trace IDs
  • Same length Span IDs
  • No semantic meaning of ID bits.
    The approach I describe here may not for other vendors who have stronger limitations. Of course, I'd be happy to be taught otherwise. :)

Datadog is using a shorter trace ID (for historical reasons, 16 bytes instead of 32 as recommended by W3C). We also have a specific algorithm to generate those IDs that we prefer over the Guid-based approach built into Activity. We use proprietary headers to propagate trace context.

  • If a requests arrives and it carries only a W3C trace info. We will use it to configure the parent context for the local root activity. When sending a record of this local root activity to our backend we will:

    • use the 16 bytes parent span ID that was received for our parent span ID
    • the 16 bytes span ID that were automatically generated then this Activity was created for the span ID
    • and the the lower 16 bytes of the 32 bytes trace id for the trace id.
      When we propagate downstream, we will add both, W3C headers and Datadog headers. The later will contain the information that we used when recording the this activity.
  • If a request arrives and it carries both, Datadog and W3C trace info, we will prefer the Datadog info. We will construct a parent activity context by using the data from our proprietary headers. If it matched the W3C information, then great, if not, we will still use ours.

    • If the 16-byte-long trace id information in Datadog headers matches the information in the lower-16-bytes of the W3C headers, we will use all of this information when constructing the parent context. That will the trace context will follow correctly in case that downstream monitoring is a different vendor. When sending a record to our backend and when constructing Datadog headers for downstream calls, we will use the lower 16 bytes only.
    • If the Datadog and W3C headers do not match as above, we will ignore the W3C headers. We will use the 16 bytes form the Datadog headers to construct the trace ID of the parent context for the local root span, and the top 16 bytes of the trace ID will be set to zero or to some well-defined constant. When we propagate downstream, we will add both, W3C headers and Datadog headers, the W3C headers will be based on the trace context we just constructed. I.e. we ensure continuity of a trace whenever possible, but in case of a contradiction, we prefer our internal info.
  • A request does not carry W3C or Datadog headers. So there is no parent Span. We need to "invent" a new Trace ID, but we need to set it not as a parent context, but as the trace id of this span (Activity). That is the missing APIs I am referring to. So .NET will invent a 32 bytes Trace ID for us. Just like above, we will use the lower 16 bytes of it for storing records and for propagating a DD header, and the full 32 bits for propagating in a W3C trace context header. What I prefer to see instead is same as in the case above - the ability to generate a 16 bytes trace ID and to fill it with a constant to 32 bytes. That way the case "there was no incoming W3C info that we could use" would be handled consistently.

@noahfalk
Copy link
Member

Thanks @macrogreg : )

When we propagate downstream, we will add both, W3C headers and Datadog headers. The later will contain the information that we used when recording the this activity.

Does this mean the id that flows in-process would be the complete 32 byte trace id or you would prefer to truncate it to the 16 byte id with zeros? It was unclear what value is in that outbound W3C header.

A request does not carry W3C or Datadog headers. So there is no parent Span. We need to "invent" a new Trace ID, but we need to set it not as a parent context, but as the trace id of this span (Activity). That is the missing APIs I am referring to

Are you anticipating all Activities to have 16 byte zero filled ids, or just certain ones that DataDog code is responsible for creating?

@macrogreg
Copy link
Author

When we propagate downstream, we will add both, W3C headers and Datadog headers. The later will contain the information that we used when recording the this activity.

Does this mean the id that flows in-process would be the complete 32 byte trace id or you would prefer to truncate it to the 16 byte id with zeros? It was unclear what value is in that outbound W3C header.

I am not sure I understand the question. When propagating downstream, we would include both DD and W3C header because we do not know what monitoring solution the downstream system is using.

  • If we received both W3C and DD from upstream and they agree (lower 16 bytes match), we use the fully received W3C id (all 32 bytes) when creating the downstream W3C header. This way the context is propagated without breaking.
  • If we received both W3C and DD from upstream and they do not agree (lower 16 bytes do not match), we would prefer the incoming DD header. We basically no longer trust the W3C info and throw it away. Instead, we create a new W3C id based on the DD info that we have. Lower 16 bytes of that new id will be a copy of the DD id, the upper 16 bytes will be a constant. When propagating downstream we will use that new W3C id (and the DD header). So a downstream system will see consistent values in these 2 headers.

A request does not carry W3C or Datadog headers. So there is no parent Span. We need to "invent" a new Trace ID, but we need to set it not as a parent context, but as the trace id of this span (Activity). That is the missing APIs I am referring to

Are you anticipating all Activities to have 16 byte zero filled ids, or just certain ones that DataDog code is responsible for creating?

If valid & consistent W3C info came in with the request we can use all 32 bytes of it in the way described above. If the Trace ID is invented by a DD system, it should only populate 16 bytes of the trace ID and leave the rest constant. However, the current API forces us to leave the trace id creation to the Activity internal logic. It will populate all 32 bytes and we will use only 16 of them for DD ids.

See what I mean? :)

@noahfalk
Copy link
Member

I am not sure I understand the question

Then you guessed well because you still answered it. I wanted to know if you would preserve all 32 bytes of the W3C identifier when you propogate the id in memory and you confirmed that yes, you would (assuming it matches the DataDog header).

If the Trace ID is invented by a DD system, it should only populate 16 bytes of the trace ID

I am trying to figure out in what circumstances you want DD to be in charge of inventing these IDs. For example the code below is a hypothetical customer console app that has no pre-existing W3C or DataDog header (because there was no incoming message at all). Are you aiming for DataDog to gain control of the TraceId generated for this Activity? I'm guessing no but I am still trying to fill in the gaps in my understanding : )

public void Main(string[] args)
{
    using(Activity? a = source.StartActivity("MyActivity"))
    {
        SendSomeHttpMessages();
    }
}

@macrogreg
Copy link
Author

I think this is a great example. If there is no Listener then there is no Activity.
And if there is a Listener, then I am proposing, the Listener should invent the ID.
Or perhaps it is not the listener, but a different dependency injection mechanism (since there may be several listeners, but only one can invent IDs).

The meta point is that the ID format is the "business" of the telemetry consumer, not the telemetry producer.
Datadog is "lucky" in the sense that we can always map W3C trace ids to DD ids by simply throwing away 16 upper bytes. But looking at this from the BCL team's perspective - can we be sure that it's the case for all vendors?

@noahfalk
Copy link
Member

I think this is a great example. If there is no Listener then there is no Activity.
And if there is a Listener, then I am proposing, the Listener should invent the ID

I am assuming DataDog would have done some code-less attach to get a listener hooked up, right? So you are saying in this scenario you are hoping for DataDog to be in control of creating the ID?

The meta point is that the ID format is the "business" of the telemetry consumer, not the telemetry producer

While I agree that consumers are the ones who care, increasingly there isn't a single consumer and the app developer may play a role as both consumer and producer. ILogger puts the IDs in log messages, customers can write their own ActivityListeners, and OpenTelemetry/code-less instrumentation may be storing the IDs as well. It feels like your goal to have DataDog be able to control the IDs of all root Activities is fundamentally at odds with the platform goal to use a standard format. Instead of seeking to control the ID that shows up on Activity.ID, I would treat it as a W3C based system that you need to interoperate with. I think you have several mechanisms at your disposal to do so:

  • You can intercede what ID is logged in DataDog's database, which need not be identical to the W3C id (truncate it)
  • You can intercede about what headers are written on the outbound messages
  • You can store additional custom id information in the TraceState
  • We can collaborate on mechanisms that give you more control when an inbound message that contains a DataDog header needs to create an Activity. These extensibility points aren't really part of Activity, they are part of networking stacks like Kestrel or GRPC.

@macrogreg
Copy link
Author

For Datadog we do not have a problem, we can follow the strategy described earlier (see remarks above). I am trying to look at it form the BCL perspective. It is also relevant as we are working on OTel auto-instrumentation. There, we do not know who the vendor is and what ID format they are using, so I am trying to be generic. :)

There may be many exporters. None of them is the "leader" so they cannot be determining the id.
However, the application may have an opinion on where the telemetry should go and it may want to dependency-inject the leader?

It feels like your goal to have DataDog be able to control the IDs of all root Activities is fundamentally at odds with the platform goal to use a standard format.

Is the standard format really a goal? If yes, OK. But (arguably), if Activity is a data exchange type, it should not prescribe a distributed id format. After all, the distributed trace may use different languages, vendors, versions for the respective microservices. And:

We can collaborate on mechanisms that give you more control when an inbound message that contains a [whichever] header needs to create an Activity. These extensibility points aren't really part of Activity, they are part of networking stacks like Kestrel or GRPC.

Exactly, that's what I am getting at. :)
Perhaps Activity should just have a super fast numeric locally unique id and then an abstract representation of a distributed ID. Creation and propagation of such distributed ID would be dependency-injected and decided by the application or auto-instrumentation. .NET would include an implementation of that for W3C, but in general, it would be one of may possible implementations. E.g. B3 format is cross-vendor an may be even more frequently used as W3C in practice (I do not have data, but it is at least popular to some degree).

@noahfalk
Copy link
Member

Is the standard format really a goal?

Yes : ) For multiple reasons:

  • W3C is aiming to set an industry standard and .NET is taking a bet on it. One goal of using the standard is so that we don't need to do large amounts of work supporting other vendor defined formats.
  • Activity is expected to update the ID when new child activities are formed. That requires update rules that are format specific.
  • For performance and usability the ID is exposed with strongly typed APIs, not as an opaque blob.
  • The ID isn't treated as an opaque string by its consumers either, it has sub-structure such as the split between flags, spanId, and traceId. This requires everyone to agree how to interpret the data to extract this information or these consumers would be unable to interoperate.
  • There is ultimately only one thing that gets to populate Activity.ID. If the platform isn't willing to standardize it then we have to delegate the responsibility for controlling it to some other component. That likely leads to multiple components competing to determine which of them gets control and probably incompatibility from whichever components aren't delegated control. These are messy problems I'd like to avoid.

We can collaborate on mechanisms that give you more control when an inbound message that contains a [whichever] header needs to create an Activity. These extensibility points aren't really part of Activity, they are part of networking stacks like Kestrel or GRPC.

Exactly, that's what I am getting at. :)

I'm not suggesting that you get to set an arbitrary format, I'm suggesting you get to define what W3C trace context will be treated as the parent, if any.

@macrogreg
Copy link
Author

Activity is expected to update the ID when new child activities are formed. That requires update rules that are format specific.

I am not sure I understand. :)

W3C is aiming to set an industry standard and .NET is taking a bet on it. One goal of using the standard is so that we don't need to do large amounts of work supporting other vendor defined formats.

Fair enough. This can be a problem for a vendor who is not using W3C. E.g. OTel is not (yet) using it. In the long-term this makes complete sense for .NET as having this standardization would make things easier.

On the other hand, if a vendor is not using a W3C compatible system they may be disincentivized from using Activities and ship their own libraries that represent Spans.

There is ultimately only one thing that gets to populate Activity.ID. If the platform isn't willing to standardize it then we have to delegate the responsibility for controlling it to some other component. That likely leads to multiple components competing to determine which of them gets control and probably incompatibility from whichever components aren't delegated control. These are messy problems I'd like to avoid.

Gotcha. I completely see the desire for simplicity. Its just that everyone who wants to use Activities must be W3C compatible and as a result adoption may be slower. Perhaps, this is a fine bet to take. But let's be aware of the risk. :)

@noahfalk
Copy link
Member

Activity is expected to update the ID when new child activities are formed. That requires update rules that are format specific.

I am not sure I understand. :)

SpanId changes at each child and this is part of trace context.

Perhaps, this is a fine bet to take. But let's be aware of the risk. :)

I hope so, it is the bet we took : ) Risk acknowledged!

@noahfalk
Copy link
Member

OTel is not (yet) using it

Your link is refering to wire formats. For in-memory formats (the Span in OTel or Activity in .NET), OTel is standardized on W3C.

Activity itself doesn't specify any behavior for how an ID is deserialized from an inbound message or serialized onto an outbound one. The bridge between wire formats and in-memory formats is controlled by the networking libraries with varying amounts of customization opportunity for 3rd parties.

@tarekgh
Copy link
Member

tarekgh commented Jan 21, 2021

Closing this issue as we are going to track the solution proposal in the issue #46704. Feel free to comment there. You may look at the proposed change in the comment #46704 (comment). Thanks!

@tarekgh tarekgh closed this as completed Jan 21, 2021
@ghost ghost locked as resolved and limited conversation to collaborators Feb 20, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-System.Diagnostics.Activity enhancement Product code improvement that does NOT require public API changes/additions
Projects
None yet
Development

No branches or pull requests

4 participants