Propagating SamplingFlags as per W3CTraceContext guidelines. #2156

cijothomas · 2021-01-22T00:24:43Z

Background:
This SDK has adopted W3CTraceContext recommendation as the default for distributed context propagation. In recent version of .NET Core (3.1 and new), the propagation is done by the libraries (HttpClient, AspNetCore) themselves, and not by this SDK. In older versions, autocollectors in this repo takes care of propagation. Propagation is done with the support of Activity class, whose Activity.Id field corresponds to the W3CTraceContext header "traceparent".

Examples:
For HttpClient in .NET Core 3.1 and newer, the HttpClient library itself adds traceparent header to outbound calls, with
Activity.Current.Id as the value.
For HttpClient in .NET Core 2.1 the DependencyCollectionTelemetryModule adds traceparent header to outbound calls, with Activity.Current.Id as the value.

For Asp.Net Core 3.1 and newer, the Asp.Net Core framework itself extracts traceparent from incoming request headers, and
starts an Activity with incoming traceparent as its parent.
For Asp.Net Core 2.1, the RequestCollectionModule extracts traceparent from incoming request headers, and starts an Activity with incoming traceparent as its parent.

Current issue with context propagation and Sampling
Activity.Id (traceparent) is composed of "version-traceid-spanid-traceflags", with the traceflags currently supporting a single flag "Sampled". Depending on the sampling strategy, this flag may/may not be used. For example:, if an application makes "delayed" sampling decision, then it may not propagate the traceflags which indicates its sampling decision. For an application which receives an incoming request with traceflags, it can chose to ignore it for variety of reasons, like security considerations, diff. in load etc.

Application Insights follow a "tail-based" or "delayed" sampling model, where sampling decision is made after collecting the entire telemetry item. For example, a RequestTelemetry item is created at first, its fields are populated by the auto collector module, and all TelemetryInitializers are run which can further populate/entirch the telemetry. The sampler is run after these, as a TelemetryProcessor.
Because of its delayed sampling model, Application Insights does not respect the incoming trace flags, and does not modify traceflags when propagating to next node.

This causes issues when interoping with other systems, which follow a different algorithm. Due to lack of the support of sampling traceflags, its not possible to coordinate sampling decisions between 2 nodes operating different sampling algorithms

Proposal

Switch ApplicationInsights to head-based sampling. This shift can allow ApplicationInsights to respect the incoming traceflags for its sampling decision, and also allow it to propagate its sampling decision to next node.

However, this is a major behavior change in ApplicationInsights, and can be done only with a major version bump to 3.0

Alternate proposal

Modify Sampled flag, based on the "expected" probability of the telemetry being sampled in.
This involves calculating the samplingscore and comparing it with current_sampling_rate, and adjusting Sampling flag accordingly before the request is sent.
For example:
While Http call is being made, the Activity's TraceId is used to calculate SamplingScore, and compare it with SamplingPercentage at that time. If the score is < samplingpercentage, modify SamplingFlags to Sampled. Propagate traceparent as before.
It is possible that, the actual sampling decision may be to drop this item. This can occur if too many other telemetry items were produced/samplingpercentage got changed in between. As per W3CTrace-Context spec, this is okay.

This is not a breaking change, so can be included in 2.X version itself.

It involves the following:

From dependency collector, obtain the current Activity's TraceId, and run SamplingScore generator and compare the current sampling percentage for that telemetry type. (Sampler can have different sampling percentage for Request and Dependency)
Make the sampling decision, and set SamplingFlags accordingly in the current Activity.
Store the calculated samplingscore somewhere so that it can be re-used when the actual Sampler runs.
Modify the Samplers to look for pre-calculated sampling score and use it, if exists. Else proceed as before.
Confirm that the modified Activity.Id is the propagated to the next hop for the following:
HttpClient (.NET Core) - Yes
HttpClient (.NET Fw) - Yes
Azure clients - To be investigated.

The text was updated successfully, but these errors were encountered:

zakimaksyutov · 2021-01-27T23:07:44Z

Looks good to me. Alternative proposal is backward compatible and addresses the issue.

vishweshbankwar · 2021-11-19T22:56:17Z

Updating traceflags on activity will not fix this issue entirely as we rely on HttpClient and Azure Clients to propagate traceparent header. In both the libraries traceparent is set using Activity.Id which does not reflect the updated traceflags on activity.
Reference Issue: dotnet/runtime#61857.

cijothomas · 2021-11-19T23:15:46Z

Updating traceflags on activity will not fix this issue entirely as we rely on HttpClient and Azure Clients to propagate traceparent header. In both the libraries traceparent is set using Activity.Id which does not reflect the updated traceflags on activity. Reference Issue: dotnet/runtime#61857.

Few workarounds:

Instead of relying on HttpClient, propagate traceparent ourselves, by re-constructing traceparent by hand, instead of relying on activity.Id, reflecting modified TraceFlags.
This may not be possible for Azure Clients.
Do no access Activity.Id. (Might break if someone else writes a listener, and accesses it..)
[Hack] Use reflection and force Activity.Id to recalculate .Id, after setting TraceFlags ourselves.

cijothomas · 2021-11-22T22:15:50Z

At the root of the issue is the fact that we are trying to modify TraceFlags for an activity that has already started. This is only an issue for "legacy" instrumentations which creates activity the legacy way (i.e new Activity().Start())., and not an issue for ActivitySource/OpenTelemetry way, as the sampler's are invoked before the activity gets created, and hence TraceFlags are correctly set.

Given the fact that this issue is only affecting legacy instrumentations:

For HttpClient, -- we should be able to modify TraceFlags and have HttpClient propage correct traceparent, as no code in HttpClient itself is accessing Activity.Id before we get a chance to update TraceFlags.
For Azure Clients, have to see if they access .Id before application insights has a chance to modify traceflags. Since Azure SDKs are already in the path to ActivitySource migration (https://devblogs.microsoft.com/azure-sdk/introducing-experimental-opentelemetry-support-in-the-azure-sdk-for-net/), this should not be a long-term concern.

vishweshbankwar · 2022-04-25T19:18:17Z

Adding initial analysis (needs to be validated).

Two possible approaches:

Both the approaches need accessing TelemetryConfiguration object to get TelemetryProcessorChain

ASP.NET Core (HttpClient): configuration field inside https://github.com/microsoft/ApplicationInsights-dotnet/blob/main/WEB/Src/DependencyCollector/DependencyCollector/HttpCoreDiagnosticSourceListener.cs
ASP.NET (NETFramework): Can be accessed via TelemetryClient.TelemetryConfiguration inside

ApplicationInsights-dotnet/WEB/Src/DependencyCollector/DependencyCollector/Implementation/ClientServerDependencyTracker.cs

Line 26 in 1e945be

{
Azure Clients: Can be accessed via base.TelemetryClient.TelemetryConfiguration inside https://github.com/microsoft/ApplicationInsights-dotnet/blob/main/WEB/Src/DependencyCollector/DependencyCollector/Implementation/AzureSdk/AzureSdkDiagnosticsEventHandler.cs

Approach 1:
Find the sampling processor which will be applied on Dependency telemetry.

From the TelemetryProcessorChain get all processors.
Find the sampling processors (type is SamplingTelemetryProcessor or AdaptiveSamplingTelemetryProcessor)
Find the first sampling processor which will be applied to the telemetry.

a. If the first sampling processor is of SamplingTelemetryProcessor which includes Dependency Telemetry

Get the SamplingPercentage to calculate score and set TraceFlags.
As this is a case of fixed rate sampling, the TraceFlags will be accurate(see limitation).

b. If the first sampling processor is of type AdaptiveSamplingTelemetryProcessor which includes Dependency telemetry.

Get the SamplingPercentage from SamplingTelemetryProcessor of that AdaptiveSamplingTelemetryProcessor.
At present, there is no direct way to get this as SamplingTelemetryProcessor property on AdaptiveSamplingTelemetryProcessor is
internal

Limitations:

If the telemetry percentage is set in telemetryinitializer then,
there is no way to know that percentage as it depends on user implementation.
In this case the trace flags will be set using incorrect percentage.

Approach 2:
Use the ExperimentalFeature

For each UseAdaptiveSampling overload check if Dependency telemetry is applicable. e.g. There can be multiple samplers:
builder.UseAdaptiveSampling(maxTelemetryItemsPerSecond:5, "Dependency", "Request"); // Dependency is excluded
builder.UseAdaptiveSampling(maxTelemetryItemsPerSecond:9, "Request", "Dependency"); // Request is excluded
When we find the first sampler which includes Dependency add a Sampling CallBack to AdaptiveSamplingTelemetryProcessor:
a. This involves changing implementation for each overload of UseAdaptiveSampling and adding sampling callback even if one was not provided by user.
b. On getting a callback update LastObservedSamplingPercentage for Dependency telemetry.
c. Use this to evaluate score and set traceflags.
We still need to look up the processors in TelemetryProcessorChain as it may contain FixedRate sampler before AdaptiveSamplers.

Limitations:

This will not work if user has provided their own sampling callback while adding UseAdaptiveSampling().
Same limitation as approach 1 where percentage is changed using telemetry initializer.

github-actions · 2023-02-20T00:04:39Z

This issue is stale because it has been open 300 days with no activity. Remove stale label or this will be closed in 7 days. Commenting will instruct the bot to automatically remove the label.

cijothomas added enhancement distributed-tracing labels Jan 22, 2021

cijothomas assigned vishweshbankwar May 27, 2021

github-actions bot added the stale label Feb 20, 2023

github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Feb 27, 2023

cijothomas mentioned this issue May 4, 2023

How to make activities created by ASP.NET Core have the property Recorded equal to true with the Application Insights SDK #2775

Closed

cijothomas mentioned this issue Dec 8, 2023

TraceParent Flag -00 (no sample) #2835

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Propagating SamplingFlags as per W3CTraceContext guidelines. #2156

Propagating SamplingFlags as per W3CTraceContext guidelines. #2156

cijothomas commented Jan 22, 2021

zakimaksyutov commented Jan 27, 2021

vishweshbankwar commented Nov 19, 2021 •

edited

Loading

cijothomas commented Nov 19, 2021

cijothomas commented Nov 22, 2021

vishweshbankwar commented Apr 25, 2022 •

edited

Loading

github-actions bot commented Feb 20, 2023

Propagating SamplingFlags as per W3CTraceContext guidelines. #2156

Propagating SamplingFlags as per W3CTraceContext guidelines. #2156

Comments

cijothomas commented Jan 22, 2021

zakimaksyutov commented Jan 27, 2021

vishweshbankwar commented Nov 19, 2021 • edited Loading

cijothomas commented Nov 19, 2021

cijothomas commented Nov 22, 2021

vishweshbankwar commented Apr 25, 2022 • edited Loading

github-actions bot commented Feb 20, 2023

vishweshbankwar commented Nov 19, 2021 •

edited

Loading

vishweshbankwar commented Apr 25, 2022 •

edited

Loading