Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Propagating SamplingFlags as per W3CTraceContext guidelines. #2156

Closed
cijothomas opened this issue Jan 22, 2021 · 6 comments
Closed

Propagating SamplingFlags as per W3CTraceContext guidelines. #2156

cijothomas opened this issue Jan 22, 2021 · 6 comments

Comments

@cijothomas
Copy link
Contributor

Background:
This SDK has adopted W3CTraceContext recommendation as the default for distributed context propagation. In recent version of .NET Core (3.1 and new), the propagation is done by the libraries (HttpClient, AspNetCore) themselves, and not by this SDK. In older versions, autocollectors in this repo takes care of propagation. Propagation is done with the support of Activity class, whose Activity.Id field corresponds to the W3CTraceContext header "traceparent".

Examples:
For HttpClient in .NET Core 3.1 and newer, the HttpClient library itself adds traceparent header to outbound calls, with
Activity.Current.Id as the value.
For HttpClient in .NET Core 2.1 the DependencyCollectionTelemetryModule adds traceparent header to outbound calls, with Activity.Current.Id as the value.

For Asp.Net Core 3.1 and newer, the Asp.Net Core framework itself extracts traceparent from incoming request headers, and
starts an Activity with incoming traceparent as its parent.
For Asp.Net Core 2.1, the RequestCollectionModule extracts traceparent from incoming request headers, and starts an Activity with incoming traceparent as its parent.

Current issue with context propagation and Sampling
Activity.Id (traceparent) is composed of "version-traceid-spanid-traceflags", with the traceflags currently supporting a single flag "Sampled". Depending on the sampling strategy, this flag may/may not be used. For example:, if an application makes "delayed" sampling decision, then it may not propagate the traceflags which indicates its sampling decision. For an application which receives an incoming request with traceflags, it can chose to ignore it for variety of reasons, like security considerations, diff. in load etc.

Application Insights follow a "tail-based" or "delayed" sampling model, where sampling decision is made after collecting the entire telemetry item. For example, a RequestTelemetry item is created at first, its fields are populated by the auto collector module, and all TelemetryInitializers are run which can further populate/entirch the telemetry. The sampler is run after these, as a TelemetryProcessor.
Because of its delayed sampling model, Application Insights does not respect the incoming trace flags, and does not modify traceflags when propagating to next node.

This causes issues when interoping with other systems, which follow a different algorithm. Due to lack of the support of sampling traceflags, its not possible to coordinate sampling decisions between 2 nodes operating different sampling algorithms

Proposal

Switch ApplicationInsights to head-based sampling. This shift can allow ApplicationInsights to respect the incoming traceflags for its sampling decision, and also allow it to propagate its sampling decision to next node.

However, this is a major behavior change in ApplicationInsights, and can be done only with a major version bump to 3.0

Alternate proposal

Modify Sampled flag, based on the "expected" probability of the telemetry being sampled in.
This involves calculating the samplingscore and comparing it with current_sampling_rate, and adjusting Sampling flag accordingly before the request is sent.
For example:
While Http call is being made, the Activity's TraceId is used to calculate SamplingScore, and compare it with SamplingPercentage at that time. If the score is < samplingpercentage, modify SamplingFlags to Sampled. Propagate traceparent as before.
It is possible that, the actual sampling decision may be to drop this item. This can occur if too many other telemetry items were produced/samplingpercentage got changed in between. As per W3CTrace-Context spec, this is okay.

This is not a breaking change, so can be included in 2.X version itself.

It involves the following:

  1. From dependency collector, obtain the current Activity's TraceId, and run SamplingScore generator and compare the current sampling percentage for that telemetry type. (Sampler can have different sampling percentage for Request and Dependency)
  2. Make the sampling decision, and set SamplingFlags accordingly in the current Activity.
  3. Store the calculated samplingscore somewhere so that it can be re-used when the actual Sampler runs.
  4. Modify the Samplers to look for pre-calculated sampling score and use it, if exists. Else proceed as before.
  5. Confirm that the modified Activity.Id is the propagated to the next hop for the following:
    HttpClient (.NET Core) - Yes
    HttpClient (.NET Fw) - Yes
    Azure clients - To be investigated.
@zakimaksyutov
Copy link
Member

Looks good to me. Alternative proposal is backward compatible and addresses the issue.

@vishweshbankwar
Copy link

vishweshbankwar commented Nov 19, 2021

Updating traceflags on activity will not fix this issue entirely as we rely on HttpClient and Azure Clients to propagate traceparent header. In both the libraries traceparent is set using Activity.Id which does not reflect the updated traceflags on activity.
Reference Issue: dotnet/runtime#61857.

@cijothomas
Copy link
Contributor Author

Updating traceflags on activity will not fix this issue entirely as we rely on HttpClient and Azure Clients to propagate traceparent header. In both the libraries traceparent is set using Activity.Id which does not reflect the updated traceflags on activity. Reference Issue: dotnet/runtime#61857.

Few workarounds:

  1. Instead of relying on HttpClient, propagate traceparent ourselves, by re-constructing traceparent by hand, instead of relying on activity.Id, reflecting modified TraceFlags.
    This may not be possible for Azure Clients.
  2. Do no access Activity.Id. (Might break if someone else writes a listener, and accesses it..)
  3. [Hack] Use reflection and force Activity.Id to recalculate .Id, after setting TraceFlags ourselves.

@cijothomas
Copy link
Contributor Author

At the root of the issue is the fact that we are trying to modify TraceFlags for an activity that has already started. This is only an issue for "legacy" instrumentations which creates activity the legacy way (i.e new Activity().Start())., and not an issue for ActivitySource/OpenTelemetry way, as the sampler's are invoked before the activity gets created, and hence TraceFlags are correctly set.

Given the fact that this issue is only affecting legacy instrumentations:

  1. For HttpClient, -- we should be able to modify TraceFlags and have HttpClient propage correct traceparent, as no code in HttpClient itself is accessing Activity.Id before we get a chance to update TraceFlags.

  2. For Azure Clients, have to see if they access .Id before application insights has a chance to modify traceflags. Since Azure SDKs are already in the path to ActivitySource migration (https://devblogs.microsoft.com/azure-sdk/introducing-experimental-opentelemetry-support-in-the-azure-sdk-for-net/), this should not be a long-term concern.

@vishweshbankwar
Copy link

vishweshbankwar commented Apr 25, 2022

Adding initial analysis (needs to be validated).

Two possible approaches:

Both the approaches need accessing TelemetryConfiguration object to get TelemetryProcessorChain

  1. ASP.NET Core (HttpClient): configuration field inside https://github.com/microsoft/ApplicationInsights-dotnet/blob/main/WEB/Src/DependencyCollector/DependencyCollector/HttpCoreDiagnosticSourceListener.cs
  2. ASP.NET (NETFramework): Can be accessed via TelemetryClient.TelemetryConfiguration inside
  3. Azure Clients: Can be accessed via base.TelemetryClient.TelemetryConfiguration inside https://github.com/microsoft/ApplicationInsights-dotnet/blob/main/WEB/Src/DependencyCollector/DependencyCollector/Implementation/AzureSdk/AzureSdkDiagnosticsEventHandler.cs

Approach 1:
Find the sampling processor which will be applied on Dependency telemetry.

  1. From the TelemetryProcessorChain get all processors.
  2. Find the sampling processors (type is SamplingTelemetryProcessor or AdaptiveSamplingTelemetryProcessor)
  3. Find the first sampling processor which will be applied to the telemetry.

a. If the first sampling processor is of SamplingTelemetryProcessor which includes Dependency Telemetry

  • Get the SamplingPercentage to calculate score and set TraceFlags.
  • As this is a case of fixed rate sampling, the TraceFlags will be accurate(see limitation).

b. If the first sampling processor is of type AdaptiveSamplingTelemetryProcessor which includes Dependency telemetry.

  • Get the SamplingPercentage from SamplingTelemetryProcessor of that AdaptiveSamplingTelemetryProcessor.
  • At present, there is no direct way to get this as SamplingTelemetryProcessor property on AdaptiveSamplingTelemetryProcessor is
    internal

Limitations:

  1. If the telemetry percentage is set in telemetryinitializer then,
    there is no way to know that percentage as it depends on user implementation.
    In this case the trace flags will be set using incorrect percentage.

Approach 2:
Use the ExperimentalFeature

  1. For each UseAdaptiveSampling overload check if Dependency telemetry is applicable. e.g. There can be multiple samplers:
    builder.UseAdaptiveSampling(maxTelemetryItemsPerSecond:5, "Dependency", "Request"); // Dependency is excluded
    builder.UseAdaptiveSampling(maxTelemetryItemsPerSecond:9, "Request", "Dependency"); // Request is excluded

  2. When we find the first sampler which includes Dependency add a Sampling CallBack to AdaptiveSamplingTelemetryProcessor:
    a. This involves changing implementation for each overload of UseAdaptiveSampling and adding sampling callback even if one was not provided by user.
    b. On getting a callback update LastObservedSamplingPercentage for Dependency telemetry.
    c. Use this to evaluate score and set traceflags.

  3. We still need to look up the processors in TelemetryProcessorChain as it may contain FixedRate sampler before AdaptiveSamplers.

Limitations:

  1. This will not work if user has provided their own sampling callback while adding UseAdaptiveSampling().
  2. Same limitation as approach 1 where percentage is changed using telemetry initializer.

@github-actions
Copy link

This issue is stale because it has been open 300 days with no activity. Remove stale label or this will be closed in 7 days. Commenting will instruct the bot to automatically remove the label.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants