Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consider releasing #107333 into dotnet 9 #109060

Closed
djluck opened this issue Oct 20, 2024 · 38 comments
Closed

Consider releasing #107333 into dotnet 9 #109060

djluck opened this issue Oct 20, 2024 · 38 comments

Comments

@djluck
Copy link

djluck commented Oct 20, 2024

We've been observing high lock contention in ActivitySource when it comes to publishing activity start/ stop notifications. #107333 fixes this issue but I'm unable to comment in this PR (it was locked).

@tarekgh in that MR you outlined the higher threshold for merging this change into v9, citing impact. In our case we are seeing 10-20% of CPU time being contented in some of our high throughput services, contributing to severe latency spikes. Our hardware can have up to 192 cores which means we're far more likely to experience contention around a single shared mutex.

Hopefully this is data point is enough to consider merging this change into v9.

@dotnet-issue-labeler dotnet-issue-labeler bot added the needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners label Oct 20, 2024
@dotnet-policy-service dotnet-policy-service bot added the untriaged New issue has not been triaged by the area owner label Oct 20, 2024
@vcsjones vcsjones added area-System.Diagnostics.Activity and removed needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners labels Oct 20, 2024
Copy link
Contributor

Tagging subscribers to this area: @dotnet/area-system-diagnostics-activity
See info in area-owners.md if you want to be subscribed.

@tarekgh
Copy link
Member

tarekgh commented Oct 21, 2024

@djluck the bar is really very high now. I'll check if this can be justified at the current stage of the release. Or maybe we can look having it in the service release.

Could you please help answering the following questions:

You said In our case we are seeing 10-20% of CPU time being contented in some of our high throughput services,, did you check if the issue we have fixed is the root cause of that or could be other things contributing to that? how did you validate that? Why did you start seeing that now? Is there a way you test your scenario with .NET 10 https://github.com/dotnet/sdk/blob/main/documentation/package-table.md and validate the fix is really helping.

@tarekgh
Copy link
Member

tarekgh commented Oct 21, 2024

CC @noahfalk @ericstj

@djluck
Copy link
Author

djluck commented Oct 21, 2024

Thanks for the quick reply! In response:

did you check if the issue we have fixed is the root cause of that or could be other things contributing to that?

We took traces of our application and saw excessive CPU due to lock contention in the exact function this PR addresses. I see no other mutex use within the start/ stop code path and so I believe it can only be the listener invocations. A sample trace we captured:

How did you validate that?

See this trace:
Image

Why did you start seeing that now?

We're relatively new to tracing within our organization and have been increasing adoption within our large scale service over the last few months.

Is there a way you test your scenario with .NET 10 https://github.com/dotnet/sdk/blob/main/documentation/package-table.md and validate the fix is really helping.

Is there a docker image available currently? One way I've been able to verify the lack of mutex impact is to create multiple ActivitySource's and distribute Start() calls across them- this has eliminated the contention we are seeing:

public class ActivitySources
    {
        private int _next = 0; 
        private ImmutableArray<ActivitySource> _sources;


        public ActivitySources(ActivitySpanSubsystemConfig config)
        {
            _sources = Enumerable.Range(1, 2048).Select(n => new ActivitySource(config.ActivitySourceName + n))
                .ToImmutableArray();
        }

        public ActivitySource Get()
        {
            return _sources[Interlocked.Increment(ref _next) % _sources.Length];

        }
    }

@tarekgh
Copy link
Member

tarekgh commented Oct 21, 2024

@djluck thanks for the info.

Is there a docker image available currently?

Could you try please reference the System.Diagnostics.DiagnosticSource package version 10.0.0-alpha.1.24521.2 from the NuGet feed https://pkgs.dev.azure.com/dnceng/public/_packaging/dotnet10/nuget/v3/index.json.
https://dev.azure.com/dnceng/public/_artifacts/feed/dotnet10/NuGet/System.Diagnostics.DiagnosticSource/overview/10.0.0-alpha.1.24521.2

I'll wait your validation then I'll try to check if we can get the approval for 9.0.

@tarekgh tarekgh added this to the 9.0.0 milestone Oct 21, 2024
@tarekgh tarekgh removed the untriaged New issue has not been triaged by the area owner label Oct 21, 2024
@djluck
Copy link
Author

djluck commented Oct 22, 2024

Thanks for giving me the pre release package feed, I've built a test version of our application with it and will test it tomorrow.

@djluck
Copy link
Author

djluck commented Oct 23, 2024

Unfortunately the testing has hit a wall: we're seeing MissingMethodExceptions:

 Connection id "0HN7ISM12NH0N", Request id "0HN7ISM12NH0N:00000009": An unhandled exception was thrown by the application.
ERROR:One or more errors occurred. (Method not found: 'Void System.Diagnostics.ActivityCreationOptions`1.set_TraceState(System.String)'.) || {"ExceptionStr":"System.AggregateException: One or more errors occurred. (Method not found: 'Void System.Diagnostics.ActivityCreationOptions`1.set_TraceState(System.String)'.)\n ---> System.MissingMethodException: Method not found: 'Void System.Diagnostics.ActivityCreationOptions`1.set_TraceState(System.String)'.\n   at OpenTelemetry.Trace.TracerProviderSdk.ComputeActivitySamplingResult(ActivityCreationOptions`1& options, Sampler sampler)\n   at System.Diagnostics.ActivitySource.<>c.<CreateActivity>b__23_1(ActivityListener listener, ActivityCreationOptions`1& data, ActivitySamplingResult& result, ActivityCreationOptions`1& unused)\n   at System.Diagnostics.ActivitySource.CreateActivity(String name, ActivityKind kind, ActivityContext context, String parentId, IEnumerable`1 tags, IEnumerable`1 links, DateTimeOffset startTime, Boolean startIt, ActivityIdFormat idFormat)\n   at TTD.Common.TelemetryLib.ActivitySpans.ActivitySpan.Start(IEnumerable`1 tags, ActivityContext parentContext, IEnumerable`1 links, String callerName, String callerFilePath, Int32 callerLineNum)

I'm not 100% why we're getting this:

It's possible it might be related to the fact we're targeting net 6.0 for our application? I haven't decompiled the netstandard2 versions of all the DLLs but there might be some differences here?

@tarekgh
Copy link
Member

tarekgh commented Oct 23, 2024

This is very weird, and I have no idea how this happened. I downloaded the NET10 package 10.0.0-alpha.1.24521.2 and opened the ns2.0 library and I am seeing the init of the ActivityCreationOptions.TraceState is missing. This shouldn't happen. I built the package locally and opened the ns2.0 library and I am seeing the init is there.

Could you please try the recent version of the package from https://dev.azure.com/dnceng/public/_artifacts/feed/dotnet10/NuGet/System.Diagnostics.DiagnosticSource/overview/10.0.0-alpha.1.24522.5? I confirmed this version doesn't have this problem.

@djluck
Copy link
Author

djluck commented Oct 23, 2024

Thanks for digging in, I'll give that package a try 👍

@djluck
Copy link
Author

djluck commented Oct 23, 2024

Still seeing this fail with the same exception. Can reproduce in our tests. Decompiled the package and I can see the method in there and additional reflected on the type with this code to see the method:

var t = typeof(ActivitySource).Assembly.DefinedTypes.Where(x => x.Name.Contains("Creation")).Single().MakeGenericType(typeof(Activity));
// values were {System.String get_TraceState(), Void set_TraceState(System.String), Void SetTraceState(System.String)}
var methods = t.GetMethods(BindingFlags.Instance | BindingFlags.Public | BindingFlags.NonPublic).Where(x => x.Name.Contains("TraceState"))

@tarekgh
Copy link
Member

tarekgh commented Oct 23, 2024

I confirm the init method is there for sure. Is it possible you can create a small repro project I can look at?

@djluck
Copy link
Author

djluck commented Oct 23, 2024

Sure: repro.zip

@djluck
Copy link
Author

djluck commented Oct 23, 2024

I confirmed that when I target net8.0, the issue disappears.

@tarekgh
Copy link
Member

tarekgh commented Oct 23, 2024

I confirmed that when I target net8.0, the issue disappears.

Thanks for trying that and sorry for the inconvenience. Will you be able to test the perf with that?

@djluck
Copy link
Author

djluck commented Oct 23, 2024

Unfortunately not, we currently still target 6.0 for the majority of our applications. Upgrading to 8.0 will be a time consuming process that will involve many weeks (if not months) of testing.

@tarekgh
Copy link
Member

tarekgh commented Oct 23, 2024

Ok, I'll look at your shared repro and try to find out what the problem is.

In your project, you are targeting 6.0 but not referencing the specific version of System.Diagnostics.DiagnosticSource package. right? I mean your app is just using the 6.0 inbox version.

@djluck
Copy link
Author

djluck commented Oct 24, 2024

I am referencing the specific version of System.Diagnostics.DiagnosticSource. This is the project file contents:

<Project Sdk="Microsoft.NET.Sdk">

    <PropertyGroup>
        <OutputType>Exe</OutputType>
        <TargetFramework>net6.0</TargetFramework>
        <ImplicitUsings>enable</ImplicitUsings>
        <Nullable>enable</Nullable>
    </PropertyGroup>

    <ItemGroup>
      <PackageReference Include="OpenTelemetry.Exporter.Console" Version="1.9.0" />
      <PackageReference Include="System.Diagnostics.DiagnosticSource" Version="10.0.0-alpha.1.24521.2" />
    </ItemGroup>

</Project>

@tarekgh
Copy link
Member

tarekgh commented Oct 24, 2024

So far, I am seeing the problem is in the ns2.0 lib. I am trying to dig more to understand why that is. The diagnostic source package in net9 and net10 removed net6 target. which forcing using ns2.0 when targeting net6 and there the issue will show up.

@tarekgh
Copy link
Member

tarekgh commented Oct 24, 2024

@ericstj helped explain the problem we are seeing. In short, the property ActivityCreationOptions.TraceState is using init which depends on another internal type System.Runtime.CompilerServices.IsExternalInit which is type forwarded in the diagnostic source library and that is what is causing the problem when using the ns2.0 library on net6.0.

I'll try to build you a private package based on net8.0 which still keeps net6 targeting. I'll share that with you when I have it ready. Thanks!

@tarekgh
Copy link
Member

tarekgh commented Oct 24, 2024

System.Diagnostics.DiagnosticSource.8.0.1.nupkg.zip

@djluck store the attached package to some local folder then rename it (remove the .zip part from the name). then clear your .nuget package cache just in case to force using the new one. Then edit your nuget.config to point at local folder. After restoring, please ensure the right package/library is the one restored. I already confirmed the package has net6.0 library and contains the fix.

Aside from that, I have a question for you. If your app is targeting net 6.0 and you are requesting the fix to be included in net 9.0. Are you planning to migrate to 9.0 when it is released?

@djluck
Copy link
Author

djluck commented Oct 24, 2024

Thanks, I will give that package a test.

Aside from that, I have a question for you. If your app is targeting net 6.0 and you are requesting the fix to be included in net 9.0. Are you planning to migrate to 9.0 when it is released?

My understanding is that System.Diagnostics.DiagnosticSource would be published on nuget at version 9.*, allowing us to reference this and avoid updating to v9 of the runtime. Is this correct?

@tarekgh
Copy link
Member

tarekgh commented Oct 25, 2024

My understanding is that System.Diagnostics.DiagnosticSource would be published on nuget at version 9.*, allowing us to reference this and avoid updating to v9 of the runtime. Is this correct?

The net9.0 package is not going to have net6.0 target. This means you'll end up with the same problem when falling back to ns2.0 one.

@djluck
Copy link
Author

djluck commented Oct 25, 2024

Right, thanks for making that clear. But isn't this a violation of netstandard2 support? As System.Diagnostics.DiagnosticSource targets netstandard2 and net6.0 supports this, this seems like an error on the side of the package.

@tarekgh
Copy link
Member

tarekgh commented Oct 25, 2024

But isn't this a violation of netstandard2 support?

No, net6.0 will be out of support by Nov 12th (the time of 9.0 release). This is why we removed the net6 assets from the package. This is why you will get a warning when using such package while targeting net6.

@djluck
Copy link
Author

djluck commented Oct 25, 2024

I can't say I fully understand why this doesn't work with the netstandard2 targeted DLLs given .NET 6.0 supports netstandard2. But if this is by design, let me chase up people to try and understand if/ when we're upgrading to .NET 8.

@tarekgh
Copy link
Member

tarekgh commented Oct 25, 2024

I can't say I fully understand why this doesn't work with the netstandard2 targeted DLLs given .NET 6.0 supports netstandard2.

Using init compiler feature requires the type IsExternalInit be defined. This type is introduced in .NET 5.0. System.Diagnostics.DiagnosticSource library when targeting .NET (like net6.0 and up), it just does type forwarding this type as it is defined in the System.Runtime of the runtime. If targeting downlevel platforms e.g. netfx and netstandard, the library defines IsExternalInit as internal type inside the library. This works fine if having net6 assets inside the library which will do type forwarding.
If using netstandard version on net6.0 will cause to have the same type defined in two places and that cause the conflict problem.

if this is by design

Yes, as mentioned before, net6.0 is not going to be supported by the release of net9. When referencing this library package in net6, you'll get a compilation warning that this is not supported any more in net6. You have the option either to stick with the older version of the package and continue having the same behavior as before. Or you choose to upgrade to recent .NET versions and take advantage of all the new features and fixes in the newer library package.

Let me know if there is anything unclear that I can help clarifying it.

@tarekgh
Copy link
Member

tarekgh commented Oct 25, 2024

By the way, I appreciate if you continue trying the private package https://github.com/user-attachments/files/17512716/System.Diagnostics.DiagnosticSource.8.0.1.nupkg.zip in your scenario and report if it is fixing the issue you have.

@djluck
Copy link
Author

djluck commented Oct 28, 2024

Hey @tarekgh, I tested the package successfully- we saw contention eliminated around ActivitySource calls.

When running a trace on another machine not running the fix, we see the high levels of contention around the SynchronizedList class:
Image

When tracing a machine running the new System.Diagnostics.DiagnosticSource package, we see that contention is basically eliminated:
Image

@tarekgh tarekgh modified the milestones: 9.0.0, 9.0.1, 9.0.x Oct 28, 2024
@tarekgh
Copy link
Member

tarekgh commented Oct 28, 2024

@djluck thanks a lot for testing it out and sending the number. I'll look at and request porting the fix to the net9 first servicing release after the official release. From your side, please let us know if you can upgrade your .NET targeting and your time frame. I am asking because I am interested in helping you and resolving your issue.

@djluck
Copy link
Author

djluck commented Oct 28, 2024

Thanks @tarekgh, appreciate all the timely help you've offered so far. I found out that we are already on net8.0 for one of our largest services and we're planning to migrate the rest Q1 2025. When would the 1st servicing release be due?

@tarekgh
Copy link
Member

tarekgh commented Oct 28, 2024

Good to learn you are migrating to net8.

When would the 1st servicing release be due?

The exact timing is not decoded yet. My guess that it will be around Jan 2025.

@djluck
Copy link
Author

djluck commented Oct 29, 2024

Thanks @tarekgh. Just to make sure I understand:

  • This fix may make it in to the net9 servicing release
  • If that happens, it may happen around Q1 2025

@tarekgh
Copy link
Member

tarekgh commented Oct 29, 2024

@djluck yes, your understanding is correct.

@tarekgh
Copy link
Member

tarekgh commented Nov 5, 2024

@djluck the fix is now merged to the 9.0 servicing branch. Let me know if you have any questions, I can help with them.

@djluck
Copy link
Author

djluck commented Nov 5, 2024

Thanks for the update @tarekgh! We'll keep an eye out for the servicing release.

@djluck djluck closed this as completed Nov 5, 2024
@djluck
Copy link
Author

djluck commented Nov 5, 2024

Thanks again for all your help getting to this point, really appreciate it.

@djluck
Copy link
Author

djluck commented Nov 6, 2024

@tarekgh just to confirm, this isn't going to be part of the 9.0 RC that's currently available on nuget? This will only be released as part of the servicing release?

@tarekgh
Copy link
Member

tarekgh commented Nov 7, 2024

You are correct. This fix will not show up till the first 9.0 servicing release. I'll expect the package version will be like 9.0.1 at that time.

@github-actions github-actions bot locked and limited conversation to collaborators Dec 7, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

3 participants