Remove thread contention from Activity Start/Stop #107333

Conversation

AlgorithmsAreCool
Contributor

@AlgorithmsAreCool AlgorithmsAreCool commented Sep 4, 2024

Fixes #107158

This change tries to resolve a performance issue related to thread contention when starting and stopping Activities. The new approach is a lock-free, copy-on-write array design which eliminates all thread contention and improves the throughput of Activity.Start/Stop (based on local testing).

I would like to add some basic tests for the SynchronizedList<T> class, but I'm unsure what the policy/procedure is for testing internal classes.

I would also like to make sure that this PR has the desired effect of improving request throughput in key benchmarks when telemetry is enabled.

The problem

The existing code was causing lock contention between request threads that were starting and stopping activities. The approach was to acquire and release a lock for each item in the collection, so if there were 10 listeners, a request thread had to go through 10 lock acquire/release cycles per notification, each of which could contend with other request threads. A sketch of that shape is shown below.
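For context, here is a minimal sketch of the "lock per item" pattern described above. It is illustrative only, with hypothetical names, not the actual dotnet/runtime code.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

// Illustrative sketch: every notification re-acquires the same lock once per
// listener, so concurrent request threads contend even when the listener list
// never changes.
internal sealed class LockPerItemBroadcaster
{
    private readonly object _lock = new object();
    private readonly List<ActivityListener> _listeners = new();

    public void Add(ActivityListener listener)
    {
        lock (_lock) { _listeners.Add(listener); }
    }

    // Called on every Activity start: N lock acquisitions per notification.
    public void NotifyStart(Activity activity)
    {
        for (int i = 0; ; i++)
        {
            ActivityListener listener;
            lock (_lock)
            {
                if (i >= _listeners.Count)
                    break;                 // exits the loop; the lock is released
                listener = _listeners[i];
            }
            listener.ActivityStarted?.Invoke(activity);  // callback runs outside the lock
        }
    }
}
```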

Proposed solution

Use "immutable" arrays of listeners that can be acquired by a request thread without locking. This should make the common/hot path (starting/stopping activities) faster when the system is under load. It should, however, also have a small negative impact on adding/removing listeners, which now becomes a O(N) operation instead of the O(~1) that it was before. However in practice this should have essentially no impact on application performance or allocations

Local Testing

My methodology was to start multiple threads (on a 16-vCore machine) and Start/Stop Activities in a hot loop (worst case); a rough sketch of the harness follows the list below.

  • 10 worker/request threads
  • 50,000 start/dispose activities per thread
  • 10 ActivityListeners subscribed up front and not mutated
  • 10 runs of the benchmark
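A rough reconstruction of that harness under the assumptions above; the source name, listener configuration, and structure are guesses, not the author's exact code:

```csharp
using System;
using System.Diagnostics;
using System.Threading;

internal static class ContentionBenchmark
{
    private const int ThreadCount = 10;
    private const int IterationsPerThread = 50_000;
    private const int ListenerCount = 10;

    private static void Main()
    {
        using var source = new ActivitySource("Benchmark.Source");

        // Subscribe the listeners up front; they are never mutated afterwards.
        var listeners = new ActivityListener[ListenerCount];
        for (int i = 0; i < ListenerCount; i++)
        {
            listeners[i] = new ActivityListener
            {
                ShouldListenTo = s => s.Name == "Benchmark.Source",
                Sample = (ref ActivityCreationOptions<ActivityContext> options) =>
                    ActivitySamplingResult.AllDataAndRecorded
            };
            ActivitySource.AddActivityListener(listeners[i]);
        }

        // Worker threads start/dispose activities in a hot loop (worst case).
        var workers = new Thread[ThreadCount];
        for (int t = 0; t < ThreadCount; t++)
        {
            workers[t] = new Thread(() =>
            {
                for (int i = 0; i < IterationsPerThread; i++)
                {
                    using Activity? activity = source.StartActivity("Work");
                }
            });
        }

        var sw = Stopwatch.StartNew();
        foreach (var worker in workers) worker.Start();
        foreach (var worker in workers) worker.Join();
        sw.Stop();

        Console.WriteLine($"Benchmark execution time {sw.Elapsed}");

        foreach (var listener in listeners)
            listener.Dispose();
    }
}
```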

Results Before Changes

Benchmark execution time 00:00:15.1899827


Results After Changes

Benchmark execution time 00:00:01.5148120


Microbenchmarks

  • With Prometheus metrics exporter
    • AddRuntimeInstrumentation
  • 1 synthetic/null tracing/activity processor
  • 1 synthetic activity listener to verify invocation counts (a sketch of such a listener follows the table below)
  • 50,000 activity start/stops per thread
| Thread Count | Methodology | Max Contentions | Min Contentions | Average Contentions | Average Execution Time (s) |
|---|---|---|---|---|---|
| 1 | Baseline | 0 | 0 | 0 | 0.04198 |
| 2 | Baseline | 1 | 0 | 0.125 | 0.05053 |
| 4 | Baseline | 9 | 18 | 3.406 | 0.081563 |
| 8 | Baseline | 2168 | 1472 | 1743.125 | 0.300675 |
| 16 | Baseline | 4436 | 3622 | 4005.375 | 0.6288 |
| 1 | New | 0 | 0 | 0 | 0.039686 |
| 2 | New | 0 | 0 | 0 | 0.041909 |
| 4 | New | 0 | 0 | 0 | 0.047901 |
| 8 | New | 0 | 0 | 0 | 0.063951 |
| 16 | New | 0 | 0 | 0 | 0.073664 |
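For illustration, a counting listener of the kind described above might look like this; the names are hypothetical and the author's actual harness is not shown:

```csharp
using System;
using System.Diagnostics;
using System.Threading;

// Hypothetical synthetic listener that counts start/stop invocations.
internal sealed class CountingListener : IDisposable
{
    private readonly ActivityListener _listener;
    private long _started;
    private long _stopped;

    public CountingListener(string sourceName)
    {
        _listener = new ActivityListener
        {
            ShouldListenTo = s => s.Name == sourceName,
            Sample = (ref ActivityCreationOptions<ActivityContext> options) =>
                ActivitySamplingResult.AllData,
            ActivityStarted = _ => Interlocked.Increment(ref _started),
            ActivityStopped = _ => Interlocked.Increment(ref _stopped)
        };
        ActivitySource.AddActivityListener(_listener);
    }

    public long Started => Interlocked.Read(ref _started);
    public long Stopped => Interlocked.Read(ref _stopped);

    public void Dispose() => _listener.Dispose();
}
```

In a setup like the harness sketched earlier, Started and Stopped would each be expected to equal the number of worker threads times the iterations per thread.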

Open Telemetry Stress test results

Before

opentelemetry-dotnet\test\OpenTelemetry.Tests.Stress.Traces on  main via .NET v8.0.100 🎯 $(TargetFrameworksForTests) took 1m17s
❯ dotnet run --framework net8.0 --configuration Release -- -d 30
Options: {"Concurrency":16,"PrometheusInternalMetricsPort":9464,"DurationSeconds":30}
Run OpenTelemetry.Tests.Stress.Traces.exe --help to see available options.
Running (concurrency = 16, internalPrometheusEndpoint = http://localhost:9464/metrics/), press <Esc> to stop, press <Spacebar> to toggle statistics in the console...
Loops: 41,878,519, Loops/Second: 1,490,815, CPU Cycles/Loop: 10,920, RemainingTime (Seconds): 0
Stopping the stress test...
* Total Running Time (Seconds) 30
* Total Loops: 42,160,488
* Average Loops/Second: 1,404,928
* Average CPU Cycles/Loop: 11,512
* GC Total Allocated Bytes: 20948591184

After

opentelemetry-dotnet\test\OpenTelemetry.Tests.Stress.Traces on  main via .NET v8.0.100 🎯 $(TargetFrameworksForTests)
❯ dotnet run --framework net8.0 --configuration Release -- -d 30
Options: {"Concurrency":16,"PrometheusInternalMetricsPort":9464,"DurationSeconds":30}
Run OpenTelemetry.Tests.Stress.Traces.exe --help to see available options.
Running (concurrency = 16, internalPrometheusEndpoint = http://localhost:9464/metrics/), press <Esc> to stop, press <Spacebar> to toggle statistics in the console...
Loops: 451,278,899, Loops/Second: 13,309,562, CPU Cycles/Loop: 1,687, RemainingTime (Seconds): 0
Stopping the stress test...
* Total Running Time (Seconds) 30
* Total Loops: 454,080,963
* Average Loops/Second: 15,048,749
* Average CPU Cycles/Loop: 1,665
* GC Total Allocated Bytes: 225228374680

@dotnet-policy-service dotnet-policy-service bot added the community-contribution label (Indicates that the PR has been added by a community member) Sep 4, 2024
@AlgorithmsAreCool
Contributor Author

@dotnet-policy-service agree

@En3Tho
Contributor

En3Tho commented Sep 4, 2024

I'm personally not sure the interlocked complexity is worth it over a dedicated mutation (Add/Remove) lock? The class would have one more field, though.

@ilmax

ilmax commented Sep 4, 2024

All those Array.Copy calls will give the GC some more work to do. I'm curious what the net effect of this change is on the affected benchmarks; do you have some numbers already?

@AlgorithmsAreCool
Contributor Author

I'm personally not sure the interlocked complexity is worth it over a dedicated mutation (Add/Remove) lock? The class would have one more field, though.

I would actually argue this is simpler than the previous approach, which used some very tricky-looking interleaved locking that I couldn't trivially prove was correct. Also, the excessive locking was causing quite a bit of contention; I'll include details in the PR description.

@En3Tho
Contributor

En3Tho commented Sep 4, 2024

No, I mean you don't really need interlocked operations to update the array, because adding/removing listeners is expected to be rare and shouldn't be a cause of contention. Even with interlocked operations there will be contention on mutation, just not apparent/reported.

The idea here is to remove locks/waits from the actual listener iteration (which you did), but it doesn't matter for add/remove (because the behavioral change you made by removing the version check means iteration no longer depends on it). A sketch of that simpler variant is below.
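For context, here is a sketch of the simpler variant being suggested: readers stay lock-free, while mutation takes an ordinary lock instead of a compare-and-swap retry loop. Names are hypothetical; this is not the PR's final code. Because the whole array reference is published atomically, readers can never see a torn list either way.

```csharp
using System;
using System.Threading;

internal sealed class CopyOnWriteList<T> where T : class
{
    private T[] _snapshot = Array.Empty<T>();
    private readonly object _writeLock = new object();   // the "one more field"

    public void ForEach(Action<T> action)
    {
        // Hot path is unchanged: a volatile read of an immutable snapshot.
        T[] items = Volatile.Read(ref _snapshot);
        foreach (T item in items)
            action(item);
    }

    public void Add(T item)
    {
        lock (_writeLock)   // writers only serialize among themselves
        {
            T[] current = _snapshot;
            T[] next = new T[current.Length + 1];
            Array.Copy(current, next, current.Length);
            next[current.Length] = item;
            Volatile.Write(ref _snapshot, next);   // atomic reference publish
        }
    }

    public bool Remove(T item)
    {
        lock (_writeLock)
        {
            T[] current = _snapshot;
            int index = Array.IndexOf(current, item);
            if (index < 0)
                return false;

            T[] next = new T[current.Length - 1];
            Array.Copy(current, next, index);
            Array.Copy(current, index + 1, next, index, current.Length - index - 1);
            Volatile.Write(ref _snapshot, next);
            return true;
        }
    }
}
```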

@AlgorithmsAreCool
Contributor Author

All those Array.Copy calls will give the GC some more work to do. I'm curious what the net effect of this change is on the affected benchmarks; do you have some numbers already?

Yes, this will be a performance regression for adding and removing listeners, but with a few caveats:

  • Adding/removing listeners should be relatively infrequent. These allocations should only begin to matter if
    1. Someone is adding/removing listeners more than a few dozen times per second (which sounds crazy)
    2. Someone has more than a few dozen listeners and is still churning the list continuously
  • The hot path is broadcasting Activity notifications to listeners, which should be faster in this approach

@AlgorithmsAreCool
Contributor Author

No, I mean you don't really need interlocked operations to update the array, because adding/removing listeners is expected to be rare and shouldn't be a cause of contention. Even with interlocked operations there will be contention on mutation, just not apparent/reported.

The idea here is to remove locks/waits from the actual listener iteration (which you did), but it doesn't matter for add/remove

I'm not sure I understand. How else, besides locking or Interlocked, could I guarantee that the Activity Start/Stop thread will not see a torn list of listeners when a listener is added or removed? And the use of locks was causing contention not with the threads mutating the collection, but with the various worker threads starting/stopping Activities, which is on the main path.

@cijothomas
Contributor

All those Array.Copy calls will give the GC some more work to do. I'm curious what the net effect of this change is on the affected benchmarks; do you have some numbers already?

Yes, this will be a performance regression for adding and removing listeners, but with a few caveats:

  • Adding/removing listeners should be relatively infrequent. These allocations should only begin to matter if

    1. Someone is adding/removing listeners more than a few dozen times per second (which sounds crazy)
    2. Someone has more than a few dozen listeners and is still churning the list continuously
  • The hot path is broadcasting Activity notifications to listeners, which should be faster in this approach

+1. Adding/Removing listeners is a very rare operation.

@cijothomas
Contributor

@AlgorithmsAreCool If you'd like to, you can run the stress test tool built by OpenTelemetry for testing OTel itself!
https://github.com/open-telemetry/opentelemetry-dotnet/tree/main/test/OpenTelemetry.Tests.Stress.Traces - The stress test shows the throughput, and it should show an improvement when applying the changes from this PR.

@AlgorithmsAreCool
Contributor Author

Will do!

@cijothomas
Contributor

cijothomas commented Sep 4, 2024

@AlgorithmsAreCool If you'd like to, you can run the stress test tool built by OpenTelemetry for testing OTel itself! https://github.com/open-telemetry/opentelemetry-dotnet/tree/main/test/OpenTelemetry.Tests.Stress.Traces - The stress test shows the throughput, and it should show an improvement when applying the changes from this PR.

In my dev box (16 logical cores), the stress test for Traces shows a throughput of ~2M/sec, whereas the one for Logs shows 50M/sec. A good chunk of this difference is likely the contention being fixed here. The stress test shows 0 contention for Logs, as Logs has no "listeners" feature! Hopefully, with this change, Traces should see a huge jump in throughput!

@AlgorithmsAreCool
Contributor Author

@AlgorithmsAreCool If you'd like to, you can run the stress test tool built by OpenTelemetry for testing OTel itself! https://github.com/open-telemetry/opentelemetry-dotnet/tree/main/test/OpenTelemetry.Tests.Stress.Traces - The stress test shows the throughput, and it should show an improvement when applying the changes from this PR.

In my dev box (16 logical cores), the stress test for Traces shows a throughput of ~2M/sec, whereas the one for Logs shows 50M/sec. A good chunk of this difference is likely the contention being fixed here. The stress test shows 0 contention for Logs, as Logs has no "listeners" feature! Hopefully, with this change, Traces should see a huge jump in throughput!

I have run the stress test on my local machine and it shows a very large jump in performance (1.4M/s -> 15M/s).

@AlgorithmsAreCool
Contributor Author

@dotnet/area-system-diagnostics-activity I'm not sure how to proceed; the build failures, as far as I can tell, do not seem to be related.

@tommcdon
Member

tommcdon commented Sep 5, 2024

@dotnet/area-system-diagnostics-activity I'm not sure how to proceed; the build failures, as far as I can tell, do not seem to be related.

Adding @elinor-fung @jkoritzinsky as the failures seem to be COM/binder source gen related

@jkoritzinsky
Member

The COM generator test failure looks like a Helix infra failure, based on the logs.

@tarekgh tarekgh added this to the 10.0.0 milestone Sep 5, 2024
@tarekgh
Member

tarekgh commented Sep 5, 2024

CC @noahfalk

Member

@noahfalk noahfalk left a comment

LGTM! I'll defer to @tarekgh for signing off.

Member

@tarekgh tarekgh left a comment

Modulo the minor comments I left, LGTM. Thanks!

@AlgorithmsAreCool
Contributor Author

I'm assuming someone else has to merge this?

@Kielek

Kielek commented Sep 6, 2024

@tarekgh, based on what I see, this is/will be targeted at .NET 10. Do you also plan to backport to .NET 9, or is it too late?

@tarekgh
Member

tarekgh commented Sep 6, 2024

@Kielek

based on what I see, this is/will be targeted at .NET 10. Do you also plan to backport to .NET 9, or is it too late?

From previous discussions, the parties who reported this were OK with having it in the next release (10). Our current plan is to have it in .NET 10 to reduce any risk for .NET 9.0. Is it blocking any real scenario for you? Please elaborate on why you are asking to have the fix in 9.0.

@tarekgh tarekgh merged commit ad430a1 into dotnet:main Sep 6, 2024
83 of 85 checks passed
@AlgorithmsAreCool
Contributor Author

@Kielek

based on what I see, this is/will be targeted at .NET 10. Do you also plan to backport to .NET 9, or is it too late?

From previous discussions, the parties who reported this were OK with having it in the next release (10). Our current plan is to have it in .NET 10 to reduce any risk for .NET 9.0. Is it blocking any real scenario for you? Please elaborate on why you are asking to have the fix in 9.0.

For my team, it isn't necessarily blocking our use of tracing, but we operate at least one very low-latency service that might notice this change.

I'm curious how much of the 15% impact to topline throughput this change wins back.

@AlgorithmsAreCool AlgorithmsAreCool deleted the algorithmsarecool/reduce-activity-contention branch September 6, 2024 17:50
@tarekgh
Member

tarekgh commented Sep 6, 2024

@AlgorithmsAreCool This behavior has been present for a few releases and is not a regression. While improving performance is beneficial, we need a strong justification to accept this fix for .NET 9.0, as we aim to minimize risks in the release, especially since this change hasn't been tested in preview builds. I understand this isn't a blocker for your scenario. However, if the low latency issue in your service is significantly impacted by this and you feel it's critical to address it soon, we can consider including the fix in .NET 9.0.

jtschuster pushed a commit to jtschuster/runtime that referenced this pull request Sep 17, 2024
* Remove thread contention from Activity Start/Stop

Author:    algorithmsarecool <algorithmsarecool@gmail.com>

* Fix ref parameters and whitespace

Author:    algorithmsarecool<algorithmsarecool@gmail.com>

* PR feedback.
- Reduce duplication
- add comments and make code more obvious
- Use IndexOf

Author:    algorithmsarecool <algorithmsarecool@gmail.com>

* PR feedback to simplify locking strategy

* PR feedback, final nits
sirntar pushed a commit to sirntar/runtime that referenced this pull request Sep 30, 2024
* Remove thread contention from Activity Start/Stop

Author:    algorithmsarecool <algorithmsarecool@gmail.com>

* Fix ref parameters and whitespace

Author:    algorithmsarecool<algorithmsarecool@gmail.com>

* PR feedback.
- Reduce duplication
- add comments and make code more obvious
- Use IndexOf

Author:    algorithmsarecool <algorithmsarecool@gmail.com>

* PR feedback to simplify locking strategy

* PR feedback, final nits
@github-actions github-actions bot locked and limited conversation to collaborators Oct 7, 2024
@tarekgh
Member

tarekgh commented Oct 29, 2024

/backport to 9.0-staging

@github-actions github-actions bot unlocked this conversation Oct 29, 2024
Contributor

Started backporting to 9.0-staging: https://github.com/dotnet/runtime/actions/runs/11582206026

Contributor

@tarekgh an error occurred while backporting to 9.0-staging, please check the run log for details!

The process '/usr/bin/git' failed with exit code 1

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 29, 2024
@tarekgh
Member

tarekgh commented Oct 29, 2024

/backport to release/9.0-staging

@github-actions github-actions bot unlocked this conversation Oct 29, 2024
Contributor

Started backporting to release/9.0-staging: https://github.com/dotnet/runtime/actions/runs/11582621574

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 29, 2024
Labels
area-System.Diagnostics.Activity, community-contribution (Indicates that the PR has been added by a community member)
Development

Successfully merging this pull request may close these issues.

Adding OpenTelemetry instrumentation results in 15% RPS degradation
10 participants