
# Fix Histogram synchronization issue #3534

## Conversation

**@utpilla** (Contributor) commented Aug 3, 2022

Changes

- Use the same "locking" mechanism for the Update and TakeSnapshot methods (see the sketch below)
- Remove HistogramBuckets.LockObject
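
Roughly, the shared pattern looks like this (a minimal sketch with simplified state and a hypothetical `HistogramPointSketch` type; the actual MetricPoint/HistogramBuckets types carry more fields than shown):

```csharp
using System.Threading;

internal sealed class HistogramPointSketch
{
    private int isCriticalSectionOccupied; // 0 = free, 1 = taken
    private long runningCount;
    private double runningSum;

    public void Update(double number)
    {
        this.AcquireLock();

        unchecked
        {
            this.runningCount++;
            this.runningSum += number;
        }

        this.ReleaseLock();
    }

    public (long Count, double Sum) TakeSnapshot()
    {
        this.AcquireLock();
        var snapshot = (this.runningCount, this.runningSum);
        this.ReleaseLock();
        return snapshot;
    }

    private void AcquireLock()
    {
        var sw = default(SpinWait);
        while (Interlocked.CompareExchange(ref this.isCriticalSectionOccupied, 1, 0) != 0)
        {
            sw.SpinOnce(); // back off until the flag is observed free
        }
    }

    private void ReleaseLock()
    {
        // The release must also go through Interlocked so the field writes
        // above cannot be reordered past it (see the discussion below).
        Interlocked.Exchange(ref this.isCriticalSectionOccupied, 0);
    }
}
```

The key property is that Update and TakeSnapshot contend on the same flag, so a snapshot can never observe a half-applied update.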

Stress Test Results (high contention) with Prometheus Exporter scraping interval = 10s

Multiple threads updating the same Histogram MetricPoint

```text
C:\opentelemetry-dotnet\test\OpenTelemetry.Tests.Stress.Metrics>dotnet run --framework net6.0 --configuration Release
Running (concurrency = 8, prometheusEndpoint = http://localhost:9184/metrics/), press <Esc> to stop...
2022-08-05T00:14:26.5444500Z Loops: 838,148,771, Loops/Second: 6,927,676, CPU Cycles/Loop: 574, RunwayTime (Seconds): 120
2022-08-05T00:14:26.7499001Z Loops: 839,576,993, Loops/Second: 6,954,500, CPU Cycles/Loop: 555, RunwayTime (Seconds): 120
2022-08-05T00:14:26.9569740Z Loops: 841,011,261, Loops/Second: 6,939,252, CPU Cycles/Loop: 504, RunwayTime (Seconds): 121
2022-08-05T00:14:27.1648655Z Loops: 842,455,627, Loops/Second: 6,956,855, CPU Cycles/Loop: 504, RunwayTime (Seconds): 121
2022-08-05T00:14:27.3728648Z Loops: 843,925,397, Loops/Second: 7,081,836, CPU Cycles/Loop: 560, RunwayTime (Seconds): 121
2022-08-05T00:14:27.5806669Z Loops: 845,361,965, Loops/Second: 6,926,754, CPU Cycles/Loop: 497, RunwayTime (Seconds): 121
2022-08-05T00:14:27.7877784Z Loops: 846,817,179, Loops/Second: 7,039,306, CPU Cycles/Loop: 483, RunwayTime (Seconds): 121
```

Stress Test Results (low contention) with Prometheus Exporter scraping interval = 10s

Multiple threads updating random Histogram MetricPoints

```text
Running (concurrency = 8, prometheusEndpoint = http://localhost:9184/metrics/), press <Esc> to stop...
2022-08-05T00:00:58.8000307Z Loops: 2,810,996,143, Loops/Second: 22,274,636, CPU Cycles/Loop: 1,035, RunwayTime (Seconds): 126
2022-08-05T00:00:59.0289061Z Loops: 2,817,588,665, Loops/Second: 22,209,231, CPU Cycles/Loop: 1,031, RunwayTime (Seconds): 126
2022-08-05T00:00:59.2534955Z Loops: 2,822,548,480, Loops/Second: 22,216,776, CPU Cycles/Loop: 1,040, RunwayTime (Seconds): 126
2022-08-05T00:00:59.4750764Z Loops: 2,827,539,322, Loops/Second: 22,264,727, CPU Cycles/Loop: 1,031, RunwayTime (Seconds): 126
2022-08-05T00:00:59.7044293Z Loops: 2,832,168,493, Loops/Second: 21,668,380, CPU Cycles/Loop: 1,030, RunwayTime (Seconds): 127
2022-08-05T00:00:59.9354118Z Loops: 2,837,591,015, Loops/Second: 22,042,236, CPU Cycles/Loop: 1,026, RunwayTime (Seconds): 127
2022-08-05T00:01:00.1581187Z Loops: 2,842,384,330, Loops/Second: 21,681,888, CPU Cycles/Loop: 1,027, RunwayTime (Seconds): 127
```

Benchmark Results

```text
// * Summary *

BenchmarkDotNet=v0.13.1, OS=Windows 10.0.22000
Intel Core i7-9700 CPU 3.00GHz, 1 CPU, 8 logical and 8 physical cores
.NET SDK=6.0.302
  [Host]     : .NET 6.0.7 (6.0.722.32202), X64 RyuJIT
  DefaultJob : .NET 6.0.7 (6.0.722.32202), X64 RyuJIT
```

| Method | BoundCount | Mean | Error | StdDev | Median |
|---|---|---|---|---|---|
| HistogramHotPath | 10 | 41.87 ns | 0.847 ns | 1.101 ns | 42.11 ns |
| HistogramWith1LabelHotPath | 10 | 91.11 ns | 1.836 ns | 3.402 ns | 92.75 ns |
| HistogramWith3LabelsHotPath | 10 | 194.98 ns | 3.920 ns | 6.217 ns | 196.77 ns |
| HistogramWith5LabelsHotPath | 10 | 291.10 ns | 5.825 ns | 10.798 ns | 292.59 ns |
| HistogramWith7LabelsHotPath | 10 | 355.66 ns | 7.119 ns | 15.324 ns | 356.80 ns |
| HistogramHotPath | 20 | 47.52 ns | 0.962 ns | 1.029 ns | 47.54 ns |
| HistogramWith1LabelHotPath | 20 | 98.94 ns | 2.003 ns | 3.907 ns | 99.71 ns |
| HistogramWith3LabelsHotPath | 20 | 197.79 ns | 3.957 ns | 7.134 ns | 198.93 ns |
| HistogramWith5LabelsHotPath | 20 | 300.20 ns | 6.010 ns | 12.545 ns | 299.06 ns |
| HistogramWith7LabelsHotPath | 20 | 360.02 ns | 7.161 ns | 13.624 ns | 363.94 ns |
| HistogramHotPath | 50 | 54.60 ns | 1.095 ns | 1.705 ns | 55.09 ns |
| HistogramWith1LabelHotPath | 50 | 101.19 ns | 2.014 ns | 3.831 ns | 101.43 ns |
| HistogramWith3LabelsHotPath | 50 | 215.94 ns | 4.332 ns | 8.347 ns | 217.93 ns |
| HistogramWith5LabelsHotPath | 50 | 309.55 ns | 4.752 ns | 4.212 ns | 309.56 ns |
| HistogramWith7LabelsHotPath | 50 | 372.97 ns | 7.288 ns | 6.086 ns | 373.39 ns |
| HistogramHotPath | 100 | 65.78 ns | 1.342 ns | 2.205 ns | 66.74 ns |
| HistogramWith1LabelHotPath | 100 | 114.77 ns | 2.308 ns | 4.446 ns | 115.07 ns |
| HistogramWith3LabelsHotPath | 100 | 231.84 ns | 4.622 ns | 11.681 ns | 233.23 ns |
| HistogramWith5LabelsHotPath | 100 | 323.52 ns | 6.354 ns | 8.482 ns | 325.70 ns |
| HistogramWith7LabelsHotPath | 100 | 410.18 ns | 8.138 ns | 15.680 ns | 409.59 ns |

**@utpilla** requested a review from a team on August 3, 2022 at 23:41
**codecov bot** commented Aug 3, 2022

Codecov Report

Merging #3534 (0372ff6) into main (ad55478) will increase coverage by 0.01%.
The diff coverage is 86.20%.


```diff
@@            Coverage Diff             @@
##             main    #3534      +/-   ##
==========================================
+ Coverage   87.26%   87.28%   +0.01%
==========================================
  Files         277      277
  Lines        9931     9938       +7
==========================================
+ Hits         8666     8674       +8
+ Misses       1265     1264       -1
```
| Impacted Files | Coverage Δ |
|---|---|
| src/OpenTelemetry/Metrics/HistogramBuckets.cs | 100.00% <ø> (ø) |
| src/OpenTelemetry/Metrics/MetricPoint.cs | 83.10% <86.20%> (-1.90%) ⬇️ |
| ...ter.ZPages/Implementation/ZPagesActivityTracker.cs | 97.14% <0.00%> (-2.86%) ⬇️ |
| ...Telemetry/Internal/SelfDiagnosticsEventListener.cs | 96.87% <0.00%> (-0.79%) ⬇️ |
| ...nTelemetry/Internal/OpenTelemetrySdkEventSource.cs | 77.50% <0.00%> (+0.83%) ⬆️ |
| ...ZPages/Implementation/ZPagesExporterEventSource.cs | 62.50% <0.00%> (+6.25%) ⬆️ |
| ...emetry.Api/Internal/OpenTelemetryApiEventSource.cs | 82.35% <0.00%> (+8.82%) ⬆️ |
| ...tation/OpenTelemetryProtocolExporterEventSource.cs | 85.00% <0.00%> (+10.00%) ⬆️ |

Review thread on the diff in `Update`:

```diff
@@ -337,16 +337,18 @@ internal void Update(double number)
         var sw = default(SpinWait);
         while (true)
         {
-            if (Interlocked.Exchange(ref this.histogramBuckets.IsCriticalSectionOccupied, 1) == 0)
+            if (Interlocked.CompareExchange(ref this.histogramBuckets.IsCriticalSectionOccupied, 1, 0) == 0)
```
**Member** commented:

Just curious if there were more learnings based on the questions raised here #3458 (comment).

Does CompareExchange help alleviate the concerns? Or did you learn the concern was unfounded?

**@utpilla** (Contributor, Author) replied:

I introduced CompareExchange in this PR because I think it's much better to try to update IsCriticalSectionOccupied to 1 only when it's actually 0. With Interlocked.Exchange, we would be setting that variable to 1 even when it's already 1, which seems wasteful.
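
In code terms, the two acquire attempts differ like this (a minimal sketch; `flag` stands in for IsCriticalSectionOccupied):

```csharp
using System.Threading;

internal sealed class AcquireSketch
{
    private int flag; // stand-in for IsCriticalSectionOccupied

    // Exchange unconditionally writes 1, even when the flag is already 1.
    public bool TryAcquireWithExchange()
        => Interlocked.Exchange(ref this.flag, 1) == 0;

    // CompareExchange writes 1 only when the current value is 0; if the flag
    // is already 1, the comparison fails and no store happens.
    public bool TryAcquireWithCompareExchange()
        => Interlocked.CompareExchange(ref this.flag, 1, 0) == 0;
}
```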

Regarding the learnings for the questions asked in #3458:

  1. We can use Interlocked operations to devise a "lock" that synchronizes access to the histogram-related fields among multiple threads. However, releasing this "lock", i.e., setting IsCriticalSectionOccupied = 0, has to be done with an Interlocked statement as well, because Interlocked creates a memory fence around that variable. This prevents reorderings of instructions that could set IsCriticalSectionOccupied to 0 before the original thread is done processing the histogram fields, which would allow another thread to enter the critical section while the original thread has not yet exited it.

  2. Any other thread that might access these histogram fields must use the same synchronization mechanism. That's why I have updated the TakeSnapshot method to use the same Interlocked operations on IsCriticalSectionOccupied to determine whether it's safe to read these fields.

  3. The instructions within this "lock" (in this case, the code under `unchecked`) can be reordered. That's okay because we don't care what order they execute in.

  4. Regarding the freshness of the values: as long as all the threads that access the same set of fields use the same synchronization mechanism (whether the C# lock statement or our Interlocked-based "lock"), any thread that enters the critical section will see the most up-to-date values of these fields.

  5. The only thing to be wary of with this Interlocked-based "lock" is exceptions. If an exception were thrown while a thread was in the critical section, IsCriticalSectionOccupied would never be set back to 0, and other threads waiting to enter the critical section could never enter it. In our case, the instructions inside the "lock" cannot throw, so we don't have to worry about that (see the sketch after this list).
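
On point 5, a hedged sketch of what the pattern would need if the guarded code could throw (not required in this PR, since the guarded instructions cannot throw; names are hypothetical):

```csharp
using System;
using System.Threading;

internal sealed class ExceptionSafeLockSketch
{
    private int isCriticalSectionOccupied;

    public void GuardedOperation(Action body)
    {
        var sw = default(SpinWait);
        while (Interlocked.CompareExchange(ref this.isCriticalSectionOccupied, 1, 0) != 0)
        {
            sw.SpinOnce();
        }

        try
        {
            body(); // hypothetical: could throw
        }
        finally
        {
            // Without this finally, a throw would leave the flag at 1 and
            // every other thread would spin forever.
            Interlocked.Exchange(ref this.isCriticalSectionOccupied, 0);
        }
    }
}
```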

**Member** commented:

> I introduced CompareExchange in this PR because I think it's much better to try to update IsCriticalSectionOccupied to 1 only when it's actually 0. With Interlocked.Exchange, we would be setting that variable to 1 even when it's already 1, which seems wasteful.

^ It might be good to see if perf is impacted by this.

**Member** commented:

Excellent summary! Thank you. Number 5 in particular had not occurred to me, but absolutely makes sense.

I'd be interested in perf too. Given that the snapshot and update threads were previously not synchronizing with one another, I'm curious whether perf is impacted now that they are correctly synchronized. Maybe not by much, since the snapshot thread runs only occasionally...

**Member** commented:

I'm OK with this as-is, but we should probably do some perf testing using lock, SpinLock, and this new Interlocked code (minimal forms of the three are sketched below). I think this will be the fastest way, but I don't want to blindly trust the old test that "proved" Interlocked is faster than lock, because that test was bugged 😄

PS: Special shout-out to @noahfalk for helping us on this 👏
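
For reference, minimal forms of the three candidates under discussion (a sketch with assumed names, not the SDK code):

```csharp
using System;
using System.Threading;

internal sealed class LockCandidatesSketch
{
    private readonly object lockObject = new object();
    private SpinLock spinLock = new SpinLock(enableThreadOwnerTracking: false);
    private int flag;

    // Candidate 1: the classic C# lock statement (Monitor under the hood).
    public void WithClassicLock(Action body)
    {
        lock (this.lockObject)
        {
            body();
        }
    }

    // Candidate 2: System.Threading.SpinLock. Note the field is deliberately
    // not readonly; SpinLock is a mutable struct.
    public void WithSpinLock(Action body)
    {
        bool taken = false;
        try
        {
            this.spinLock.Enter(ref taken);
            body();
        }
        finally
        {
            if (taken)
            {
                this.spinLock.Exit();
            }
        }
    }

    // Candidate 3: the Interlocked-based flag used in this PR.
    public void WithInterlockedFlag(Action body)
    {
        var sw = default(SpinWait);
        while (Interlocked.CompareExchange(ref this.flag, 1, 0) != 0)
        {
            sw.SpinOnce();
        }

        body();
        Interlocked.Exchange(ref this.flag, 0);
    }
}
```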

**Member** commented:

For correctness, we need to modify this test: https://github.com/open-telemetry/opentelemetry-dotnet/blob/main/test/OpenTelemetry.Tests/Metrics/MetricAPITest.cs#L1175
It validates the multiple-threads-writing scenario.
We can also add a reader thread, and ultimately everything should add up correctly. (Ideally we want to modify the stress tests to do this, so you can run them for hours and, at the end, they'll confirm whether the metrics are correct.) A rough sketch of the suggested shape follows.
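
A hedged sketch of that shape (hypothetical names and counts; not the actual MetricAPITest code, and the real test would snapshot through the SDK and assert against the exported metrics rather than print):

```csharp
using System;
using System.Diagnostics.Metrics;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

internal static class ReaderWriterTestSketch
{
    public static void Main()
    {
        using var meter = new Meter("TestMeter");
        var histogram = meter.CreateHistogram<long>("histogram");

        long expectedCount = 0;
        var writers = Enumerable.Range(0, 4)
            .Select(_ => Task.Run(() =>
            {
                for (int i = 0; i < 1_000_000; i++)
                {
                    histogram.Record(1);
                    Interlocked.Increment(ref expectedCount);
                }
            }))
            .ToArray();

        // Reader: in the real test this would force the SDK to take snapshots
        // (e.g., by flushing the MeterProvider) while the writers keep going.
        var reader = Task.Run(() =>
        {
            while (!writers.All(t => t.IsCompleted))
            {
                Thread.Sleep(10); // snapshot here in the real test
            }
        });

        Task.WaitAll(writers);
        reader.Wait();

        // The real test would assert that the exported histogram count equals
        // expectedCount once all writers and the reader have finished.
        Console.WriteLine($"Expected count: {expectedCount}");
    }
}
```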

**@utpilla** (Contributor, Author) commented Aug 4, 2022

I ran Benchmark and Stress tests for Interlocked.Exchange vs Interlocked.CompareExchange. The performance is comparable in both the benchmarks and the stress tests. In fact, I ran the benchmark multiple times only to find that sometimes CompareExchange comes out on top and other times Exchange does. The difference in time taken between the two methods is ~0.02 ns.

```csharp
using System.Threading;
using BenchmarkDotNet.Attributes;

namespace Benchmarks.Metrics
{
    [MemoryDiagnoser]
    public class TestBenchmarks
    {
        // Starts at 1, so both benchmarks exercise the "flag already taken"
        // path: Exchange still writes 1 over 1, while CompareExchange fails
        // the comparison and does not write.
        private int isLocked = 1;

        [Benchmark]
        public void Exchange()
        {
            var temp = Interlocked.Exchange(ref this.isLocked, 1) == 0;
        }

        [Benchmark]
        public void CompareExchange()
        {
            var temp = Interlocked.CompareExchange(ref this.isLocked, 1, 0) == 0;
        }
    }
}
```
| Method | Mean | Error | StdDev | Allocated |
|---|---|---|---|---|
| Exchange | 4.351 ns | 0.0465 ns | 0.0435 ns | - |
| CompareExchange | 4.330 ns | 0.0663 ns | 0.0620 ns | - |

I have updated the code to go back to using Interlocked.Exchange anyway, just to have fewer things to review 😄

**@noahfalk** commented Aug 4, 2022

Beware that single-threaded, zero-contention lock perf benchmarks can (and probably will) give very different results vs. a test with multiple threads and some contention. Even if threads are never explicitly waiting for another to release the lock, down at the hardware level the cache line backing the critical-section memory has to move between the caches connected to different cores. When the cores continuously renegotiate which one has exclusive access to that memory, it creates large additional latencies on those Interlocked instructions.
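
For illustration, a minimal sketch of a contended measurement of the kind described (names, thread count, and iteration count are arbitrary assumptions):

```csharp
using System;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;

internal static class ContentionSketch
{
    private static int sharedFlag;

    public static void Main()
    {
        const int ThreadCount = 8;
        const long IterationsPerThread = 10_000_000;

        var stopwatch = Stopwatch.StartNew();
        var tasks = new Task[ThreadCount];
        for (int i = 0; i < ThreadCount; i++)
        {
            tasks[i] = Task.Run(() =>
            {
                for (long j = 0; j < IterationsPerThread; j++)
                {
                    // Every thread targets the same memory location, forcing
                    // cache-line ownership to bounce between cores.
                    Interlocked.CompareExchange(ref sharedFlag, 1, 0);
                    Interlocked.Exchange(ref sharedFlag, 0);
                }
            });
        }

        Task.WaitAll(tasks);
        stopwatch.Stop();

        // Compare this against a single-threaded run of the same loop body to
        // see the cost of contention on the Interlocked instructions.
        Console.WriteLine($"Elapsed: {stopwatch.ElapsedMilliseconds} ms");
    }
}
```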

**@utpilla** (Contributor, Author) commented Aug 5, 2022

Check the PRs for the other two synchronization techniques:

Performance comparison of the different synchronization techniques

High-contention scenario:

| Synchronization technique | Average Loops/second | Average CPU cycles/loop |
|---|---|---|
| Interlocked | 6.97 M | 525 |
| Classic lock | 5.16 M | 3117 |
| SpinLock | 5.14 M | 775 |

Low-contention scenario:

| Synchronization technique | Average Loops/second | Average CPU cycles/loop |
|---|---|---|
| Interlocked | 22.05 M | 1031 |
| Classic lock | 19.92 M | 1127 |
| SpinLock | 18.54 M | 1160 |

The Interlocked-based "lock" outperforms the classic lock and SpinLock in the high-contention scenario, where multiple threads try to update the same MetricPoint in parallel. It achieved ~35% more loops/sec than the other techniques, and it also has significantly lower CPU cycles/loop: ~83% less than the classic lock and ~32% less than SpinLock.

For the low-contention scenario there is still some perf gain for the Interlocked-based "lock", but it's not as pronounced.

The benchmark results for the Interlocked-based "lock" are comparable to those of the classic lock. The benchmark numbers for SpinLock are lower than those of the Interlocked-based "lock" and the classic lock.

**@CodeBlanch** (Member) left a comment:
LGTM!
