Fix Histogram synchronization issue #3534

utpilla · 2022-08-03T23:41:11Z

Changes

Use the same "locking" mechanism for Update and TakeSnapshot methods
Remove HistogramBuckets.LockObject

Stress Test Results (high contention) with Prometheus Exporter scraping interval = 10s

Multiple threads updating the same Histogram MetricPoint

C:\opentelemetry-dotnet\test\OpenTelemetry.Tests.Stress.Metrics>dotnet run --framework net6.0 --configuration Release
Running (concurrency = 8, prometheusEndpoint = http://localhost:9184/metrics/), press <Esc> to stop...
2022-08-05T00:14:26.5444500Z Loops: 838,148,771, Loops/Second: 6,927,676, CPU Cycles/Loop: 574, RunwayTime (Seconds): 120
2022-08-05T00:14:26.7499001Z Loops: 839,576,993, Loops/Second: 6,954,500, CPU Cycles/Loop: 555, RunwayTime (Seconds): 120
2022-08-05T00:14:26.9569740Z Loops: 841,011,261, Loops/Second: 6,939,252, CPU Cycles/Loop: 504, RunwayTime (Seconds): 121
2022-08-05T00:14:27.1648655Z Loops: 842,455,627, Loops/Second: 6,956,855, CPU Cycles/Loop: 504, RunwayTime (Seconds): 121
2022-08-05T00:14:27.3728648Z Loops: 843,925,397, Loops/Second: 7,081,836, CPU Cycles/Loop: 560, RunwayTime (Seconds): 121
2022-08-05T00:14:27.5806669Z Loops: 845,361,965, Loops/Second: 6,926,754, CPU Cycles/Loop: 497, RunwayTime (Seconds): 121
2022-08-05T00:14:27.7877784Z Loops: 846,817,179, Loops/Second: 7,039,306, CPU Cycles/Loop: 483, RunwayTime (Seconds): 121

Stress Test Results (low contention) with Prometheus Exporter scraping interval = 10s

Multiple threads updating random Histogram MetricPoints

Running (concurrency = 8, prometheusEndpoint = http://localhost:9184/metrics/), press <Esc> to stop...
2022-08-05T00:00:58.8000307Z Loops: 2,810,996,143, Loops/Second: 22,274,636, CPU Cycles/Loop: 1,035, RunwayTime (Seconds): 126
2022-08-05T00:00:59.0289061Z Loops: 2,817,588,665, Loops/Second: 22,209,231, CPU Cycles/Loop: 1,031, RunwayTime (Seconds): 126
2022-08-05T00:00:59.2534955Z Loops: 2,822,548,480, Loops/Second: 22,216,776, CPU Cycles/Loop: 1,040, RunwayTime (Seconds): 126
2022-08-05T00:00:59.4750764Z Loops: 2,827,539,322, Loops/Second: 22,264,727, CPU Cycles/Loop: 1,031, RunwayTime (Seconds): 126
2022-08-05T00:00:59.7044293Z Loops: 2,832,168,493, Loops/Second: 21,668,380, CPU Cycles/Loop: 1,030, RunwayTime (Seconds): 127
2022-08-05T00:00:59.9354118Z Loops: 2,837,591,015, Loops/Second: 22,042,236, CPU Cycles/Loop: 1,026, RunwayTime (Seconds): 127
2022-08-05T00:01:00.1581187Z Loops: 2,842,384,330, Loops/Second: 21,681,888, CPU Cycles/Loop: 1,027, RunwayTime (Seconds): 127

Benchmark Results

// * Summary *

BenchmarkDotNet=v0.13.1, OS=Windows 10.0.22000
Intel Core i7-9700 CPU 3.00GHz, 1 CPU, 8 logical and 8 physical cores
.NET SDK=6.0.302
[Host] : .NET 6.0.7 (6.0.722.32202), X64 RyuJIT
DefaultJob : .NET 6.0.7 (6.0.722.32202), X64 RyuJIT

Method	BoundCount	Mean	Error	StdDev	Median
HistogramHotPath	10	41.87 ns	0.847 ns	1.101 ns	42.11 ns
HistogramWith1LabelHotPath	10	91.11 ns	1.836 ns	3.402 ns	92.75 ns
HistogramWith3LabelsHotPath	10	194.98 ns	3.920 ns	6.217 ns	196.77 ns
HistogramWith5LabelsHotPath	10	291.10 ns	5.825 ns	10.798 ns	292.59 ns
HistogramWith7LabelsHotPath	10	355.66 ns	7.119 ns	15.324 ns	356.80 ns
HistogramHotPath	20	47.52 ns	0.962 ns	1.029 ns	47.54 ns
HistogramWith1LabelHotPath	20	98.94 ns	2.003 ns	3.907 ns	99.71 ns
HistogramWith3LabelsHotPath	20	197.79 ns	3.957 ns	7.134 ns	198.93 ns
HistogramWith5LabelsHotPath	20	300.20 ns	6.010 ns	12.545 ns	299.06 ns
HistogramWith7LabelsHotPath	20	360.02 ns	7.161 ns	13.624 ns	363.94 ns
HistogramHotPath	50	54.60 ns	1.095 ns	1.705 ns	55.09 ns
HistogramWith1LabelHotPath	50	101.19 ns	2.014 ns	3.831 ns	101.43 ns
HistogramWith3LabelsHotPath	50	215.94 ns	4.332 ns	8.347 ns	217.93 ns
HistogramWith5LabelsHotPath	50	309.55 ns	4.752 ns	4.212 ns	309.56 ns
HistogramWith7LabelsHotPath	50	372.97 ns	7.288 ns	6.086 ns	373.39 ns
HistogramHotPath	100	65.78 ns	1.342 ns	2.205 ns	66.74 ns
HistogramWith1LabelHotPath	100	114.77 ns	2.308 ns	4.446 ns	115.07 ns
HistogramWith3LabelsHotPath	100	231.84 ns	4.622 ns	11.681 ns	233.23 ns
HistogramWith5LabelsHotPath	100	323.52 ns	6.354 ns	8.482 ns	325.70 ns
HistogramWith7LabelsHotPath	100	410.18 ns	8.138 ns	15.680 ns	409.59 ns

codecov · 2022-08-03T23:52:03Z

Codecov Report

Merging #3534 (0372ff6) into main (ad55478) will increase coverage by 0.01%.
The diff coverage is 86.20%.

@@            Coverage Diff             @@
##             main    #3534      +/-   ##
==========================================
+ Coverage   87.26%   87.28%   +0.01%     
==========================================
  Files         277      277              
  Lines        9931     9938       +7     
==========================================
+ Hits         8666     8674       +8     
+ Misses       1265     1264       -1

Impacted Files	Coverage Δ
src/OpenTelemetry/Metrics/HistogramBuckets.cs	`100.00% <ø> (ø)`
src/OpenTelemetry/Metrics/MetricPoint.cs	`83.10% <86.20%> (-1.90%)`	⬇️
...ter.ZPages/Implementation/ZPagesActivityTracker.cs	`97.14% <0.00%> (-2.86%)`	⬇️
...Telemetry/Internal/SelfDiagnosticsEventListener.cs	`96.87% <0.00%> (-0.79%)`	⬇️
...nTelemetry/Internal/OpenTelemetrySdkEventSource.cs	`77.50% <0.00%> (+0.83%)`	⬆️
...ZPages/Implementation/ZPagesExporterEventSource.cs	`62.50% <0.00%> (+6.25%)`	⬆️
...emetry.Api/Internal/OpenTelemetryApiEventSource.cs	`82.35% <0.00%> (+8.82%)`	⬆️
...tation/OpenTelemetryProtocolExporterEventSource.cs	`85.00% <0.00%> (+10.00%)`	⬆️

alanwest · 2022-08-04T00:20:21Z

src/OpenTelemetry/Metrics/MetricPoint.cs

@@ -337,16 +337,18 @@ internal void Update(double number)
                        var sw = default(SpinWait);
                        while (true)
                        {
-                            if (Interlocked.Exchange(ref this.histogramBuckets.IsCriticalSectionOccupied, 1) == 0)
+                            if (Interlocked.CompareExchange(ref this.histogramBuckets.IsCriticalSectionOccupied, 1, 0) == 0)


Just curious if there were more learnings based on the questions raised here #3458 (comment).

Does CompareExchange help alleviate the concerns? Or did you learn the concern was unfounded?

I introduced CompareExchange in this PR because I think it's much better to try to update IsCriticalSectionOccupied to 1 only when it's actually 0. With the Interlocked.Exchange we would be setting that variable to 1 even when it's already set to 1, and that seems wasteful.

Regarding the learnings for the questions asked in #3458:

We can use Interlocked operations to devise a "lock" that would synchronize the access to histogram related fields here among multiple threads. However, the part where we release this "lock" i.e., where we set IsCriticalSectionOccupied = 0 has to be done using an Interlocked statement as well. This is because Interlocked creates a memory fence around that variable. This prevents certain reorderings of instructions that might set IsCriticalSection to 0 before the original thread is done processing the histogram fields as that might allow another thread to enter the critical section while the original thread has not exited it.

Also, any other thread which might be accessing these histogram fields must use the same synchronization mechanism. That's why I have updated the TakeSnapshot method to use the same Interlocked operations to update IsCriticalSectionOccupied to determine if it's safe to read these fields.

The instructions within this "lock" can be reordered, in this case, the code under unchecked. That's okay because we don't care what order they get executed in.

Regarding the freshness of the values, as long as all the threads that access the same set of fields use the same synchronization mechanism (could be the C# lock statement or our Interlocked based "lock"), any thread that enters the critical section would see the most up-to-date value of these fields.

The only thing to be wary of in this Interlocked based "lock" is exceptions. If an exception is thrown while a thread was in the critical section, IsCriticalSectionOccupied would not be set back to 0 and other threads waiting to enter the critical section would never be able to enter it. In our case, the instructions inside the "lock" would not throw an exception so we don't have to worry about that.

I introduced CompareExchange in this PR because I think it's much better to try to update IsCriticalSectionOccupied to 1 only when it's actually 0. With the Interlocked.Exchange we would be setting that variable to 1 even when it's already set to 1, and that seems wasteful.

^ might be good to see if perf is impacted with this.

Excellent summary! Thank you. Number 5 in particular had not occurred to me, but absolutely makes sense.

I'd be interested in perf too. Given that the snapshot and update threads were previously not synchronizing with one another, I'm curious if perf is impacted now that they are correctly synchronized. Maybe not by much, since the snapshot thread only runs seldomly...

I'm OK with this as-is but we should probably do some perf testing using lock, SpinLock, and this new Interlocked code. I think this will be the fastest way but I don't want to blindly trust the old test which proved "Interlocked is faster than lock" because it was bugged when tested 😄

PS: Special shout-out to @noahfalk for helping us on this 👏

for correctness, we need to modify this tests https://github.com/open-telemetry/opentelemetry-dotnet/blob/main/test/OpenTelemetry.Tests/Metrics/MetricAPITest.cs#L1175
It validates multiple threads writing scenario.
We can add a reader thread also, and ultimately everything should add up correctly. (ideally we want to modify the stress tests to do it, so you can run it for hours and at the end, it'll confirm if the metrics are correct).

utpilla · 2022-08-04T22:34:14Z

I ran Benchmark and Stress tests for Interlocked.Exchange vs Interlocked.CompareExchange. The performance is comparable for both Benchmarks and Stress tests. In fact, I ran the benchmark multiple times only to find that sometimes CompareExchange comes out on top and other times it's Exchange. The difference between the time taken between these two methods is ~0.02ns.

Expand for code

using System.Threading;
using BenchmarkDotNet.Attributes;

namespace Benchmarks.Metrics
{
    [MemoryDiagnoser]
    public class TestBenchmarks
    {
        private int isLocked = 1;

        [Benchmark]
        public void Exchange()
        {
            var temp = Interlocked.Exchange(ref this.isLocked, 1) == 0;
        }

        [Benchmark]
        public void CompareExchange()
        {
            var temp = Interlocked.CompareExchange(ref this.isLocked, 1, 0) == 0;
        }
    }
}

Method	Mean	Error	StdDev	Allocated
Exchange	4.351 ns	0.0465 ns	0.0435 ns	-
CompareExchange	4.330 ns	0.0663 ns	0.0620 ns	-

I have updated the code to go back to using Interlocked.Exchange anyway just to have lesser things to review 😄

noahfalk · 2022-08-04T23:00:16Z

Beware that single threaded, zero contention lock perf benchmarks can (and probably will) give very different results vs. a test with multiple threads and some contention. Even if threads are never explicitly waiting for another to release the lock, down at the hardware level the cache line backing the critical section memory has to be moved between the caches connected to different cores. When the cores are continuously renegotiating which one has exclusive access to the memory that creates large additional latencies on those Interlocked instructions.

utpilla · 2022-08-05T02:27:17Z

Check the PRs for the other two synchronization techniques:

Performance comparison of the different synchronization techniques

High contention scenario:

Synchronization technique	Average Loops/second	Average CPU cycles/loop
Interlocked	6.97 M	525
Classic lock	5.16 M	3117
SpinLock	5.14 M	775

Low contention scenario:

Synchronization technique	Average Loops/second	Average CPU cycles/loop
Interlocked	22.05M	1031
Classic lock	19.92 M	1127
SpinLock	18.54 M	1160

The Interlocked based "lock" outperforms the classic lock and SpinLock for high contention scenario where multiple threads try to update the same MetricPoint in parallel. It had ~35% more loops/sec than the other techniques. It also has significantly lower CPU cycles/loop: ~83% less than classic lock and ~23% less than SpinLock.

For low contention scenario, there is still some perf gain for Interlocked based "lock", but it's not as high.

The benchmark results for Interlocked based "lock" are comparable to classic lock. The benchmark numbers for SpinLock are lower than that of Interlocked based "lock" and classic lock.

CodeBlanch

LGTM!

Fix Histogram synchronization issue

2c7b79b

utpilla requested a review from a team August 3, 2022 23:41

Fix typo

0c39b78

alanwest reviewed Aug 4, 2022

View reviewed changes

utpilla and others added 5 commits August 3, 2022 18:17

Merge branch 'main' into utpilla/Fix-Histogram-Synchronization-issue

61c53e5

Merge branch 'main' into utpilla/Fix-Histogram-Synchronization-issue

2c9c460

Merge branch 'main' into utpilla/Fix-Histogram-Synchronization-issue

b56a4b2

Revert to using INterlocked.Exchange

429d6ae

Merge branch 'main' into utpilla/Fix-Histogram-Synchronization-issue

71af576

Merge branch 'main' into utpilla/Fix-Histogram-Synchronization-issue

811511c

cijothomas approved these changes Aug 5, 2022

View reviewed changes

cijothomas added 2 commits August 5, 2022 08:52

Merge branch 'main' into utpilla/Fix-Histogram-Synchronization-issue

4a3594e

Merge branch 'main' into utpilla/Fix-Histogram-Synchronization-issue

3d03ed4

alanwest approved these changes Aug 7, 2022

View reviewed changes

cijothomas and others added 4 commits August 8, 2022 22:12

Merge branch 'main' into utpilla/Fix-Histogram-Synchronization-issue

f197a00

Merge branch 'main' into utpilla/Fix-Histogram-Synchronization-issue

9b21d0e

Merge branch 'main' into utpilla/Fix-Histogram-Synchronization-issue

6be065d

Merge branch 'main' into utpilla/Fix-Histogram-Synchronization-issue

0372ff6

CodeBlanch approved these changes Aug 23, 2022

View reviewed changes

cijothomas merged commit 968dc52 into open-telemetry:main Aug 23, 2022

This was referenced Dec 20, 2022

Update CHANGELOG with older fix #4030

Merged

[main-1.3.1] Patch Histogram synchronization fix #4031

Merged

Yun-Ting mentioned this pull request Mar 21, 2023

Add a test for Update and TakeSnapshot methods. #4322

Closed

4 tasks

Yun-Ting mentioned this pull request Mar 28, 2023

added a test for Histogram Update and TakeSnapshot methods #4340

Merged

4 tasks

Yun-Ting mentioned this pull request Oct 5, 2023

add concurrency tests with Coyote to CI #4879

Merged

4 tasks

cijothomas mentioned this pull request Feb 16, 2024

Add stress tests with validation too #5369

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix Histogram synchronization issue #3534

Fix Histogram synchronization issue #3534

utpilla commented Aug 3, 2022 •

edited

Loading

codecov bot commented Aug 3, 2022 •

edited

Loading

alanwest Aug 4, 2022

utpilla Aug 4, 2022

cijothomas Aug 4, 2022

alanwest Aug 4, 2022

CodeBlanch Aug 4, 2022

cijothomas Aug 4, 2022

utpilla commented Aug 4, 2022

noahfalk commented Aug 4, 2022

utpilla commented Aug 5, 2022 •

edited

Loading

CodeBlanch left a comment

Fix Histogram synchronization issue #3534

Fix Histogram synchronization issue #3534

Conversation

utpilla commented Aug 3, 2022 • edited Loading

Changes

Stress Test Results (high contention) with Prometheus Exporter scraping interval = 10s

Stress Test Results (low contention) with Prometheus Exporter scraping interval = 10s

Benchmark Results

codecov bot commented Aug 3, 2022 • edited Loading

Codecov Report

alanwest Aug 4, 2022

Choose a reason for hiding this comment

utpilla Aug 4, 2022

Choose a reason for hiding this comment

cijothomas Aug 4, 2022

Choose a reason for hiding this comment

alanwest Aug 4, 2022

Choose a reason for hiding this comment

CodeBlanch Aug 4, 2022

Choose a reason for hiding this comment

cijothomas Aug 4, 2022

Choose a reason for hiding this comment

utpilla commented Aug 4, 2022

noahfalk commented Aug 4, 2022

utpilla commented Aug 5, 2022 • edited Loading

Performance comparison of the different synchronization techniques

High contention scenario:

Low contention scenario:

CodeBlanch left a comment

Choose a reason for hiding this comment

utpilla commented Aug 3, 2022 •

edited

Loading

codecov bot commented Aug 3, 2022 •

edited

Loading

utpilla commented Aug 5, 2022 •

edited

Loading