
Reset exponential aggregator scale after collection #5496

Merged · 1 commit · Jun 8, 2023

Conversation

jack-berg (Member)

@alanwest asked me whether the java exponential histogram aggregation resets the scale after collection when aggregation temporality is delta. I didn't know, and looking at the code realized that the current behavior in the java sdk has a bug: we do not reset the scale after collection when aggregation temporality is delta. This is a problem because we reuse the DoubleBase2ExponentialHistogramAggregator instances to reduce allocations. So an aggregator used for series key=value1 could receive measurements that cause it to be rescaled, then a collection moves that aggregator to the shared object pool. In the next collection window, that aggregator is unlikely to be assigned to the same series key=value1, and thus might have a suboptimal scale assigned. By suboptimal, I mean that a higher scale (i.e. more granular buckets) could accommodate the range of measurements.

This PR fixes that by resetting the scale of the pooled aggregator instances after collection when aggregation temporality is delta.
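As a rough illustration of the fix (hypothetical names and simplified state, not the SDK's actual classes), a pooled delta aggregator can reset its scale back to the starting maximum in the same step that clears its accumulated counts:

```java
// Sketch of a pooled exponential-histogram aggregator for delta
// temporality. The real fit/rescale logic is elided; the caller tells
// record() whether the value fits the current bucket range.
class PooledExpHistogramAggregator {
  static final int MAX_SCALE = 20; // starting (most granular) scale

  private int scale = MAX_SCALE;
  private long count = 0;
  private double sum = 0;

  // Record a measurement, downscaling when the value does not fit.
  void record(double value, boolean fitsCurrentRange) {
    if (!fitsCurrentRange) {
      scale--; // coarser buckets cover a wider range
    }
    count++;
    sum += value;
  }

  // Delta collection: emit and clear the accumulated state. Resetting
  // the scale here is the point of the fix; without it, the next
  // series to borrow this pooled instance would inherit the previous
  // series' (possibly too coarse) scale.
  long collectAndReset() {
    long collected = count;
    count = 0;
    sum = 0;
    scale = MAX_SCALE;
    return collected;
  }

  int scale() {
    return scale;
  }
}
```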

@jack-berg jack-berg requested a review from a team May 31, 2023 22:19
* cumulative readers.
*/
@Test
void collectMetrics_ExponentialHistogramScaleResets() {
jack-berg (Member, Author)

This test fails if the scale doesn't properly reset.


codecov bot commented May 31, 2023

Codecov Report

Patch coverage: 100.00% and no project coverage change.

Comparison is base (b14bed8) 91.31% compared to head (9cc13aa) 91.32%.

Additional details and impacted files
@@            Coverage Diff            @@
##               main    #5496   +/-   ##
=========================================
  Coverage     91.31%   91.32%           
+ Complexity     4891     4890    -1     
=========================================
  Files           547      547           
  Lines         14410    14414    +4     
  Branches       1354     1354           
=========================================
+ Hits          13159    13164    +5     
+ Misses          867      866    -1     
  Partials        384      384           
Impacted Files Coverage Δ
...tor/DoubleBase2ExponentialHistogramAggregator.java 98.66% <100.00%> (+0.03%) ⬆️
...egator/DoubleBase2ExponentialHistogramBuckets.java 85.55% <100.00%> (+0.32%) ⬆️

... and 2 files with indirect coverage changes


jsuereth (Contributor) commented Jun 5, 2023

Do we think this is the right behavior?

Should we instead "degrade" scale slightly vs. a full reset? I.e. should we expect scale to stabilize over time (for maximum efficiency during collection) vs. resetting?

Rescaling isn't "free": it isn't too expensive, but it does add overhead.

I think this makes sense as a "simple" behavior, but there might be a slightly less complicated change that helps the histograms stabilize their dynamic range over time.

jack-berg (Member, Author)

> Should we instead "degrade" scale slightly vs. a full reset? I.e. should we expect scale to stabilize over time (for maximum efficiency during collection) vs. resetting?

You can make the argument that the scale for a particular time series would stabilize over time, but we pool the aggregator objects at the instrument level. So from one collection to another a series will be assigned a random aggregator, and as a result may inherit a scale from a series whose measurement range is nothing like its own. This is clearly not correct and needs to be fixed.

.NET behaves a little differently than we do. They ensure that the same aggregator instance is assigned to the same series across collections. They don't reset the scale factor of exponential histogram aggregators, on the premise that the scale of a series should stabilize over time. But as a consequence, they can only have 2000 (or whatever the configured cardinality limit is) distinct series over the lifetime of the application. In contrast, we have 2000 distinct series per collection, but the series can differ from collection to collection.

If we went in the .NET direction, we could ensure the same aggregator is used for the same series across collections and stop resetting the scale, instead counting on it stabilizing over time. However, we'd want to do this in a separate PR because it's a behavioral change with pros and cons that need to be considered independently.
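The two assignment strategies under discussion could be sketched like this (all names hypothetical; the aggregator state itself is elided as a plain Object):

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;

// Sketch of the trade-off: pin one aggregator per series for the
// process lifetime (.NET-style, capped at the cardinality limit), vs.
// return aggregators to a shared pool each collection (java-style),
// where a series may get a different instance every cycle.
class AggregatorAssignment {
  static final int CARDINALITY_LIMIT = 2000;

  // Pinned: same instance for the same series forever; once the limit
  // is reached, brand-new series get nothing (dropped or overflowed).
  private final Map<String, Object> pinned = new HashMap<>();

  Object pinnedFor(String seriesKey) {
    if (pinned.size() >= CARDINALITY_LIMIT && !pinned.containsKey(seriesKey)) {
      return null; // over the lifetime cardinality limit
    }
    return pinned.computeIfAbsent(seriesKey, k -> new Object());
  }

  // Pooled: instances are recycled after each collection, so mutable
  // state such as the scale must be reset before release.
  private final ArrayDeque<Object> pool = new ArrayDeque<>();

  Object borrow() {
    Object agg = pool.poll();
    return agg != null ? agg : new Object();
  }

  void release(Object agg) {
    pool.push(agg);
  }
}
```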

alanwest (Member) commented Jun 5, 2023

Just chiming in with a bit more detail from .NET...

Our plan is to eventually introduce the reclaiming of aggregator objects when using delta temporality. Just like with this PR, the scale of a histogram will likely be reset once an aggregator has been reclaimed. Though, one other subtle difference with our eventual implementation is that we will only reclaim an aggregator once it has gone "stale". Stale for us will mean that the series has not received any updates within the last collection cycle.
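A minimal sketch of the staleness rule described above (hypothetical names, not .NET's implementation): an entry is reclaimed only if its series received no updates during the last collection cycle.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

// Track the last collection cycle in which each series was updated,
// and reclaim only series that went a full cycle without updates.
class StaleReclaimer {
  private final Map<String, Long> lastUpdateCycle = new HashMap<>();
  private long currentCycle = 0;

  void recordUpdate(String seriesKey) {
    lastUpdateCycle.put(seriesKey, currentCycle);
  }

  // At collection time: series not updated during this cycle are
  // "stale" and their aggregators can return to the pool (where the
  // scale would be reset, as in this PR). Returns how many were
  // reclaimed.
  int collectAndReclaim() {
    int reclaimed = 0;
    for (Iterator<Map.Entry<String, Long>> it =
             lastUpdateCycle.entrySet().iterator(); it.hasNext(); ) {
      if (it.next().getValue() < currentCycle) {
        it.remove();
        reclaimed++;
      }
    }
    currentCycle++;
    return reclaimed;
  }
}
```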

@jack-berg jack-berg added this to the v1.27.0 milestone Jun 8, 2023
@jack-berg jack-berg merged commit 951221e into open-telemetry:main Jun 8, 2023
@lenin-jaganathan

I am not sure if it is too late or maybe not applicable here. While working to implement support for exponential histograms in Micrometer, I found that resetting the exponential histogram every interval makes rescaling costly. In the ideal case the scale stays constant throughout the app's lifecycle; when it does go down (due to a resource crunch or some other reason), it is usually only for an intermittent period.

So I ended up implementing something like this: https://github.com/micrometer-metrics/micrometer/blob/c47936068b163c81c4872da05d26f2eeb71133f2/implementations/micrometer-registry-otlp/src/main/java/io/micrometer/registry/otlp/internal/Base2ExponentialHistogram.java#L211-L234, which reduced the allocations by a large margin. The trade-off is that one minute's worth of data might use a lower scale.
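A hedged sketch of that trade-off (not Micrometer's actual code): keep the current scale across intervals and only raise it when the previous interval's data shows a more granular scale would have sufficed, rather than resetting to the maximum every cycle.

```java
// Lazy upscaling: the scale survives across intervals and is raised
// only when the last interval could have been represented at a higher
// (more granular) scale. A downscale forced by an outlier therefore
// costs at most one interval of reduced granularity.
class LazyRescale {
  static final int MAX_SCALE = 20;
  private int scale = 10; // suppose an earlier outlier pushed us down

  // bestScaleForInterval: the highest scale that would still have
  // covered the interval's observed value range (computed elsewhere).
  void endInterval(int bestScaleForInterval) {
    if (bestScaleForInterval > scale) {
      scale = Math.min(bestScaleForInterval, MAX_SCALE);
    }
    // If bestScaleForInterval <= scale, keep the coarser scale: that
    // interval's data is less granular, but no buckets are reallocated.
  }

  int scale() {
    return scale;
  }
}
```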

6 participants