
Reset exponential aggregator scale after collection #5496

Merged · 1 commit · Jun 8, 2023

Conversation

jack-berg (Member)

@alanwest asked me whether the java exponential histogram aggregation resets the scale after collection when aggregation temporality is delta. I didn't know, and looking at the code realized that the current behavior in the java sdk has a bug: we do not reset the scale after collection when aggregation temporality is delta. This is a problem because we reuse the DoubleBase2ExponentialHistogramAggregator instances to reduce allocations. So an aggregator used for series key=value1 could receive measurements that cause it to be rescaled, then a collection moves that aggregator to the shared object pool. In the next collection window, that aggregator is unlikely to be assigned to the same series key=value1, and thus might have a suboptimal scale assigned. By suboptimal, I mean that a higher scale (i.e. more granular buckets) could accommodate the range of measurements.

This PR fixes that by resetting the scale of the pooled aggregator instances after collection when aggregation temporality is delta.
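As a rough illustration of the fix (hypothetical names and simplified state, not the SDK's actual classes), a pooled delta aggregator can reset its scale back to the starting maximum in the same step that clears its accumulated counts:

```java
// Sketch of a pooled exponential-histogram aggregator for delta
// temporality. The real fit/rescale logic is elided; the caller tells
// record() whether the value fits the current bucket range.
class PooledExpHistogramAggregator {
  static final int MAX_SCALE = 20; // starting (most granular) scale

  private int scale = MAX_SCALE;
  private long count = 0;
  private double sum = 0;

  // Record a measurement, downscaling when the value does not fit.
  void record(double value, boolean fitsCurrentRange) {
    if (!fitsCurrentRange) {
      scale--; // coarser buckets cover a wider range
    }
    count++;
    sum += value;
  }

  // Delta collection: emit and clear the accumulated state. Resetting
  // the scale here is the point of the fix; without it, the next
  // series to borrow this pooled instance would inherit the previous
  // series' (possibly too coarse) scale.
  long collectAndReset() {
    long collected = count;
    count = 0;
    sum = 0;
    scale = MAX_SCALE;
    return collected;
  }

  int scale() {
    return scale;
  }
}
```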

@jack-berg jack-berg requested a review from a team May 31, 2023 22:19
* cumulative readers.
*/
@Test
void collectMetrics_ExponentialHistogramScaleResets() {
jack-berg (Member, Author)

This test fails if the scale doesn't properly reset.


codecov bot commented May 31, 2023

Codecov Report

Patch coverage: 100.00% and no project coverage change.

Comparison is base (b14bed8) 91.31% compared to head (9cc13aa) 91.32%.

Additional details and impacted files
@@            Coverage Diff            @@
##               main    #5496   +/-   ##
=========================================
  Coverage     91.31%   91.32%           
+ Complexity     4891     4890    -1     
=========================================
  Files           547      547           
  Lines         14410    14414    +4     
  Branches       1354     1354           
=========================================
+ Hits          13159    13164    +5     
+ Misses          867      866    -1     
  Partials        384      384           
Impacted Files Coverage Δ
...tor/DoubleBase2ExponentialHistogramAggregator.java 98.66% <100.00%> (+0.03%) ⬆️
...egator/DoubleBase2ExponentialHistogramBuckets.java 85.55% <100.00%> (+0.32%) ⬆️

... and 2 files with indirect coverage changes


jsuereth (Contributor) commented Jun 5, 2023

Do we think this is the right behavior?

Should we instead "degrade" scale slightly vs. a full reset? I.e. should we expect scale to stabilize over time (for maximum efficiency during collection) vs. resetting?

Rescaling isn't "free": it isn't too expensive, but it does add overhead.

I think this makes sense as a "simple" behavior, but there might be a slightly less complicated change that helps the histograms stabilize their dynamic range over time.

jack-berg (Member, Author)

> Should we instead "degrade" scale slightly vs. a full reset? I.e. should we expect scale to stabilize over time (for maximum efficiency during collection) vs. resetting?

You can make the argument that the scale for a particular time series would stabilize over time, but we pool the aggregator objects at the instrument level. So from one collection to another a series will be assigned a random aggregator, and as a result may inherit a scale from a series whose measurement range is nothing like its own. This is clearly not correct and needs to be fixed.

.NET behaves a little differently than we do. They ensure that the same aggregator instance is assigned to the same series across collections. They don't reset the scale factor of exponential histogram aggregators, on the premise that the scale of a series should stabilize over time. But as a consequence, they can only have 2000 (or whatever the configured cardinality limit is) distinct series over the lifetime of the application. In contrast, we have 2000 distinct series per collection, but the series can differ from collection to collection.

If we went in the .NET direction, we could ensure the same aggregator is used for the same series across collections and stop resetting the scale, instead counting on it stabilizing over time. However, we'd want to do this in a separate PR because it's a behavioral change with pros and cons that need to be considered independently.
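The two assignment strategies under discussion could be sketched like this (all names hypothetical; the aggregator state itself is elided as a plain Object):

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;

// Sketch of the trade-off: pin one aggregator per series for the
// process lifetime (.NET-style, capped at the cardinality limit), vs.
// return aggregators to a shared pool each collection (java-style),
// where a series may get a different instance every cycle.
class AggregatorAssignment {
  static final int CARDINALITY_LIMIT = 2000;

  // Pinned: same instance for the same series forever; once the limit
  // is reached, brand-new series get nothing (dropped or overflowed).
  private final Map<String, Object> pinned = new HashMap<>();

  Object pinnedFor(String seriesKey) {
    if (pinned.size() >= CARDINALITY_LIMIT && !pinned.containsKey(seriesKey)) {
      return null; // over the lifetime cardinality limit
    }
    return pinned.computeIfAbsent(seriesKey, k -> new Object());
  }

  // Pooled: instances are recycled after each collection, so mutable
  // state such as the scale must be reset before release.
  private final ArrayDeque<Object> pool = new ArrayDeque<>();

  Object borrow() {
    Object agg = pool.poll();
    return agg != null ? agg : new Object();
  }

  void release(Object agg) {
    pool.push(agg);
  }
}
```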

alanwest (Member) commented Jun 5, 2023

Just chiming in with a bit more detail from .NET...

Our plan is to eventually introduce the reclaiming of aggregator objects when using delta temporality. Just like with this PR, the scale of a histogram will likely be reset once an aggregator has been reclaimed. Though, one other subtle difference with our eventual implementation is that we will only reclaim an aggregator once it has gone "stale". Stale for us will mean that the series has not received any updates within the last collection cycle.
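A minimal sketch of the staleness rule described above (hypothetical names, not .NET's implementation): an entry is reclaimed only if its series received no updates during the last collection cycle.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

// Track the last collection cycle in which each series was updated,
// and reclaim only series that went a full cycle without updates.
class StaleReclaimer {
  private final Map<String, Long> lastUpdateCycle = new HashMap<>();
  private long currentCycle = 0;

  void recordUpdate(String seriesKey) {
    lastUpdateCycle.put(seriesKey, currentCycle);
  }

  // At collection time: series not updated during this cycle are
  // "stale" and their aggregators can return to the pool (where the
  // scale would be reset, as in this PR). Returns how many were
  // reclaimed.
  int collectAndReclaim() {
    int reclaimed = 0;
    for (Iterator<Map.Entry<String, Long>> it =
             lastUpdateCycle.entrySet().iterator(); it.hasNext(); ) {
      if (it.next().getValue() < currentCycle) {
        it.remove();
        reclaimed++;
      }
    }
    currentCycle++;
    return reclaimed;
  }
}
```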

@jack-berg jack-berg added this to the v1.27.0 milestone Jun 8, 2023
@jack-berg jack-berg merged commit 951221e into open-telemetry:main Jun 8, 2023
@lenin-jaganathan

I am not sure if it is too late or maybe not applicable here. While working to implement support for exponential histograms in Micrometer, I found that resetting the exponential histogram every interval makes rescaling costly. In the ideal case the scale stays constant throughout the app's lifecycle; when it does go down (due to a resource crunch or some other reason), it is usually only for an intermittent period.

So I ended up implementing something like this: https://github.com/micrometer-metrics/micrometer/blob/c47936068b163c81c4872da05d26f2eeb71133f2/implementations/micrometer-registry-otlp/src/main/java/io/micrometer/registry/otlp/internal/Base2ExponentialHistogram.java#L211-L234, which reduced the allocations by a large margin. The trade-off is that one minute's worth of data might use a lower scale.
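A hedged sketch of that trade-off (not Micrometer's actual code): keep the current scale across intervals and only raise it when the previous interval's data shows a more granular scale would have sufficed, rather than resetting to the maximum every cycle.

```java
// Lazy upscaling: the scale survives across intervals and is raised
// only when the last interval could have been represented at a higher
// (more granular) scale. A downscale forced by an outlier therefore
// costs at most one interval of reduced granularity.
class LazyRescale {
  static final int MAX_SCALE = 20;
  private int scale = 10; // suppose an earlier outlier pushed us down

  // bestScaleForInterval: the highest scale that would still have
  // covered the interval's observed value range (computed elsewhere).
  void endInterval(int bestScaleForInterval) {
    if (bestScaleForInterval > scale) {
      scale = Math.min(bestScaleForInterval, MAX_SCALE);
    }
    // If bestScaleForInterval <= scale, keep the coarser scale: that
    // interval's data is less granular, but no buckets are reallocated.
  }

  int scale() {
    return scale;
  }
}
```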

6 participants