Reset exponential aggregator scale after collection #5496
Conversation
* cumulative readers.
 */
@Test
void collectMetrics_ExponentialHistogramScaleResets() {
This test fails if the scale doesn't properly reset.
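For readers without the full diff, here is a self-contained sketch of the behavior that test pins down. The `FakeExponentialAggregator` class, its method names, and the scale-20/160-bucket constants are hypothetical stand-ins chosen to mirror the SDK's defaults, not the SDK's actual API:

```java
import static org.assertj.core.api.Assertions.assertThat;

import org.junit.jupiter.api.Test;

class ScaleResetSketchTest {

  /** Toy stand-in for a pooled exponential histogram handle. */
  static final class FakeExponentialAggregator {
    static final int MAX_SCALE = 20;
    static final int MAX_BUCKETS = 160; // mirrors the SDK's default bucket count
    private int scale = MAX_SCALE;

    void record(double value) {
      // Downscale until the value's bucket index fits in the bucket range.
      while (scale > -10 && Math.abs(bucketIndex(value)) >= MAX_BUCKETS) {
        scale--;
      }
    }

    void collectAndReset() {
      // The behavior under test: a pooled handle starts the next
      // collection window back at maximum resolution.
      scale = MAX_SCALE;
    }

    int getScale() {
      return scale;
    }

    private long bucketIndex(double value) {
      // Base-2 exponential bucketing: index ~= log2(value) * 2^scale.
      return (long) Math.ceil(Math.log(value) / Math.log(2) * Math.pow(2, scale));
    }
  }

  @Test
  void scaleResetsAfterCollection() {
    FakeExponentialAggregator aggregator = new FakeExponentialAggregator();

    // A wide dynamic range forces the aggregator to downscale.
    aggregator.record(0.001);
    aggregator.record(1_000_000d);
    assertThat(aggregator.getScale()).isLessThan(FakeExponentialAggregator.MAX_SCALE);

    // Collection hands the point off and resets the pooled handle.
    aggregator.collectAndReset();
    assertThat(aggregator.getScale()).isEqualTo(FakeExponentialAggregator.MAX_SCALE);
  }
}
```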
Codecov Report

Patch coverage:

Additional details and impacted files

@@             Coverage Diff              @@
##               main    #5496      +/-   ##
=============================================
  Coverage     91.31%   91.32%
+ Complexity     4891     4890        -1
=============================================
  Files           547      547
  Lines         14410    14414        +4
  Branches       1354     1354
=============================================
+ Hits          13159    13164        +5
+ Misses          867      866        -1
  Partials        384      384

☔ View full report in Codecov by Sentry.
Do we think this is the right behavior? Should we instead "degrade" the scale slightly rather than doing a full reset? That is, should we expect the scale to stabilize over time (for maximum efficiency during collection) rather than resetting it? Rescaling isn't free; it's not too expensive, but it does add overhead. I think this makes sense as a "simple" behavior, but there might be a slightly more involved change that helps the histograms stabilize their dynamic range over time.
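For concreteness, one shape the "degrade" alternative might take is recovering resolution one notch per collection instead of jumping straight back to the maximum. A hypothetical sketch — `CreepingScalePolicy` and all its names are invented for illustration, not code from this PR:

```java
/** Hypothetical gentler reset policy: creep back toward max scale. */
final class CreepingScalePolicy {
  static final int MAX_SCALE = 20;
  private int scale = MAX_SCALE;

  /** Called when a recorded value falls outside the current bucket range. */
  void downscale(int by) {
    scale -= by;
  }

  /**
   * Called at each delta collection. Instead of a hard reset to MAX_SCALE,
   * recover one notch of resolution per cycle, so a series whose range is
   * genuinely wide settles near its natural scale after a few collections
   * instead of re-downscaling from scratch every window.
   */
  void resetForNextWindow() {
    if (scale < MAX_SCALE) {
      scale++;
    }
  }

  int scale() {
    return scale;
  }
}
```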
You can make the argument that the scale for a particular time series would stabilize over time, but we pool the aggregator objects at the instrument level. So from one collection to the next, a series is assigned an essentially random aggregator from the pool, and with it a scale inherited from some other series whose measurements may look nothing like the new series'. This is clearly not correct and needs to be fixed.

.NET behaves a little differently than we do. They ensure that the same aggregator instance is assigned to the same series across collections, and they don't reset the scale factor of exponential histogram aggregators, on the premise that the scale of a series should stabilize over time. As a consequence, though, they can only have 2000 (or whatever the configured cardinality limit is) distinct series over the lifetime of the application. In contrast, we allow 2000 distinct series per collection, but the series can differ from collection to collection.

If we went in the .NET direction, we could ensure the same aggregator is used for the same series across collections and stop resetting the scale, instead counting on it stabilizing over time. However, we'd want to do that in a separate PR because it's a behavioral change with pros and cons that need to be considered independently.
Just chiming in with a bit more detail from .NET... Our plan is to eventually introduce reclaiming of aggregator objects when using delta temporality. Just like with this PR, the scale of a histogram will likely be reset once an aggregator has been reclaimed. One other subtle difference with our eventual implementation, though, is that we will only reclaim an aggregator once it has gone "stale". Stale for us will mean that the series has not received any updates within the last collection cycle.
I am not sure if it is too late or maybe not applicable here. While working to implement support for exponential histograms in Micrometer, I found that resetting the exponential histogram every interval makes the rescaling costly. In the ideal case the scale stays constant throughout the app's lifecycle, and when it does drop due to a resource crunch or some other reason, it is usually only for an intermittent period. So I ended up implementing something like this: https://github.com/micrometer-metrics/micrometer/blob/c47936068b163c81c4872da05d26f2eeb71133f2/implementations/micrometer-registry-otlp/src/main/java/io/micrometer/registry/otlp/internal/Base2ExponentialHistogram.java#L211-L234 which reduced allocations by a large margin. The trade-off is that one interval's worth of data might be recorded at a lower scale than necessary.
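A loose paraphrase of that idea for readers who don't follow the link: keep the current scale across interval rotations, track how much of the bucket index range the last interval actually used, and upscale at rotation only as far as that usage allows. The sketch below is an illustration under those assumptions, not a copy of the Micrometer code:

```java
/** Illustrative lazy rescale at interval rotation (names invented here). */
final class LazyRescaleOnRotation {
  static final int MAX_SCALE = 20;
  static final int MAX_BUCKETS = 160;
  private int scale = MAX_SCALE;
  private long widestIndexSeen; // widest |bucket index| used this interval

  void record(double value) {
    long index = bucketIndex(value);
    while (scale > -10 && Math.abs(index) >= MAX_BUCKETS) {
      scale--;              // downscale in place when a value overflows
      widestIndexSeen /= 2; // indices of already-seen values halve too
      index = bucketIndex(value);
    }
    widestIndexSeen = Math.max(widestIndexSeen, Math.abs(index));
  }

  /** At rotation, upscale only as far as the last interval's data allows. */
  void rotate() {
    while (scale < MAX_SCALE && widestIndexSeen * 2 < MAX_BUCKETS) {
      scale++;
      widestIndexSeen *= 2; // each +1 of scale doubles every bucket index
    }
    widestIndexSeen = 0; // an empty interval thus allows full recovery
  }

  int scale() {
    return scale;
  }

  private long bucketIndex(double value) {
    return (long) Math.ceil(Math.log(value) / Math.log(2) * Math.pow(2, scale));
  }
}
```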
@alanwest asked me whether the Java exponential histogram aggregation resets the scale after collection when aggregation temporality is delta. I didn't know, and looking at the code I realized that the current behavior in the Java SDK has a bug: we do not reset the scale after collection when aggregation temporality is delta. This is a problem because we reuse `DoubleBase2ExponentialHistogramAggregator` instances to reduce allocations. An aggregator used for series `key=value1` could receive measurements that cause it to be rescaled; a collection then moves that aggregator to the shared object pool. In the next collection window, that aggregator is unlikely to be assigned to the same series `key=value1`, and thus might carry a suboptimal scale. By suboptimal, I mean that a higher scale (i.e. more granular buckets) could have accommodated the range of the new series' measurements.

This PR fixes that by resetting the scale of pooled aggregator instances after collection when aggregation temporality is delta.
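Conceptually, then, the fix amounts to clearing the scale along with the counts whenever a pooled handle is reset at collection time. A minimal sketch of that idea — the `PooledExponentialHandle` class below is illustrative only and does not reflect the SDK's actual `DoubleBase2ExponentialHistogramAggregator` internals:

```java
/** Illustrative pooled handle; not the SDK's actual internals. */
final class PooledExponentialHandle {
  static final int MAX_SCALE = 20;
  private int scale = MAX_SCALE;
  private long count;

  void record(double value) {
    // ...bucket the value, downscaling `scale` if it falls out of range...
    count++;
  }

  /**
   * Delta temporality: emit the window's point, then reset the handle
   * before it goes back to the shared pool.
   */
  ExponentialPoint collectThenReset() {
    ExponentialPoint point = new ExponentialPoint(scale, count);
    count = 0;
    // Before this PR the scale carried over, so the next (likely different)
    // series served by this handle inherited a possibly too-coarse scale.
    scale = MAX_SCALE;
    return point;
  }

  record ExponentialPoint(int scale, long count) {}
}
```

The key line is the `scale = MAX_SCALE` assignment in `collectThenReset`; everything else about the handle's lifecycle is unchanged.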