[OTEL-2348] Improve DDSketch to Sketch conversion #468
What does this PR do?
This PR modifies the algorithm used to round bin counts to integers as part of the conversion from DDSketch to Sketch (pkg/quantile/ddsketch.go).

Motivation
Some test cases in TestConvertDDSketchIntoSketch were failing on main, but only when running locally on an ARM device. This test computes the percentiles of a distribution before and after the DDSketch to Sketch conversion, and checks that the relative error introduced is below some bound. The failures turned out to be because the rounding algorithm used as part of the conversion is somewhat numerically unstable.

The algorithm rounds bin counts down to an integer, but keeps track of the accumulated rounding error, and adds 1 extra unit to the current bin once it reaches 1.0. Unfortunately, because of floating point imprecision, this accumulated error can end up being 0.99999 in cases where it should be 1.0, delaying the addition of the extra unit until the next bin with a non-integer count, which can be far away from the logical range the extra unit belongs to. Depending on the distribution, this transfer of a single unit between distant bins can end up significantly changing the quantiles.
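To make the failure mode concrete, here is a minimal sketch of a floor-and-carry rounding scheme of the kind described above. It is an illustration under assumptions, not the code from pkg/quantile/ddsketch.go: the function name roundWithCarry and the sample counts are hypothetical.

```go
package main

import "math"

// roundWithCarry is a hypothetical illustration of the original scheme:
// floor each count, accumulate the fractional remainder, and add one
// extra unit to the current bin once the accumulated error reaches 1.0.
func roundWithCarry(counts []float64) []int {
	out := make([]int, len(counts))
	carry := 0.0
	for i, c := range counts {
		floor := math.Floor(c)
		out[i] = int(floor)
		carry += c - floor
		if carry >= 1.0 {
			// Floating-point error can leave carry just below 1.0,
			// so this branch may fire many bins later than intended.
			out[i]++
			carry -= 1.0
		}
	}
	return out
}
```

With ten bins of 0.1 followed by bins of 5, 5 and 0.25, the ten fractional parts sum to slightly less than 1.0 in float64 arithmetic. The extra unit is therefore not added to the tenth bin. The integer bins leave the carry unchanged, so the unit only appears in the final 0.25 bin, on the far side of the two large bins.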
The test already skipped the comparison for the P99 of certain distributions because of this issue. However, because ARM CPUs differ slightly from x86 in floating-point behavior, the issue was also triggered there in cases that were not already skipped.
To solve this issue at the source, I modified the rounding algorithm so that it inserts the extra units determined from accumulated rounding error when said error reaches 0.5, instead of 1.0. This is done differently from the original algorithm, by tracking the sum of the input float counts and the output integer counts, and using math.Round to compute the difference. This adds the extra unit somewhere in the middle of its logical range of bins, instead of at or past the end, and should not present the same numerical instability, at least in the common case where the input DDSketch has integer counts.

Changes to tests
Switching to this algorithm fixes the tests that were originally failing on ARM, and also allows us to remove the exceptions that were previously made in the test.
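For reference, the revised rounding can be sketched roughly as follows. This is a simplified illustration under assumptions, not the exact code in this PR: the function name roundByRunningTotals is hypothetical, and the sketch assigns each bin the difference between the rounded running float total and the running integer total.

```go
package main

import "math"

// roundByRunningTotals is a hypothetical illustration of the revised scheme:
// keep the running sum of the input float counts and of the output integer
// counts, and give each bin whatever is needed to bring the integer total
// up to the rounded float total.
func roundByRunningTotals(counts []float64) []int {
	out := make([]int, len(counts))
	floatTotal := 0.0
	intTotal := 0
	for i, c := range counts {
		floatTotal += c
		// math.Round rounds half away from zero, so the extra unit is
		// emitted once the accumulated error reaches about 0.5.
		n := int(math.Round(floatTotal)) - intTotal
		out[i] = n
		intTotal += n
	}
	return out
}
```

For ten bins of 0.1 followed by bins of 5, 5 and 0.25, this sketch places the single extra unit near the middle of the ten fractional bins and leaves the trailing 0.25 bin at zero. A running total that lands slightly below 1.0 still rounds to 1, so the small floating-point error no longer moves the unit to a distant bin.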
Because this PR changes the exact output of the algorithm in a lot of cases, I updated the test files for TestExponentialHistogramTranslatorOptions, which expects the output bin counts to have very precise values.

I also increased the fineness of the distribution generated in TestKnownDistributionsQuantile to let the test pass, and corrected the name of a few test cases.

I also replaced the formula for the expected error bound in TestConvertDDSketchIntoSketch. The current one was copy-pasted from TestCreateDDSketchWithSketchMapping, even though they compute very different things. I added a comment explaining my reasoning above the new calculation. The old bound was ~0.0233 and the new one is ~0.0315, so it is a bit more permissive, but stays within the 5% relative error expected from the input OTel histograms, as tested by the TestKnownDistributionsQuantile end-to-end test.