Improve error handling when translating classic histograms to distribution (e.g. parsing bucket "le" label edge cases) #982

bwplotka · 2024-05-20T15:43:55Z

We got one Cx getting error from the CMP: One or more TimeSeries could not be written: Field timeSeries[0].points[0].distributionValue had an invalid value: Distribution |explicit_buckets.bounds| entry 1 has a value of 0.005 which is less than the value of entry 0 which is 0.005. This indicates that either:

A. Client exposes classic histogram series with two series for the same le value. This is impossible as duplicated series will be handled by scraping logic.
B. Our conversion from string le labels to float buckets ended up as the same 0.005 buckets. This can totally happen on extreme bucket le values like "0.00500000000000000006" and "0.005" in subsequent classic series. See repro. Furthermore check does not validate for "0" diffs, only negative diffs on it.
C. Rounding error no CMP side. Unlikely as CMP accepts double value as well.

Action Items

prometheus-engine/pkg/export/transform.go

Line 317 in 5f75ba3

if val < 0 {

diff betwen buckets can't be zero too, not only negative.
We catch that problem when we parse to float to see if it's scraping/client issue or parsing to float precision issue.

The text was updated successfully, but these errors were encountered:

TheSpiritXIII · 2024-05-23T18:56:21Z

I tested the equivalent in the standard math/big library and still can't get it to round up:

big.ParseFloat("0.00500000000000000006", 10, 53, big.ToNearestAway) // 0.005

That value cannot be expressed as a float64 (as most values really can't).

What's fascinating is that the Golang client library does not validate the buckets, and other clients probably don't either. The restrictions are by contract in the options field comments. I wonder if these buckets are being created programmatically because I would be shocked if someone created buckets with that type of precision manually!

Further, how does Prometheus itself deal with these? If Prometheus simply stores these as string-typed labels, you would still be able to query from PromQL.

The Golang client uses [sort.SearchFloat64s](https://pkg.go.dev/sort#SearchFloat64s) to pick the bucket, which is just binary search. This means that whether it goes into "0.00500000000000000006" or "0.005" will be random for different bucket sets but always consistent, meaning that one of these will be used and one will not! We may be able to take advantage of the fact that one is never used when sending data to CMP and just ignore the unused one.

But, easier said than done.

Or maybe a simpler solution since the user can't send this metric anyway is to just ignore it as invalid and log a warning.

csapuntz · 2024-07-20T04:44:39Z

A big usability improvement would be know which metric is causing the problem.

csapuntz · 2024-07-25T17:46:08Z

FWIW, my org's software had a bug where it was emitting the same metrics twice and it triggered these errors. Our collectors in GKE are using prometheus:v2.45.3-gmp.5-gke.0 (sha256:aa631b2f63f54f350887000fb9f1fc2cf1a4820652140e458751f27858bdf081)

bwplotka added bug Something isn't working help wanted Extra attention is needed labels May 20, 2024

github-actions bot assigned TheSpiritXIII May 20, 2024

TheSpiritXIII assigned pintohutch and unassigned TheSpiritXIII Jul 11, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve error handling when translating classic histograms to distribution (e.g. parsing bucket "le" label edge cases) #982

Improve error handling when translating classic histograms to distribution (e.g. parsing bucket "le" label edge cases) #982

bwplotka commented May 20, 2024

TheSpiritXIII commented May 23, 2024 •

edited

Loading

csapuntz commented Jul 20, 2024

csapuntz commented Jul 25, 2024

Improve error handling when translating classic histograms to distribution (e.g. parsing bucket "le" label edge cases) #982

Improve error handling when translating classic histograms to distribution (e.g. parsing bucket "le" label edge cases) #982

Comments

bwplotka commented May 20, 2024

Action Items

TheSpiritXIII commented May 23, 2024 • edited Loading

csapuntz commented Jul 20, 2024

csapuntz commented Jul 25, 2024

TheSpiritXIII commented May 23, 2024 •

edited

Loading