Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve error handling when translating classic histograms to distribution (e.g. parsing bucket "le" label edge cases) #982

Open
bwplotka opened this issue May 20, 2024 · 3 comments
Assignees
Labels
bug Something isn't working help wanted Extra attention is needed

Comments

@bwplotka
Copy link
Collaborator

We got one Cx getting error from the CMP: One or more TimeSeries could not be written: Field timeSeries[0].points[0].distributionValue had an invalid value: Distribution |explicit_buckets.bounds| entry 1 has a value of 0.005 which is less than the value of entry 0 which is 0.005. This indicates that either:

A. Client exposes classic histogram series with two series for the same le value. This is impossible as duplicated series will be handled by scraping logic.
B. Our conversion from string le labels to float buckets ended up as the same 0.005 buckets. This can totally happen on extreme bucket le values like "0.00500000000000000006" and "0.005" in subsequent classic series. See repro. Furthermore check does not validate for "0" diffs, only negative diffs on it.
C. Rounding error no CMP side. Unlikely as CMP accepts double value as well.

Action Items

  • diff betwen buckets can't be zero too, not only negative.
  • We catch that problem when we parse to float to see if it's scraping/client issue or parsing to float precision issue.
@bwplotka bwplotka added bug Something isn't working help wanted Extra attention is needed labels May 20, 2024
@TheSpiritXIII
Copy link
Member

TheSpiritXIII commented May 23, 2024

I tested the equivalent in the standard math/big library and still can't get it to round up:

big.ParseFloat("0.00500000000000000006", 10, 53, big.ToNearestAway) // 0.005

That value cannot be expressed as a float64 (as most values really can't).

What's fascinating is that the Golang client library does not validate the buckets, and other clients probably don't either. The restrictions are by contract in the options field comments. I wonder if these buckets are being created programmatically because I would be shocked if someone created buckets with that type of precision manually!

Further, how does Prometheus itself deal with these? If Prometheus simply stores these as string-typed labels, you would still be able to query from PromQL.

The Golang client uses [sort.SearchFloat64s](https://pkg.go.dev/sort#SearchFloat64s) to pick the bucket, which is just binary search. This means that whether it goes into "0.00500000000000000006" or "0.005" will be random for different bucket sets but always consistent, meaning that one of these will be used and one will not! We may be able to take advantage of the fact that one is never used when sending data to CMP and just ignore the unused one.

But, easier said than done.

Or maybe a simpler solution since the user can't send this metric anyway is to just ignore it as invalid and log a warning.

@csapuntz
Copy link

A big usability improvement would be know which metric is causing the problem.

@csapuntz
Copy link

FWIW, my org's software had a bug where it was emitting the same metrics twice and it triggered these errors. Our collectors in GKE are using prometheus:v2.45.3-gmp.5-gke.0 (sha256:aa631b2f63f54f350887000fb9f1fc2cf1a4820652140e458751f27858bdf081)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working help wanted Extra attention is needed
Projects
None yet
Development

No branches or pull requests

4 participants