
Specify optional Exponential Histogram Aggregation, add example code in the data model #2252

Merged · 44 commits · May 13, 2022

Conversation

@jmacd (Contributor) commented on Jan 7, 2022

Part of #1935.

This protocol was released in OTLP v0.11.

Related OTEP 149.

Changes

In open-telemetry/opentelemetry-collector#4642 I introduced temporary support for printing the exponential histogram data point. I used equations added to the data model as examples here.

This document is meant to justify merging the reference implementation shown in this (unfortunately LARGE) PR: open-telemetry/opentelemetry-go#2393

I've proposed to split it into two parts: open-telemetry/opentelemetry-go#2501. The implementation is described in further detail in its README.

The specification changes in this PR describe the critical aspects of this reference implementation in terms of the new Aggregation's two configuration parameters, MaxSize and RangeLimits (optional), and two requirements for its behavior.
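For a concrete picture of those two parameters, here is a minimal sketch of what the configuration surface could look like. This is illustrative only; the names and types are assumptions, not the API proposed in this PR or implemented in opentelemetry-go.

```go
package main

import "fmt"

// RangeLimits is the optional parameter: when present, it fixes the
// histogram's range, which together with MaxSize fixes the scale.
// (Hypothetical shape, not the real API.)
type RangeLimits struct {
	Min, Max float64
}

// ExponentialHistogramConfig sketches the two configuration parameters
// described above: a mandatory MaxSize and optional RangeLimits.
type ExponentialHistogramConfig struct {
	MaxSize     int          // maximum number of buckets per range
	RangeLimits *RangeLimits // nil means "auto-adjust scale as values arrive"
}

func main() {
	// Default behavior: bounded bucket count, scale adjusts downward as
	// the observed range widens.
	auto := ExponentialHistogramConfig{MaxSize: 160}

	// Optional behavior: a fixed range plus a fixed size implies a fixed scale.
	fixed := ExponentialHistogramConfig{
		MaxSize:     160,
		RangeLimits: &RangeLimits{Min: 0.001, Max: 10000},
	}

	fmt.Printf("%+v\n%+v\n", auto, fixed)
}
```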

As further motivation, there is a PR to add receiver/statsdreceiver support using this reference implementation, open-telemetry/opentelemetry-collector-contrib#6666, which is waiting for everything above to be merged.

@jmacd requested review from a team on January 7, 2022 at 20:30
@jmacd (Contributor, Author) commented on Jan 7, 2022

@MrAlias FYI, this is meant to assist with eventually merging the reference implementation in OTel-Go.

@jmacd (Contributor, Author) commented on Jan 7, 2022

The equations included in this PR are tested in the corresponding OTel-Go PR: open-telemetry/opentelemetry-go#2502

@jmacd (Contributor, Author) commented on Jan 12, 2022

@beorn7 I would like your feedback on this proposal. To frame the options, imagine an OpenTelemetry SDK standing in as a Prometheus client. You have complete control over the histogram behavior: you can choose a fixed scale factor and allow variable size, or you can choose range limits with a fixed size and fixed scale; I believe that in a Prometheus setting these decisions should be made up front.

As a first attempt, I outlined an exponential histogram aggregator with one mandatory setting (size) and one optional setting (range limits). The user wouldn't ever set scale directly under this proposal. What do you think?

cc/ @brian-brazil

@brianbrazil-seczetta commented

@brian-brazil?

@jmacd (Contributor, Author) commented on Jan 12, 2022

Yes, thank you @brianbrazil-seczetta. @brian-brazil one day we hope to add you to the OTel organization 😁

@beorn7 commented on Jan 12, 2022

@jmacd Thanks for pinging me. I have trouble finding time to look at this in detail (since I'm working heads-down on the Prometheus histograms, hopefully getting them to a state where less is in flux and I can give better answers to OTel questions ;). I'm not quite sure what your request is here. In Prometheus, the instrumented binary decides what to expose, and then independent scrapers can do with it what they want (including reducing the resolution AKA increasing the bucket width if they want to store at lower resolution).

https://github.com/prometheus/client_golang/blob/70253f4dd027a7128cdd681c22448d65ca30eed7/prometheus/histogram.go#L384-L407 is how to set the scale.

https://github.com/prometheus/client_golang/blob/70253f4dd027a7128cdd681c22448d65ca30eed7/prometheus/histogram.go#L421-L440 sets the strategy for limiting the number of buckets, which, at first glance, looks similar to what's proposed here.

Does this help? I'm sorry if I missed the point while just skimming this PR. Feel free to ask more specific questions, and I'll try my best in the time given to me.

@jmacd (Contributor, Author) commented on Jan 12, 2022

Thank you @beorn7, very helpful feedback.

Compatibility note: In OTel's protocol and this document Scale is equal to -SparseBucketsFactor in your struct.

I am trying to avoid letting the user set scale directly, because it is difficult to reason about. I was proposing that users either (a) do not set scale, or (b) configure max-size and min/max range limits, which imply a fixed scale. I see that your configuration is more flexible.

For OTel to emulate the behavior implied by your sparse-histogram settings, I will have to adjust this proposal, probably in two parts:

  1. the range limits described thus far have been all-or-none. You're describing an independent minimum limit and no maximum limit. A user could not emulate what you have using WithRangeLimits(minimum, math.MaxFloat64) because we want a way to set the minimum while keeping adjustable scale (i.e., without setting a maximum); see the sketch after this list.
  2. the option to reset a histogram and use a new start timestamp is already built into the OTel data model, but a histogram setting that forces a reset when the scale falls below a threshold could be useful.
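To illustrate point (1), here is a hedged sketch of what independent limits could look like as options. The option names are placeholders invented for this example, not the proposed or final API:

```go
package main

import "fmt"

// histogramConfig holds optional, independent range limits (nil = unset).
type histogramConfig struct {
	minLimit, maxLimit *float64
}

// Option is a functional option for the hypothetical aggregator config.
type Option func(*histogramConfig)

// WithRangeLimits fixes both ends of the range; together with MaxSize this
// implies a fixed scale.
func WithRangeLimits(min, max float64) Option {
	return func(c *histogramConfig) { c.minLimit, c.maxLimit = &min, &max }
}

// WithMinRangeLimit (placeholder name) sets only a minimum, leaving the
// maximum unset so the scale can still adjust to larger observations.
func WithMinRangeLimit(min float64) Option {
	return func(c *histogramConfig) { c.minLimit = &min }
}

func main() {
	var cfg histogramConfig
	WithMinRangeLimit(1e-9)(&cfg)
	fmt.Println(*cfg.minLimit, cfg.maxLimit == nil) // 1e-09 true
}
```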

By the way, I see you have SparseBucketsFactor float64. OTel limited scale to integers so that all histogram conversions are non-lossy, but I admit that's not a strong requirement. The equations in this PR could be updated to compute non-integer scales by replacing scaleFactor = math.Ldexp(math.Log2E, scale) with math.Log2E * math.Exp2(scale) and so on; the mapping functions for non-integer scale perform the same as those documented here, except that the use of non-integer scale admits lossy conversions.
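For reference, a small runnable sketch of the two scale-factor variants mentioned above (integer scale via math.Ldexp versus non-integer scale via math.Exp2). Boundary handling for values that are exact powers of the base is omitted; see the data model for the full equations:

```go
package main

import (
	"fmt"
	"math"
)

// mapToIndexInteger computes a bucket index using the integer-scale equation
// referenced in this PR: scaleFactor = log2(e) * 2^scale, computed exactly
// with math.Ldexp.
func mapToIndexInteger(value float64, scale int) int {
	scaleFactor := math.Ldexp(math.Log2E, scale)
	return int(math.Floor(math.Log(value) * scaleFactor))
}

// mapToIndexFractional is the non-integer-scale variant discussed above;
// the only change is how scaleFactor is computed, and conversions between
// such scales may be lossy.
func mapToIndexFractional(value, scale float64) int {
	scaleFactor := math.Log2E * math.Exp2(scale)
	return int(math.Floor(math.Log(value) * scaleFactor))
}

func main() {
	fmt.Println(mapToIndexInteger(7.5, 3))      // index of 7.5 at scale 3
	fmt.Println(mapToIndexFractional(7.5, 2.5)) // index of 7.5 at scale 2.5
}
```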

OTel reviewers, if you think having the ability to directly set the scale matters, please say so.

I'll update this PR with point (1) above (i.e. make limits independent) and we can address (2) when the time comes. Thank you!

@beorn7 commented on Jan 14, 2022

Compatibility note: In OTel's protocol and this document Scale is equal to -SparseBucketsFactor in your struct.

Actually not. It's more like the scaleFactor in this PR. The user provides the desired growth factor between one bucket boundary and the next, and then the code picks a scale that provides the largest growth factor that is equal to or smaller than the provided SparseBucketsFactor. For example, SparseBucketsFactor = 1.1 results in a Scale of 3 (2^(2^-3) ≈ 1.0905).

I am trying to avoid letting the user set scale directly, because it is difficult to reason about.

Yeah, and that's why we do what's described above. The growth factor gives you a good intuition of the precision the histogram provides.
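A minimal sketch of that selection rule as described above (not the client_golang code): pick the smallest integer scale whose per-bucket growth factor 2^(2^-scale) does not exceed the requested factor.

```go
package main

import (
	"fmt"
	"math"
)

// scaleForGrowthFactor returns the smallest integer scale whose bucket
// growth factor 2^(2^-scale) is <= the requested factor. Assumes factor > 1;
// clamping to an implementation's supported scale range is omitted.
func scaleForGrowthFactor(factor float64) int {
	// 2^(2^-scale) <= factor  <=>  scale >= -log2(log2(factor))
	return int(math.Ceil(-math.Log2(math.Log2(factor))))
}

func main() {
	scale := scaleForGrowthFactor(1.1)
	growth := math.Exp2(math.Exp2(-float64(scale)))
	fmt.Printf("scale=%d growth=%.4f\n", scale, growth) // scale=3 growth=1.0905
}
```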

  1. the range limits described thus far have been all-or-none. You're describing an independent minimum limit and no maximum limit. A user could not emulate what you have using WithRangeLimits(minimum, math.MaxFloat64) because we want a way to set the minimum while keeping adjustable scale (i.e., without setting a maximum).

I guess with the same line of argument that leads to a zero bucket of finite width, one might require an "overflow bucket" and an "underflow bucket" for observations exceeding a configurable max/min value. That would be nicely symmetric, but I guess the demand hasn't come up in practice because it is relatively easy to "accidentally" create observations very close to zero (due to floating-point arithmetic precision issues, or if the observations come from actual physical measurements), while I would assume the cases where you accidentally create extremely large observations are much less common.

So far, we have gone for not doing "overflow/underflow buckets", but if someone has a relevant need, please let me know.

By the way, I see you have SparseBucketsFactor float64. OTel limited scale to integers so that all histogram conversions are non-lossy, but I admit that's not a strong requirement. The equations in this PR could be updated to compute non-integer scales by replacing scaleFactor = math.Ldexp(math.Log2E, scale) with math.Log2E * math.Exp2(scale) and so on; the mapping functions for non-integer scale perform the same as those documented here, except that the use of non-integer scale admits lossy conversions.

I think this is all a misunderstanding; see above. In OTel terms, the Prometheus histograms (in the current PoC state) always have an integer scale between -8 and 4, and histograms can always be merged precisely into a histogram with the least common resolution.
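A short sketch of why merging across integer scales is exact, as noted above: lowering the scale by one merges pairs of adjacent buckets, so indexes translate by a right shift. This illustrates the general property, not code from either project.

```go
package main

import "fmt"

// downscaleIndex maps a bucket index at a finer (larger) scale to the index
// of the bucket that contains it at a coarser (smaller) scale. Because each
// one-step reduction merges exactly two adjacent buckets, the conversion is
// exact for integer scales.
func downscaleIndex(index, fromScale, toScale int) int {
	if toScale > fromScale {
		panic("downscaleIndex: target scale must be coarser (smaller)")
	}
	return index >> uint(fromScale-toScale) // arithmetic shift handles negative indexes
}

func main() {
	// Buckets 6 and 7 at scale 3 land in bucket 3 at scale 2 and bucket 1 at scale 1.
	fmt.Println(downscaleIndex(6, 3, 2), downscaleIndex(7, 3, 2)) // 3 3
	fmt.Println(downscaleIndex(6, 3, 1), downscaleIndex(7, 3, 1)) // 1 1
}
```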

@jmacd (Contributor, Author) commented on Jan 25, 2022

By the way, thank you @beorn7 for clarifying: I understand why scaleFactor is a float now, and see no disagreements. :-)

Joshua MacDonald and others added 6 commits on April 20, 2022 (co-authored by Aaron Abbott <aaronabbott@google.com>)
@jmacd (Contributor, Author) commented on Apr 22, 2022

@reyang The remaining question in this PR gets to a bigger question about handling NaN and Inf values in the Metrics API. Should Exponential Histograms treat NaN and Inf values any differently than Counter instruments? If no, what do we expect? If yes, what do we expect?

@reyang (Member) commented on Apr 27, 2022

@reyang The remaining question in this PR gets to a bigger question about handling NaN and Inf values in the Metrics API. Should Exponential Histograms treat NaN and Inf values any differently than Counter instruments? If no, what do we expect? If yes, what do we expect?

I think none of these should block this PR.

And I think the API spec doesn't care about NaN/Inf.
The SDK spec has a clear position on NaN/Inf: https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/metrics/sdk.md#numerical-limits-handling.

Here is something I did for .NET (it's still experimental):
https://github.com/open-telemetry/opentelemetry-dotnet/blob/117b70c039a9b012887e3aa29717e2d4cee6274e/test/OpenTelemetry.Exporter.Prometheus.Tests/PrometheusSerializerTests.cs#L359-L362

@reyang (Member) commented on Apr 29, 2022

@aabmass @oertl what's your take on #2252 (comment)?

I'm trying to understand your position: do you think we need to discuss more here (and we should not merge the PR before you feel comfortable signing off), or are we good to merge this PR and have a separate conversation about NaN/Inf/limits? Thanks!

@jmacd (Contributor, Author) commented on Apr 29, 2022

@reyang I added two commits.

d15ea33 is meant to address a question from Slack about Prometheus' exponential histogram interoperability.

1a473d0 is meant to answer the question you're asking. I believe we have sufficiently general statements about treating Inf and NaN values; however, for histogram aggregation I think we want all-or-none behavior, meaning the Sum, Count, Min, Max, and Buckets should be consistent. Since we can't have consistent results with Inf and NaN values, because they do not map into valid buckets, I think histogram implementations MUST disregard these. See what you think.
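To illustrate the all-or-none behavior described here, a toy sketch (not the reference implementation): a non-finite measurement is dropped before it can touch any component, so Sum, Count, Min, Max, and the buckets stay mutually consistent.

```go
package main

import (
	"fmt"
	"math"
)

// histogramState is a toy aggregator illustrating the all-or-none rule:
// NaN and +/-Inf are disregarded entirely because they cannot map to a
// finite bucket, keeping Sum, Count, Min, and Max consistent with the buckets.
type histogramState struct {
	count    uint64
	sum      float64
	min, max float64
	// bucket counters omitted for brevity
}

func (h *histogramState) Record(value float64) {
	if math.IsNaN(value) || math.IsInf(value, 0) {
		return // disregard non-finite values
	}
	if h.count == 0 || value < h.min {
		h.min = value
	}
	if h.count == 0 || value > h.max {
		h.max = value
	}
	h.count++
	h.sum += value
}

func main() {
	var h histogramState
	for _, v := range []float64{1.5, math.NaN(), math.Inf(1), 4.0} {
		h.Record(v)
	}
	fmt.Println(h.count, h.sum, h.min, h.max) // 2 5.5 1.5 4
}
```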

@wicha commented on May 5, 2022

Hey Team OpenTelemetry. CTO/Cofounder of a startup that values your work and uses OpenTelemetry here. Is this going to be merged soon? We need this PR merged for some resolution changes we have internally at our company. I'd appreciate it if you could merge 🙏

@reyang (Member) commented on May 10, 2022

@beorn7 @oertl FYI - we plan to merge this PR by end of the week (May 13th, 2022) unless you still see blocking issues.

@wicha FYI - we'll merge the PR by end of Friday, May 13th, 2022 - unless there are blocking comments.

@jmacd (Contributor, Author) commented on May 10, 2022

@gouthamve see #2252 (comment)

@wicha commented on May 12, 2022

@beorn7 @oertl FYI - we plan to merge this PR by end of the week (May 13th, 2022) unless you still see blocking issues.

@wicha FYI - we'll merge the PR by end of Friday, May 13th, 2022 - unless there are blocking comments.

🙏 thank you!

@reyang merged commit 3788987 into open-telemetry:main on May 13, 2022
beeme1mr pushed a commit to beeme1mr/opentelemetry-specification that referenced this pull request on Aug 31, 2022
carlosalberto pushed a commit to carlosalberto/opentelemetry-specification that referenced this pull request on Oct 31, 2024
Labels: area:api, spec:metrics