-
Notifications
You must be signed in to change notification settings - Fork 772
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[sdk-metrics] Promote overflow attribute from experimental to stable #5909
[sdk-metrics] Promote overflow attribute from experimental to stable #5909
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #5909 +/- ##
==========================================
+ Coverage 83.38% 86.27% +2.88%
==========================================
Files 297 260 -37
Lines 12531 11437 -1094
==========================================
- Hits 10449 9867 -582
+ Misses 2082 1570 -512
Flags with carried forward coverage won't be shown. Click here to find out more.
|
…ected for a given metric
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changes look good overall.
I am requesting couple of changes
- Remove internal log for cardinality hit.
- Rephrase changelog.
(I left detailed comment about each of above in PR).
Also, we could use some refactoring on the docs, but this can be followed up separately. If you need help, I can help with the doc part.
…flow attribute and this behavior won't be able to be turned off.
Co-authored-by: Reiley Yang <reyang@microsoft.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM with a minor suggestion.
docs/metrics/README.md
Outdated
the [OpenTelemetry | ||
Specification](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/metrics/sdk.md#overflow-attribute) | ||
become stable, this feature will be turned on by default. | ||
Starting with version 1.10.0, given a metric, once the cardinality limit is |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think its best to mention the pre 1.10.0 behavior too here.
Versions prior to 1.10.0 had a different behavior when cardinality limit was hit. The measurements was either dropped (default) and an internal log was emitted once or aggregated using overflow attribute(opt-in basis)
(we have some places in this repo where were use such style in document)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel it's better if the README reflects the current scenario. Adding prior behavior may increase the length of the document.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the whole thing is a bit confusing 🤣
Here is my stab at improving it:
Given a metric, once the cardinality limit is reached, any new measurement which cannot be independently aggregated because of the limit will be dropped. Starting with version 1.10.0, when a measurement is dropped, the the overflow attribute is updated automatically.
I think what is important is the first statement "Given a metric, once the cardinality limit is reached, any new measurement which cannot be independently aggregated because of the limit will be dropped." That is a hard stop thing. The way it is currently written, it seems this is new to 1.10.0 😄
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel this is tricky. One is the Previous behavior where there is default and experimental opt-in overflow attribute. The other one is the New behavior with the always on overflow attribute. It will also be too long if we explain everything in README.
I've changed the README to include minimal info to mention things has changed a few times, and put more details in the CHANGELOG in case anyone is really interested and want to better understand.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think what is important is the first statement "Given a metric, once the cardinality limit is reached, any new measurement which cannot be independently aggregated because of the limit will be dropped." That is a hard stop thing. The way it is currently written, it seems this is new to 1.10.0 😄
That's a good point. I've updated it to make that first statement more prominent and followed with how our SDK handles and how approaches changed.
One small thing, the spec seems to not see it as a "drop" when it's done with the overflow attribute, but rather a "synthetic aggregation". So it's still "aggregated", but not "independently aggregated".
An overflow attribute set is defined, containing a single attribute otel.metric.overflow having (boolean) value true, which is used to report a synthetic aggregation of the Measurements that could not be independently aggregated because of the limit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have some suggestions on changelog and readme/doc part, rest looks good.
They can be a follow up PR too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice to see this stabilized.
I like the current CHANGELOG entry.
The SDK now always uses the overflow attribute (`otel.metric.overflow = true`) | ||
to aggregate measurements when the cardinality limit is reached. The previous | ||
approach of dropping measurements has been removed. No internal logs are | ||
emitted when the limit is hit. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This feel very misleading to me. The measurements are still dropped. The overflow attribute is not a fix for that, it is just a detection mechanism to alert users when they have an issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's your definition of "dropped" vs. not?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's not the right question. I'm not the audience. The question is, what is user's definition of dropped and expectation when something is recorded? I promise you it is not "my value and dimensions get folded into a flag and thrown away" 🤣
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From the specification perspective, "dropped" means the value is ignored, it has no effect to the final metrics. Since overflow attribute keeps the accurate total sum for Counter (which means the user will always get the correct total sum, whether overflow happened or not), and correct percentile for Histogram, it shouldn't be considered "dropped".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see your point. My perspective is... I'm some user and I look at my backend and I see all my metrics, I'm happy. Then a day later I look and everything is gone. Or suddenly everything is empty? Certainly not what I saw before, right? So naturally I'm going to think data is missing and must be getting dropped somewhere. But I read the SDK CHANGELOG and it said it no longer drops anything! Now I'm upset and declare the SDK bugged and open an issue to vent my anger 🔥
I'm not saying the spec is wrong. But I don't think the users will understand the nuance 🤔
Given a metric, once the cardinality limit is reached, any new measurement that could not be independently aggregated will be aggregated using the overflow attribute.
IMO this is going to be like lawyer jargon for users 😄 I think we should break it down in more layman/accessible terms.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then a day later I look and everything is gone. Or suddenly everything is empty? Certainly not what I saw before, right? So naturally I'm going to think data is missing and must be getting dropped somewhere.
I'm very confused. Would you explain more? E.g. could you give a concrete example about "everything is empty" or "everything is gone"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree that the wording may not easily be understood by all end-users. The https://github.com/open-telemetry/opentelemetry-dotnet/tree/main/docs/metrics#cardinality-limits doc is very good, and perhaps we can make some additions to it to explain how the sdk behaves when limit is hit, and how to interpret overflow attribute correctly.
One important think I'd like called out is, if user has a query like sum of all requests, where route=foo, and an overflow exists - then that query is no longer trustable, as there is no way to tell if a route=foo measurement was folded into overflow. The only thing trustable in the event an overflow exists is the total metrics (i.e the one which do not filter based on any dimensions).
If none volunteers to make this change in the doc, I can cover it. (I am implementing similar thing for OTel Rust right now, so I can hopefully steal some wordings! Ideally this should be covered in otel docs website, so every language can benefit)
https://github.com/utpilla/MetricOverflowAttribute?tab=readme-ov-file can be a good starting point.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Opened an issue #5939 to track this.
Fixes #5641.
Design discussion issue #
Changes
Remove the
OTEL_DOTNET_EXPERIMENTAL_METRICS_EMIT_OVERFLOW_ATTRIBUTE
variable and make the feature enabled by default.Merge requirement checklist
CHANGELOG.md
files updated for non-trivial changes