Stabilize Overflow attribute section under Cardinality Limits #3904
Comments
#3798 This seems like a pre-req. |
The Rust implementation is pretty basic: it just enforces a hardcoded limit of 2000, which is not configurable in any way, neither via Views nor at the Provider/Reader level. I hope other implementations are more complete? |
@cijothomas I believe you're referring to the Configuration section under Cardinality limits. This issue only targets the Overflow attribute section. I think as long as the SDKs have imposed some cardinality limit (be it hardcoded or configurable), the stability of metric overflow attribute can be tracked independently. |
Oh this was just for that subsection, got it. What is the benefit of stabilizing that part alone? |
Even as a stand-alone feature, this is very useful. Here are two major benefits of using this feature:
|
@utpilla I agree with your reasoning -- there can be disagreement over how the behavior is implemented, but the attribute meaning is well defined either way, I think, and I agree that we can stabilize it. |
The TC discussed this issue during the specification issue triage meeting. Here are the two things that need to be done before we mark the issue as triage-accepted:
|
FYI @cijothomas @jack-berg @jmacd @jsuereth - I sent a PR #3912 trying to address 1). |
what does it mean to support more than one limit? thanks |
It is not about more than one limit, it is about "the SDK can offer more room if there is a good reason", for example, the user can set the limit to 10 and the SDK can decide to offer 12 (or whatever number as long as it is greater than or equal to 10). |
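To make the "greater than or equal" reading concrete, here is a minimal sketch of a sum aggregator with a cardinality limit. It assumes one possible interpretation: the user-set limit applies to regular attribute sets, and the overflow series is offered on top of it. The type `limitedSum`, its methods, and the string encoding of attribute sets are hypothetical and only serve the illustration; the attribute name `otel.metric.overflow` is the one from the spec section under discussion.

```go
package main

import "fmt"

// overflowKey stands for the attribute set {otel.metric.overflow=true}.
// Encoding an attribute set as a string is a simplification for this sketch.
const overflowKey = "otel.metric.overflow=true"

// limitedSum is a hypothetical, minimal sum aggregator with a cardinality limit.
type limitedSum struct {
	limit  int              // maximum number of regular attribute sets
	series map[string]int64 // attribute set -> running sum
}

func newLimitedSum(limit int) *limitedSum {
	return &limitedSum{limit: limit, series: make(map[string]int64)}
}

// Add records value under attrs, or under the overflow series once the
// limit on distinct regular attribute sets has been reached.
func (s *limitedSum) Add(attrs string, value int64) {
	if _, seen := s.series[attrs]; !seen && s.regularCount() >= s.limit {
		attrs = overflowKey
	}
	s.series[attrs] += value
}

// regularCount returns the number of series, not counting the overflow series.
func (s *limitedSum) regularCount() int {
	n := len(s.series)
	if _, ok := s.series[overflowKey]; ok {
		n--
	}
	return n
}

func main() {
	s := newLimitedSum(2)
	for _, a := range []string{"route=/a", "route=/b", "route=/c", "route=/d", "route=/a"} {
		s.Add(a, 1)
	}
	// Two regular series plus the overflow series, which absorbed /c and /d.
	fmt.Println(len(s.series), s.series[overflowKey])
}
```

With this reading, a user who sets the limit to N gets N regular series and does not have to account for the overflow series when choosing N.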
Do we have a reason in mind already? It may be worth mentioning it in the PR, since there's a downside to not respecting the user's requested limit (it could be a confusing user experience if they set the cardinality limit to one value but see higher cardinality being reported to their backend). |
I can share some of the thinking here (the PR description already links to this issue):
And more importantly, removing the "is one less than the limit, as a result" wording makes it easier for the user. Imagine the users just want to report a value without any attribute, they have to understand and set the limit to |
|
@jack-berg @jmacd @jsuereth @reyang @trask For the second point, please check this README file. I have put up some text explaining how the overflow attribute plays out with the Prometheus/Grafana workflow and how it adds value for the users. Let me know if you have any questions or suggestions. |
Since this is a stabilization tracking issue, there are a few open issues related to cardinality limits:
#3578 in particular seems important to resolve before marking this stable. |
I saw the TC notes. Is there any particular feedback related to Prometheus we want? The current spec has drawbacks for prometheus users, but these likely apply to non-prometheus users as well:
@fstab and @gouthamve have provided helpful feedback in the past, and may know scenarios I haven't thought of, or have suggestions for how to handle cardinality overflow. |
Can you elaborate on this? It seems possible to write a prometheus query that searches across all metric names with a particular label. Is it prohibitively expensive to run this query for alerting purposes? Or something else? |
Thanks @jack-berg. You are correct, and I didn't realize that was possible. |
I was assuming this was prohibitively expensive. If not, I think we have a way forward here for stabilization. |
I'm not sure if it would be prohibitively expensive or not. |
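For readers unfamiliar with the idea of searching across all metric names for one label, the following sketch shows the logic in plain Go over an in-memory list of series. The `series` type and `metricsWithLabel` function are hypothetical. The label name `otel_metric_overflow` assumes the usual dot-to-underscore sanitization of `otel.metric.overflow` in a Prometheus exporter; under that assumption the PromQL equivalent would presumably be a selector without a metric name, such as `{otel_metric_overflow="true"}`. The sketch says nothing about how expensive such a query is on a real server.

```go
package main

import (
	"fmt"
	"sort"
)

// series is a hypothetical, simplified view of one scraped time series.
type series struct {
	name   string
	labels map[string]string
}

// metricsWithLabel returns the sorted, de-duplicated names of all metrics
// that have at least one series carrying label=value.
func metricsWithLabel(all []series, label, value string) []string {
	seen := make(map[string]bool)
	var names []string
	for _, s := range all {
		if s.labels[label] == value && !seen[s.name] {
			seen[s.name] = true
			names = append(names, s.name)
		}
	}
	sort.Strings(names)
	return names
}

func main() {
	all := []series{
		{"http_requests_total", map[string]string{"route": "/a"}},
		{"http_requests_total", map[string]string{"otel_metric_overflow": "true"}},
		{"queue_depth", map[string]string{"queue": "q1"}},
	}
	// Lists every metric that has hit its cardinality limit.
	fmt.Println(metricsWithLabel(all, "otel_metric_overflow", "true"))
}
```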
Sorry I have not studied the complete back-story, but an idea came to me: In this text:
if the word "single" is deleted, the other attributes could be replaced with an overflow value and Prometheus queries/joins/etc would now work. In other words, instead of:
Prometheus would see:
|
Yup, querying across all metric names can be tricky/impossible/expensive for some backends, but Prometheus should be generally fine if queried correctly (e.g. just However, what @bboreham figured out would be a small but amazing modification. It should mitigate the big challenges with the overflow logic @dashpole mentioned:
The slight disadvantage is the question of what to do if the target has multiple sets of different label names on a single metric, but perhaps including all labels (dimensions) that overflow on a single overflow series might do the trick. Also |
Please do not invent new conventions that do not follow Prometheus best practices. In the event of a data error in a collector, best practice is to fail the whole collection with a 5xx error.
This follows the "Fail Fast" best practice. |
@SuperQ I believe what happens today in Prometheus (with client libraries) is that the process itself crashes (eventually) on high cardinality. OTEL deemed this unacceptable, which is why we suppress/limit cardinality client side. If you have links to other ways Prometheus deals with things client-side, I'm happy to follow the best practice. But I would not say there is one today. |
@bboreham @bwplotka I do like what you're suggesting with setting the value of labels to "overflow". It's not entirely feasible in OTEL because we don't have string labels, we have typed attributes (so a label could be an int, e.g.). I think this is why we just add the new label here. I do really like your suggestion though. In this case, I think our assumption is that in Prometheus the label will have an empty value (e.g. overflow="true", a="", b=""). We assume (via best practice) that a/b will have real values in most cases, so an empty label will already stand out as much as a "..." would. @jack-berg / @reyang do you remember all the rationale we had when we decided on just one new label? Is it just the typed-attribute issue? |
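The two label shapes being compared can be sketched as follows. This is only an illustration of the shapes mentioned in the thread (overflow="true" with a and b empty, versus a and b carrying a placeholder value); the function names are hypothetical, the short label name `overflow` follows the shorthand used in the comment above, and the sketch sidesteps the typed-attribute concern by treating every label as a string, as Prometheus does.

```go
package main

import "fmt"

// singleOverflowLabel models the behavior described above: the overflow
// series carries only the overflow label, so in Prometheus the metric's
// other labels read as empty.
func singleOverflowLabel(labelNames []string) map[string]string {
	out := map[string]string{"overflow": "true"}
	for _, n := range labelNames {
		out[n] = ""
	}
	return out
}

// overflowValueLabels models the suggested alternative: every existing label
// keeps its name and gets a placeholder value. Whether the dedicated overflow
// label is also kept is left open in the thread; this sketch omits it.
func overflowValueLabels(labelNames []string, placeholder string) map[string]string {
	out := make(map[string]string, len(labelNames))
	for _, n := range labelNames {
		out[n] = placeholder
	}
	return out
}

func main() {
	names := []string{"a", "b"}
	fmt.Println(singleOverflowLabel(names))
	fmt.Println(overflowValueLabels(names, "overflow"))
}
```

In the second shape, queries and joins that group by a or b still find a value on the overflow series, which is the property the suggestion is after.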
One option would be for prometheus exporters to fail the scrape if the overflow attribute is set on any metrics. But unlike with the prometheus server's scrape limits, you wouldn't be able to mitigate it by changing a server-side config. You would need to restart your client with a higher overflow limit. An alternative would be to recommend to users that they set their scrape limits on the prometheus server lower than the client-side limits so they still get the scrape failures, but also have the ability to mitigate via server config. |
@dashpole, yup, exactly my thoughts. If you handle sample limits in the library you want to fail the scrape. But as you point out, it's now the client's issue to deal with it. But that sounds like it's working as intended. In Prometheus, we leave it up to users to decide what cardinality they want from the client libraries. Like you say, Prometheus can handle cardinality limits on the server side with admin configured scrape limits. In practice we've found users tend to notice on the server side long before it becomes a client-side issue. Since the monitoring system has to deal with the total cardinality of all instances, it tends to show up there first. The Prometheus client libraries use very few bytes of memory per metric exposed. I can get some benchmark tests of this if it helps. |
@dashpole This doesn't fix the fact that they will consume memory storing metrics client side. |
@jsuereth sorry for the confusion. I'm not suggesting we remove client-side limits, and I think those are useful. I meant to say that we could keep the current client-side limits and attribute, and tell Prometheus users to set scrape limits below our default value as a way to get the Prometheus-recommended behavior. That way, we would not need to introduce scrape failures in our exporter, which could cause the problem described above for users. @SuperQ I trust your assertion regarding low memory usage of Prometheus client libraries. I would read this proposal as adding an additional layer of protection for users to prevent DoS, not as a replacement for server-specified limits. I agree that Prometheus users will generally notice cardinality problems in the server before the client, assuming they are scraping the endpoint. |
Nerd-sniped myself into writing a quick test. I generated 100k and 1M tests with metrics like
Code:
package main

import (
	"fmt"
	"log"
	"net/http"
	_ "net/http/pprof"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
	"github.com/thanhpk/randstr"
)

func main() {
	fmt.Println("Generating metrics")
	metricTest := promauto.NewCounterVec(
		prometheus.CounterOpts{
			Name: "test_cardinality_total",
			Help: "Test for cardinality",
		},
		[]string{"test_label"},
	)
	// Create one series per random 32-character label value.
	for i := 0; i < 100_000; i++ {
		label := randstr.String(32)
		fmt.Printf("Generating metric %d %s\n", i, label)
		metricTest.WithLabelValues(label).Inc()
	}
	// Serve the metrics; pprof is registered on the default mux via the blank import.
	http.Handle("/metrics", promhttp.Handler())
	fmt.Println("Listening on :8080")
	log.Fatal(http.ListenAndServe(":8080", nil))
}
|
@jsuereth @jack-berg @bboreham It is not. Imagine this:
|
Yup, that's a good DoS. And I'm also not suggesting that you avoid implementing client-side protections. It's just that there is a clearly defined best practice for exposing the result of those protections, which is to fail closed / fail fast. This provides the best "principle of least surprise" when it comes to data. A partial result / soft fail produces an observability result that is harder to handle, both cognitively and programmatically. |
The TC discussed metrics cardinality limits and the overflow attribute during the Mar. 13th, 2024 meeting. The TC agreed to unblock #3905; a short summary will be added to the cardinality limits section in the spec to clarify the roadmap. Once the PR is merged, @reyang will publish a blog post clarifying how OpenTelemetry plans to handle cardinality limits, and what users should expect from an experience perspective. |
@open-telemetry/technical-committee please take another look at this one, the PR #3905 was closed since it stalled, is this still something we want to accept or should we move it back to "deciding" or even reject it? (cc @jpkrohling @danielgblanco) |
Yes, I'll be working on it. We were focusing on getting Exemplars stable #3870 that's why this got delayed. |
FYI the TC discussed this last week and the discussion led us to believe that the aforementioned issues do not block the stabilization and/or can be implemented in the future as additions. Feel free to provide feedback in case you think this is not the case. |
What are you trying to achieve?
Mark the Overflow attribute section under Cardinality limits as Stable.
The spec was updated with the overflow attribute and cardinality limits section in #2960.
It looks like we already have more than three languages that have added support for this: