kube-apiserver SLO possible calculation issue #498
This problem is my own. It's caused by dropping high-cardinality metrics from ….
Prometheus Operator (which I'd guess maintainers test against) drops some series from ….
#498 (comment) was the solution; hopefully helpful to someone.
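One quick way to check whether this is happening in a given setup (a hypothetical diagnostic query pair, not part of the mixin):

```
# Run each line as a separate instant query: list the API groups that still
# have series for each metric. If relabeling dropped non-core groups from one
# metric but not the other, the two "group" sets will differ.
count by (group) (apiserver_request_duration_seconds_bucket)
count by (group) (apiserver_request_duration_seconds_count)
```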
* Reduce `apiserver_request_duration_seconds_count` cardinality by dropping series for non-core Kubernetes APIs. This is done to match the `apiserver_request_duration_seconds_bucket` relabeling.
* These two relabels must be performed the same way to avoid affecting the new SLO calculations (upcoming).
* See kubernetes-monitoring/kubernetes-mixin#498

Related: #596
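A rough before/after check of the cardinality reduction described above (a sketch; the empty `group` label denotes the core API group):

```
# Run as separate instant queries: total number of series vs. the series
# remaining after restricting to the core API group.
count(apiserver_request_duration_seconds_count)
count(apiserver_request_duration_seconds_count{group=""})
```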
Yeah, you're totally right. The cardinality of these metrics is quite insane. Recently in SIG-instrumentation, there has even been a discussion around overhauling these metrics for exactly that reason, if I recall correctly.
I'm looking at the kube-apiserver SLO rules, alerts, and dashboard, which seem to have been added since I last looked at this repo. However, the rules produce wildly unexpected values (negative availability, -6000% error budget, 40% in the best case, etc.) on clusters with healthy apiservers.
Let's consider `apiserver_request:availability` for writes. At a high level, this tries to measure `1 - (slow requests + error requests) / total requests`. For me, it evaluates to ~0.40, with slow requests being the supposed contributor (the error part yields 0).

Looking closer shows an issue. The query uses a histogram and basically subtracts the "slow" requests (those taking longer than 1 second) from the total count of request events in the histogram.
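For reference, a simplified sketch of that write-availability calculation (not the exact mixin rule, which builds on precomputed recording rules; the 30d window, 1s threshold, write-verb regex, and the use of `apiserver_request_total` for the error term are assumptions here):

```
# Simplified sketch of write availability:
# "slow"   = all write requests minus those completing within 1s (le="1")
# "errors" = 5xx write responses
1 -
(
  (
    sum(increase(apiserver_request_duration_seconds_count{verb=~"POST|PUT|PATCH|DELETE"}[30d]))
    -
    sum(increase(apiserver_request_duration_seconds_bucket{verb=~"POST|PUT|PATCH|DELETE",le="1"}[30d]))
  )
  +
  sum(increase(apiserver_request_total{verb=~"POST|PUT|PATCH|DELETE",code=~"5.."}[30d]))
)
/
sum(increase(apiserver_request_duration_seconds_count{verb=~"POST|PUT|PATCH|DELETE"}[30d]))
```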
But these aren't actually measuring the same classes of requests. The assumption in the query above is that they would be equal in the ideal case (all requests with latency less than infinity == all requests).

`apiserver_request_duration_seconds_bucket` records requests for core API objects (nodes, secrets, configmaps) only. `apiserver_request_duration_seconds_count` also records requests for additional API objects (certificatesigningrequests, tokenreviews, customresourcedefinitions, endpointslices, networkpolicies). As a result, the query mostly just measures the difference in usage between the core and other API groups.
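The gap can be made visible directly with a hypothetical diagnostic query (not part of the mixin; the write-verb regex is assumed):

```
# With le="+Inf" the bucket series should equal the count series for the same
# selectors. Any positive gap here is traffic recorded only by _count
# (non-core API groups), not genuinely slow requests.
sum(rate(apiserver_request_duration_seconds_count{verb=~"POST|PUT|PATCH|DELETE"}[5m]))
-
sum(rate(apiserver_request_duration_seconds_bucket{verb=~"POST|PUT|PATCH|DELETE",le="+Inf"}[5m]))
```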
Workaround

A fix would be to filter `apiserver_request_duration_seconds_count` to `group=""`. With that change, availability is 100% (or very close), as expected.
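A minimal sketch of that filtered selector, assuming the same write-verb regex as above (the empty `group` label identifies the core API group):

```
# Workaround sketch: restrict the count metric to the core API group so it
# covers the same requests as the bucket metric.
sum(rate(apiserver_request_duration_seconds_count{group="",verb=~"POST|PUT|PATCH|DELETE"}[5m]))
```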
Before submitting that, I'd like to understand how this was added and why others don't see the issue. … `apiserver_request_duration_seconds_count` non-core time series?

Versions
cc @metalmatze