kube-apiserver SLO possible calculation issue #498
This problem is my own. It's caused by dropping high-cardinality metrics from ….
Prometheus Operator (which I'd guess maintainers test against) drops some series from ….
#498 (comment) was the solution; hopefully helpful to someone.
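One quick way to check whether this is happening in a given setup (a hypothetical diagnostic query pair, not part of the mixin):

```
# Run each line as a separate instant query: list the API groups that still
# have series for each metric. If relabeling dropped non-core groups from one
# metric but not the other, the two "group" sets will differ.
count by (group) (apiserver_request_duration_seconds_bucket)
count by (group) (apiserver_request_duration_seconds_count)
```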
* Reduce `apiserver_request_duration_seconds_count` cardinality by dropping series for non-core Kubernetes APIs. This is done to match the `apiserver_request_duration_seconds_bucket` relabeling.
* These two relabels must be performed the same way to avoid affecting the new SLO calculations (upcoming).
* See kubernetes-monitoring/kubernetes-mixin#498

Related: #596
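A rough before/after check of the cardinality reduction described above (a sketch; the empty `group` label denotes the core API group):

```
# Run as separate instant queries: total number of series vs. the series
# remaining after restricting to the core API group.
count(apiserver_request_duration_seconds_count)
count(apiserver_request_duration_seconds_count{group=""})
```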
Yeah, you're totally right. The cardinality of these metrics is quite insane. Recently in SIG-instrumentation, there has even been a discussion around overhauling these metrics for exactly that reason, if I recall correctly.
I'm looking at the kube-apiserver SLO rules, alerts, and dashboard, which seem to have been added since I last looked at this repo. However, the rules produce wildly unexpected values (negative availability, -6000% error budget, 40% in the best case, etc.) on clusters with healthy apiservers.
Let's consider `apiserver_request:availability` for writes. At a high level, this tries to measure `1 - (slow requests + error requests) / total requests`. For me, it evaluates to ~0.40, with slow requests being the supposed contributor (the error part yields 0).

Looking closer shows an issue. The query uses a histogram and basically subtracts the "slow" requests (those taking longer than 1 second) from the total count of request events in the histogram.
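For reference, a simplified sketch of that write-availability calculation (not the exact mixin rule, which builds on precomputed recording rules; the 30d window, 1s threshold, write-verb regex, and the use of `apiserver_request_total` for the error term are assumptions here):

```
# Simplified sketch of write availability:
# "slow"   = all write requests minus those completing within 1s (le="1")
# "errors" = 5xx write responses
1 -
(
  (
    sum(increase(apiserver_request_duration_seconds_count{verb=~"POST|PUT|PATCH|DELETE"}[30d]))
    -
    sum(increase(apiserver_request_duration_seconds_bucket{verb=~"POST|PUT|PATCH|DELETE",le="1"}[30d]))
  )
  +
  sum(increase(apiserver_request_total{verb=~"POST|PUT|PATCH|DELETE",code=~"5.."}[30d]))
)
/
sum(increase(apiserver_request_duration_seconds_count{verb=~"POST|PUT|PATCH|DELETE"}[30d]))
```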
But these aren't actually measuring the same classes of requests. The assumption in the query above is that they would be equal in the ideal case (all requests with latency less than infinity == all requests).

`apiserver_request_duration_seconds_bucket` records requests for core API objects (nodes, secrets, configmaps) only. `apiserver_request_duration_seconds_count` also records requests for additional API objects (certificatesigningrequests, tokenreviews, customresourcedefinitions, endpointslices, networkpolicies). As a result, the query mostly just measures the difference in usage between the core and other API groups.
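The gap can be made visible directly with a hypothetical diagnostic query (not part of the mixin; the write-verb regex is assumed):

```
# With le="+Inf" the bucket series should equal the count series for the same
# selectors. Any positive gap here is traffic recorded only by _count
# (non-core API groups), not genuinely slow requests.
sum(rate(apiserver_request_duration_seconds_count{verb=~"POST|PUT|PATCH|DELETE"}[5m]))
-
sum(rate(apiserver_request_duration_seconds_bucket{verb=~"POST|PUT|PATCH|DELETE",le="+Inf"}[5m]))
```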
Workaround

A fix would be to filter `apiserver_request_duration_seconds_count` to `group=""`. With that change, availability is 100% (or very close), as expected.
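A minimal sketch of that filtered selector, assuming the same write-verb regex as above (the empty `group` label identifies the core API group):

```
# Workaround sketch: restrict the count metric to the core API group so it
# covers the same requests as the bucket metric.
sum(rate(apiserver_request_duration_seconds_count{group="",verb=~"POST|PUT|PATCH|DELETE"}[5m]))
```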
Before submitting that, I'd like to understand how this was added and why others don't see the issue. … `apiserver_request_duration_seconds_count` non-core time series?

Versions
cc @metalmatze