
controller histogram buckets configuration #258

Closed · terinjokes opened this issue Dec 17, 2018 · 11 comments

Labels: good first issue · help wanted · lifecycle/stale · priority/awaiting-more-evidence

@terinjokes (Contributor)

Metrics are created for the controller's workqueue, using Prometheus's default buckets. Unfortunately, the default buckets are poorly chosen for event processing.

The default buckets are tailored to broadly measure the response time (in seconds) of a network service.

This can easily result in metrics that are extremely coarse. In a controller I was working on today, every single reconcile was faster than the smallest bucket:

controller_runtime_reconcile_time_seconds_bucket{controller="application",le="0.005"} 31
controller_runtime_reconcile_time_seconds_bucket{controller="application",le="0.01"} 31
controller_runtime_reconcile_time_seconds_bucket{controller="application",le="0.025"} 31
controller_runtime_reconcile_time_seconds_bucket{controller="application",le="0.05"} 31
controller_runtime_reconcile_time_seconds_bucket{controller="application",le="0.1"} 31
controller_runtime_reconcile_time_seconds_bucket{controller="application",le="0.25"} 31
controller_runtime_reconcile_time_seconds_bucket{controller="application",le="0.5"} 31
controller_runtime_reconcile_time_seconds_bucket{controller="application",le="1"} 31
controller_runtime_reconcile_time_seconds_bucket{controller="application",le="2.5"} 31
controller_runtime_reconcile_time_seconds_bucket{controller="application",le="5"} 31
controller_runtime_reconcile_time_seconds_bucket{controller="application",le="10"} 31
controller_runtime_reconcile_time_seconds_bucket{controller="application",le="+Inf"} 31

It is possible to change the default buckets by modifying DefBuckets in an init function, since my init runs after the DefBuckets variable has been initialized but before the controller metrics package's init. However, this is a very heavy-handed approach: it changes the defaults of all histograms.
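
For reference, a minimal sketch of that workaround. The package name and bucket values below are illustrative only, and the whole thing relies on package initialization order as described above, so treat it as an assumption rather than a recommendation:

```go
// Package metricsoverride is a hypothetical package the program imports
// before the controller metrics package.
package metricsoverride

import "github.com/prometheus/client_golang/prometheus"

// Heavy-handed workaround: reassign the package-level default buckets
// before any histogram that relies on them is constructed. This changes
// the defaults for every histogram created with DefBuckets afterwards.
func init() {
	prometheus.DefBuckets = []float64{
		0.0001, 0.0005, 0.001, 0.0025, 0.005, 0.01, 0.025, 0.05, 0.1, 0.5, 1,
	}
}
```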

I propose two paths forward:

  1. Allow the user to pass their own metrics to the controller, with the current collectors used if none are provided.
  2. Alternatively, move the metrics package outside of the internal package. This would allow users to call Unregister and then register a replacement.

My preference would be the first option, as the change to the system would be more easily understood.
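
As a rough sketch of what option 1 could look like: the Metrics struct, field names, and defaultMetrics helper below are hypothetical, not part of controller-runtime's API, and the help text is approximate.

```go
package controller

import "github.com/prometheus/client_golang/prometheus"

// Metrics is a hypothetical options struct the controller could accept.
// When the caller leaves it empty, the current collectors are used.
type Metrics struct {
	ReconcileTime *prometheus.HistogramVec
}

// defaultMetrics mirrors today's behavior: a reconcile-time histogram
// with the stock Prometheus buckets, labeled by controller name.
func defaultMetrics() Metrics {
	return Metrics{
		ReconcileTime: prometheus.NewHistogramVec(prometheus.HistogramOpts{
			Name:    "controller_runtime_reconcile_time_seconds",
			Help:    "Length of time per reconcile per controller",
			Buckets: prometheus.DefBuckets, // caller-supplied metrics could use any buckets
		}, []string{"controller"}),
	}
}
```

A caller who needs finer (or coarser) buckets would construct their own HistogramVec and pass it in, instead of relying on the default.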

@mengqiy (Member) commented Dec 17, 2018

The 1st option makes sense to me.

@DirectXMan12 (Contributor)

I think we can probably choose some better buckets, too. This same discussion is being had right now in the main k/k repo ;-).

@DirectXMan12 (Contributor)

It seems to me like exposing the exact mechanics of how we observe metrics is exposing an implementation detail as public API (at least, that's my gut feeling), particularly since the only way to actually change the bucket definition would be to expose the individual metrics at an API level. That would mean we'd be tied to a particular set of metrics with a particular interface for measuring them (and that only a major revision could change or remove metrics, though it's unclear whether that's a bad thing). If we ever wanted to switch to, say, contextual metrics measurement, we couldn't do so without a major version rev, even if we didn't actually change the metrics exposed.

My instinct here is to fix our buckets and wait for more evidence that a "one-size-fits-all" solution won't work here, especially since we have little evidence that our current buckets are correct for many cases. The Kubernetes SLO for request latency against the API server is 1s, IIRC, and that's on the high end; the current bucket selection seems poor for that case. k/k is discussing the same thing right now in the switch from summaries to histograms.

@mengqiy (Member) commented Dec 27, 2018

Ref: kubernetes/kubernetes#63750 and kubernetes/kubernetes#67476 are the upstream issue and PR.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 27, 2019
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels May 27, 2019
@mengqiy mengqiy added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. labels May 28, 2019
@DirectXMan12 (Contributor)

/remove-lifecycle stale

looks like kube has fixed this upstream, so we can follow suit

/good-first-issue

@k8s-ci-robot (Contributor)

@DirectXMan12:
This request has been marked as suitable for new contributors.

Please ensure the request meets the requirements listed here.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-good-first-issue command.

In response to this:

/remove-lifecycle stale

looks like kube has fixed this upstream, so we can follow suit

/good-first-issue

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added good first issue Denotes an issue ready for a new contributor, according to the "help wanted" guidelines. help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. labels Jun 4, 2019
DirectXMan12 pushed a commit that referenced this issue Jan 31, 2020
Update kubebuilder OWNERS
@vincepri vincepri removed the lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. label Feb 20, 2020
@vincepri vincepri added this to the Next milestone Feb 20, 2020
@vincepri (Member)

/priority awaiting-more-evidence

@k8s-ci-robot k8s-ci-robot added the priority/awaiting-more-evidence Lowest priority. Possibly useful, but not yet enough support to actually get it done. label Feb 20, 2020
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 20, 2020
@irajdeep (Contributor)

Hi @DirectXMan12
We also have a use case where we need longer-duration buckets (> 10s) for the controller_runtime_reconcile_time_seconds histogram.
Should we add buckets to the controller-runtime metrics in line with the upstream changes made in this PR: kubernetes/kubernetes#73638?

irajdeep added a commit to irajdeep/controller-runtime that referenced this issue Nov 24, 2020
…gram metrics

The current metric uses the default Prometheus buckets for the reconcile time histogram. These buckets are
not sufficient to reason about the percentile of requests that take less than x seconds when x falls outside
the largest bucket of 10 seconds. It is also hard to tell when the reconcile loops are fairly fast, as mentioned in
this issue: kubernetes-sigs#258.

This PR defines explicit buckets for the metric; the values are chosen based on the apiserver
request latency buckets defined here: https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/apiserver/pkg/endpoints/metrics/metrics.go#L103.
The default Prometheus histogram buckets have also been added (wherever missing) to ensure backward compatibility.
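
For illustration, a sketch of that shape: keep the default buckets for backward compatibility and extend the upper range. The extra bucket values and help text below are examples, not the values actually proposed in the PR.

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

// reconcileTime keeps the stock Prometheus buckets and appends larger ones
// so reconciles slower than 10s still land in a finite bucket.
var reconcileTime = prometheus.NewHistogramVec(prometheus.HistogramOpts{
	Name:    "controller_runtime_reconcile_time_seconds",
	Help:    "Length of time per reconcile per controller",
	Buckets: append(prometheus.DefBuckets, 15, 30, 60, 120, 300, 600),
}, []string{"controller"})
```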