
controller histogram buckets configuration #258

Closed · terinjokes opened this issue Dec 17, 2018 · 11 comments

Labels: good first issue · help wanted · lifecycle/stale · priority/awaiting-more-evidence

@terinjokes (Contributor)

Metrics are created for the controller's workqueue, using Prometheus's default buckets. Unfortunately, the default buckets are poorly chosen for event processing.

The default buckets are tailored to broadly measure the response time (in seconds) of a network service.

This can easily result in metrics that are extremely coarse. In a controller I was working on today, every single reconcile was faster than the smallest bucket:

controller_runtime_reconcile_time_seconds_bucket{controller="application",le="0.005"} 31
controller_runtime_reconcile_time_seconds_bucket{controller="application",le="0.01"} 31
controller_runtime_reconcile_time_seconds_bucket{controller="application",le="0.025"} 31
controller_runtime_reconcile_time_seconds_bucket{controller="application",le="0.05"} 31
controller_runtime_reconcile_time_seconds_bucket{controller="application",le="0.1"} 31
controller_runtime_reconcile_time_seconds_bucket{controller="application",le="0.25"} 31
controller_runtime_reconcile_time_seconds_bucket{controller="application",le="0.5"} 31
controller_runtime_reconcile_time_seconds_bucket{controller="application",le="1"} 31
controller_runtime_reconcile_time_seconds_bucket{controller="application",le="2.5"} 31
controller_runtime_reconcile_time_seconds_bucket{controller="application",le="5"} 31
controller_runtime_reconcile_time_seconds_bucket{controller="application",le="10"} 31
controller_runtime_reconcile_time_seconds_bucket{controller="application",le="+Inf"} 31

It is possible to change the default buckets by modifying DefBuckets in an init function, since my init runs after the DefBuckets variable has been initialized but before the controller metrics package's init. However, this is a very heavy-handed approach: it changes the defaults of all histograms.
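
For reference, a minimal sketch of that workaround. The package name and bucket values below are illustrative only, and the whole thing relies on package initialization order as described above, so treat it as an assumption rather than a recommendation:

```go
// Package metricsoverride is a hypothetical package the program imports
// before the controller metrics package.
package metricsoverride

import "github.com/prometheus/client_golang/prometheus"

// Heavy-handed workaround: reassign the package-level default buckets
// before any histogram that relies on them is constructed. This changes
// the defaults for every histogram created with DefBuckets afterwards.
func init() {
	prometheus.DefBuckets = []float64{
		0.0001, 0.0005, 0.001, 0.0025, 0.005, 0.01, 0.025, 0.05, 0.1, 0.5, 1,
	}
}
```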

I propose two paths forward:

  1. Allow the user to pass their own metrics to the controller, with the current collectors used if none are provided.
  2. Alternatively, move the metrics package outside of the internal package. This would allow users to call Unregister and then register a replacement.

My preference would be the first option, as the change to the system would be more easily understood.
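
As a rough sketch of what option 1 could look like: the Metrics struct, field names, and defaultMetrics helper below are hypothetical, not part of controller-runtime's API, and the help text is approximate.

```go
package controller

import "github.com/prometheus/client_golang/prometheus"

// Metrics is a hypothetical options struct the controller could accept.
// When the caller leaves it empty, the current collectors are used.
type Metrics struct {
	ReconcileTime *prometheus.HistogramVec
}

// defaultMetrics mirrors today's behavior: a reconcile-time histogram
// with the stock Prometheus buckets, labeled by controller name.
func defaultMetrics() Metrics {
	return Metrics{
		ReconcileTime: prometheus.NewHistogramVec(prometheus.HistogramOpts{
			Name:    "controller_runtime_reconcile_time_seconds",
			Help:    "Length of time per reconcile per controller",
			Buckets: prometheus.DefBuckets, // caller-supplied metrics could use any buckets
		}, []string{"controller"}),
	}
}
```

A caller who needs finer (or coarser) buckets would construct their own HistogramVec and pass it in, instead of relying on the default.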

@mengqiy (Member) commented Dec 17, 2018

The 1st option makes sense to me.

@DirectXMan12 (Contributor)

I think we can probably choose some better buckets, too. This same discussion is being had right now in the main k/k repo ;-).

@DirectXMan12 (Contributor)

It seems to me like exposing the exact mechanics of how we observe metrics is exposing an implementation detail as public API (at least, that's my gut feeling), particularly since the only way to actually change the bucket definition would be to expose the individual metrics at an API level. That would mean we'd be tied to a particular set of metrics with a particular interface for measuring them (and that only a major revision could change or remove metrics, though it's unclear whether that's a bad thing). If we ever wanted to switch to, say, contextual metrics measurement, we couldn't do so without a major version rev, even if we didn't actually change the metrics exposed.

My instinct here is to fix our buckets and wait for more evidence that a "one-size-fits-all" solution won't work here, especially since we have little evidence that our current buckets are correct for many cases. The Kubernetes SLO for request latency against the API server is 1s, IIRC, and that's on the high end; the current bucket selection seems poor for that case. k/k is discussing the same thing right now in the switch from summaries to histograms.

@mengqiy (Member) commented Dec 27, 2018

Ref: kubernetes/kubernetes#63750 and kubernetes/kubernetes#67476 are the upstream issue and PR.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 27, 2019
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels May 27, 2019
@mengqiy mengqiy added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. labels May 28, 2019
@DirectXMan12 (Contributor)

/remove-lifecycle stale

looks like kube has fixed this upstream, so we can follow suit

/good-first-issue

@k8s-ci-robot (Contributor)

@DirectXMan12:
This request has been marked as suitable for new contributors.

Please ensure the request meets the requirements listed here.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-good-first-issue command.

In response to this:

/remove-lifecycle stale

looks like kube has fixed this upstream, so we can follow suit

/good-first-issue

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added good first issue Denotes an issue ready for a new contributor, according to the "help wanted" guidelines. help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. labels Jun 4, 2019
DirectXMan12 pushed a commit that referenced this issue Jan 31, 2020
Update kubebuilder OWNERS
@vincepri vincepri removed the lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. label Feb 20, 2020
@vincepri vincepri added this to the Next milestone Feb 20, 2020
@vincepri (Member)

/priority awaiting-more-evidence

@k8s-ci-robot k8s-ci-robot added the priority/awaiting-more-evidence Lowest priority. Possibly useful, but not yet enough support to actually get it done. label Feb 20, 2020
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 20, 2020
@irajdeep (Contributor)

Hi @DirectXMan12
We also have a use case where we need longer-duration buckets (> 10s) for the controller_runtime_reconcile_time_seconds histogram.
Should we add buckets to the controller-runtime metrics in line with the upstream changes made in this PR: kubernetes/kubernetes#73638?

irajdeep added a commit to irajdeep/controller-runtime that referenced this issue Nov 24, 2020
…gram metrics

The current metric uses the default Prometheus buckets for the reconcile time histogram. These buckets are
not sufficient to reason about the percentile of requests that take less than x seconds when x falls outside
the largest bucket of 10 seconds. It is also hard to tell when the reconcile loops are fairly fast, as mentioned in
this issue: kubernetes-sigs#258.

This PR defines explicit buckets for the metric; the values are chosen based on the apiserver
request latency buckets defined here: https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/apiserver/pkg/endpoints/metrics/metrics.go#L103.
The default Prometheus histogram buckets have also been added (wherever missing) to ensure backward compatibility.
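
For illustration, a sketch of that shape: keep the default buckets for backward compatibility and extend the upper range. The extra bucket values and help text below are examples, not the values actually proposed in the PR.

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

// reconcileTime keeps the stock Prometheus buckets and appends larger ones
// so reconciles slower than 10s still land in a finite bucket.
var reconcileTime = prometheus.NewHistogramVec(prometheus.HistogramOpts{
	Name:    "controller_runtime_reconcile_time_seconds",
	Help:    "Length of time per reconcile per controller",
	Buckets: append(prometheus.DefBuckets, 15, 30, 60, 120, 300, 600),
}, []string{"controller"})
```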