
Metrics #1477

Closed
randomvariable opened this issue Oct 2, 2019 · 64 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature. priority/backlog Higher priority than priority/awaiting-more-evidence.

Comments

@randomvariable (Member)

User Story

(copied from kubernetes-sigs/cluster-api-provider-aws#78 )

A cluster operator may want to make metrics from the controllers available in their monitoring system of choice, e.g. Prometheus. They would be interested in metrics regarding the creation, deletion, mutation, and failure rates of AWS API calls, and in the state of their objects, at least to help diagnose when limits have been reached.

Goals

  • Produce sets of metrics around operations

Non-Goals

  • N/A

Anything else you would like to add:

The current version of controller-runtime should have OpenCensus support.

/kind feature
/priority important/long-term
/milestone next

@k8s-ci-robot (Contributor)

@randomvariable: You must be a member of the kubernetes-sigs/cluster-api-maintainers GitHub team to set the milestone. If you believe you should be able to issue the /milestone command, please contact your Cluster API Maintainers and have them propose you as an additional delegate for this responsibility.

In response to this:

/milestone next

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the kind/feature Categorizes issue or PR as related to a new feature. label Oct 2, 2019
@k8s-ci-robot (Contributor)

@randomvariable: The label(s) priority/important/long-term cannot be applied. These labels are supported: api-review, community/discussion, community/maintenance, community/question, cuj/build-train-deploy, cuj/multi-user, platform/aws, platform/azure, platform/gcp, platform/minikube, platform/other

In response to this:

/priority important/long-term


@vincepri (Member) commented Oct 2, 2019

/priority important-long-term

@k8s-ci-robot (Contributor)

@vincepri: The label(s) priority/important-long-term cannot be applied. These labels are supported: api-review, community/discussion, community/maintenance, community/question, cuj/build-train-deploy, cuj/multi-user, platform/aws, platform/azure, platform/gcp, platform/minikube, platform/other

In response to this:

/priority important-long-term


@vincepri vincepri added the priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. label Oct 2, 2019
@vincepri vincepri added this to the Next milestone Oct 2, 2019
@wfernandes (Contributor)

Are we thinking of adding Prometheus-style metrics and an endpoint on the controller manager?

@detiber (Member) commented Oct 8, 2019

There should already be rudimentary support for the metrics endpoint through controller-runtime; we should just need to add additional metrics.

@ncdc (Contributor) commented Oct 8, 2019

Yes. We already have the metrics server (configured via `MetricsBindAddress: metricsAddr`). +1 to what Jason wrote.
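As a concrete (illustrative) sketch: with `MetricsBindAddress` set to e.g. `:8080`, Prometheus only needs a scrape job pointing at that port. The job name and target address below are assumptions, not the project's actual deployment values:

```yaml
# Hypothetical Prometheus scrape config for the manager's metrics
# endpoint exposed via MetricsBindAddress (assumed to be :8080).
scrape_configs:
  - job_name: capi-controller-manager                          # illustrative
    static_configs:
      - targets: ["capi-controller-manager.capi-system:8080"]  # assumed Service/namespace
```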

@wfernandes (Contributor)

Thanks for the tips. I can start to take a look into adding some metrics.

@wfernandes (Contributor)

/assign

@randomvariable (Member, Author)

Let me know when you start work on it @wfernandes and we'll mark it as lifecycle/active

@wfernandes (Contributor)

/lifecycle active
Wanted to start a discussion about which metrics we want our controllers to track.

This file (capi-metrics.txt) contains some of the current metrics provided by the controller-runtime framework.

I noticed we have an event recorder that is used by some of the controllers to track important events. For example, in the machine controller we have FailedDrainNode and FailedDeleteNode events. We could start there for now.

I would like the additional metrics we expose to be useful to an operator, such that they might, for example, want to set up an alerting rule based on them.
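As one hedged example of such an alerting rule: controller-runtime already exports a reconcile-error counter (`controller_runtime_reconcile_errors_total`). The rule below is a sketch only; the controller label value, threshold, and durations are assumptions to be tuned per environment:

```yaml
# Sketch of an alerting rule over controller-runtime's built-in
# reconcile-error counter; threshold and durations are illustrative.
groups:
  - name: cluster-api
    rules:
      - alert: MachineReconcileErrorsElevated
        expr: rate(controller_runtime_reconcile_errors_total{controller="machine"}[5m]) > 0.1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: Machine controller reconcile error rate is elevated
```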

@k8s-ci-robot k8s-ci-robot added the lifecycle/active Indicates that an issue or PR is actively being worked on by a contributor. label Oct 10, 2019
@wfernandes (Contributor)

Adding these for reference:
Kubernetes Instrumentation Guidelines: https://github.com/kubernetes/community/blob/master/contributors/devel/sig-instrumentation/instrumentation.md

@detiber (Member) commented Oct 11, 2019

We've talked about adding additional events to the Machine. For example, I think it would make sense to emit an event when each of the following occurs:

  • Status.BootstrapReady is set
  • Status.InfrastructureReady is set
  • Status.NodeRef is set
  • ErrorReason/ErrorMessage is set

@randomvariable (Member, Author)

We've also just added https://github.com/kubernetes-sigs/cluster-api-provider-aws/blob/master/docs/observability.md to the AWS provider, a similar doc here would also be useful to consumers.

@harishspqr

It would be great if the cluster/machine deployment create/delete/resize latencies could be included in the CRD status field once they are created.

@ncdc (Contributor) commented Oct 14, 2019

@harishspqr if we record them, they wouldn't be included in the status field. They would be exposed as metrics accessible from the metrics server port.

@detiber (Member) commented Oct 14, 2019

@harishspqr What type of calculation are you looking for out of these latencies?

  • The duration from request to completion?
  • The duration from request to start of provisioning?
  • The duration from start of provisioning to finished provisioning?
  • All of the above?
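None of these latency metrics exist yet; but assuming a hypothetical histogram such as `capi_machine_provisioning_duration_seconds` with a `phase` label covering the stages above (e.g. request-to-start, start-to-finished), the durations could be surfaced with recording rules like:

```yaml
# Hypothetical recording rule: capi_machine_provisioning_duration_seconds
# does not exist today and is assumed here to be a histogram with a
# "phase" label distinguishing the stages discussed above.
groups:
  - name: capi-provisioning-latency
    rules:
      - record: capi:machine_provisioning_duration_seconds:p95
        expr: |
          histogram_quantile(0.95,
            sum(rate(capi_machine_provisioning_duration_seconds_bucket[10m])) by (le, phase))
```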

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. and removed lifecycle/active Indicates that an issue or PR is actively being worked on by a contributor. labels Jan 12, 2020
@vincepri (Member)

/lifecycle frozen

@sbueringer (Member)

/reopen
/remove-lifecycle rotten

@k8s-ci-robot (Contributor)

@sbueringer: Reopened this issue.

In response to this:

/reopen
/remove-lifecycle rotten


@k8s-ci-robot k8s-ci-robot reopened this Jan 7, 2022
@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Jan 7, 2022
@fabriziopandini fabriziopandini modified the milestones: v0.4, Next Jan 26, 2022
@vincepri (Member)

@fabriziopandini to split this issue into multiple ones and provide additional context / resolution

@fabriziopandini (Member)

I think the best way forward for the metrics effort is to join forces with https://github.com/mercedes-benz/cluster-api-state-metrics in order to provide a first set of "core metrics" about the Cluster API objects.
What I really like about this approach is that metrics are a pluggable component, running outside of the CAPI/provider process, providing not only valuable insights but also a first validation that our APIs provide all the info an operator is looking for.

WRT implementing metrics inside CAPI, I would limit this to a small number of use cases (TBD) focused on the internal functioning of the controllers, e.g.:

  • counter of errors while connecting to remote clusters
  • average duration of runtime extension calls
  • etc.

I will raise the point at the next office hours to see if there is agreement on this course of action.
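In a real implementation, counters like the first bullet above would be prometheus collectors registered with controller-runtime's metrics.Registry. As a dependency-free sketch, the stdlib-only program below only illustrates the shape such a counter would take in Prometheus text exposition format; the metric name `capi_remote_cluster_connection_errors_total` is a made-up example, not an existing metric:

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// formatCounter renders one counter sample in the Prometheus text
// exposition format served by a metrics endpoint. Label pairs are
// sorted so the output is deterministic.
func formatCounter(name string, labels map[string]string, value float64) string {
	keys := make([]string, 0, len(labels))
	for k := range labels {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	pairs := make([]string, 0, len(keys))
	for _, k := range keys {
		pairs = append(pairs, fmt.Sprintf("%s=%q", k, labels[k]))
	}
	return fmt.Sprintf("# TYPE %s counter\n%s{%s} %g",
		name, name, strings.Join(pairs, ","), value)
}

func main() {
	// Hypothetical metric for the "errors while connecting to remote
	// clusters" use case discussed above.
	fmt.Println(formatCounter(
		"capi_remote_cluster_connection_errors_total",
		map[string]string{"cluster": "test-cluster"},
		3,
	))
}
```

In practice the prometheus client library produces this output; the point here is only the metric's shape (name, labels, monotonically increasing value).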

@enxebre (Member) commented Feb 16, 2022

WRT to implementing metrics inside CAPI, I will limit this to a limited number of use cases TBD and focused on internal functioning of the controllers, e.g. :
counter of errors while connecting remote Clusters
average duration of runtime extension calls
etc.

This makes sense to me; these can serve as basic SLIs that e.g. CI systems consume to ensure SLOs are met.

Regarding https://github.com/mercedes-benz/cluster-api-state-metrics, which looks awesome: in order to move it towards a community pluggable-component solution, I think it'd be good to have a common doc/issue to discuss and agree on all the use cases/goals we want to achieve with those metrics.

@vincepri (Member)

We could also ask the maintainers of cluster-api-state-metrics whether they might be interested in donating the project to the CNCF. We could then incorporate it within Cluster API and ship it as an extra component.

@sbueringer (Member) commented Feb 17, 2022

If I understood correctly, the maintainers have already offered that :) (@chrischdi @tobiasgiese please confirm :))

@vincepri What would be the road towards donation?

@chrischdi (Member)

Yes, that was one possible goal, and a reason for us to publish it to GitHub in the first place.

I think the donation would help the project get more contributions and would increase acceptance to use it.

We will talk internally to the relevant people to clarify whether proceeding is okay for them (to settle the formalities) 👍

@chrischdi (Member)

So after having a first talk, we would be willing to donate the project :-)

Would a donation to kubernetes-sigs be a better fit than donating it to the CNCF?

If contributing to kubernetes-sigs, the requirements are documented here: source

This would require us to:

  • relicense the code to Apache 2.0 (also okay from our side)
  • ensure all current contributors have signed the CNCF CLA (currently the case)
  • ensure the licenses of dependencies are acceptable (should be the case)
  • adopt the CNCF CLA bot, merge bot, and Kubernetes PR commands/bots
  • ensure all OWNERS of the project are also active SIG members (currently me and @tobiasgiese, both active as cluster-api-provider-openstack maintainers)
  • have the move approved by the process spelled out in the SIG's charter, with a publicly linkable written decision available for it
  • ensure the SIG has already identified all of its existing subprojects and code, with valid OWNERS files, in sigs.yaml

@fabriziopandini (Member) commented Feb 24, 2022

@chrischdi that's great news!
Let's discuss the next steps during office hours!

Personally, I'm also intrigued by Vince's suggestion:

We could then incorporate it within Cluster API and ship it as an extra component

This could give us a quicker start (it is just a PR), and most importantly it allows closer collaboration with the CAPI core team while migrating the project to v1beta1, consolidating the first set of metrics, etc. But I would be happy with either option.

@sbueringer (Member) commented Feb 24, 2022

I agree, it's an interesting option to integrate it into the core CAPI repo. It would definitely make it a lot easier to integrate with Tilt, the release process, etc.

I think in that case we would create a new (probably top-level) folder in the CAPI repo with an OWNERS file and add the current maintainers as approvers there.

From a high-level perspective, it would essentially become a core component of Cluster API.

@chrischdi (Member)

That sounds good. I will attend next week 👍 .

@tobiasgiese or @johannesfrey : would be great if one of you could also free up some time :-)

@sbueringer (Member) commented Feb 24, 2022

One additional point:

I think it also helps us close the feedback loop on metrics for CAPI developers, just like we currently try to make logging part of our development workflow.

If we're able to include metrics and logging in our development workflows (especially for debugging), it will help us continuously improve both, and we will slowly come to consider them part of the Cluster API "deliverable": taking on responsibility not only for the behavior/implementation of our controllers but also for how observable, operable, and debuggable they are.

I think this gets us closer to the principle that things usually improve once the folks responsible for development use and rely on them themselves (in its most extreme form when they are on call for it, but in our case by integrating them into dev workflows).

@tobiasgiese (Member)

@tobiasgiese or @johannesfrey : would be great if one of you could also free up some time :-)

Sure, I'll attend also, looking forward to it

@johannesfrey (Contributor)

@tobiasgiese or @johannesfrey : would be great if one of you could also free up some time :-)

Sure, I'll attend also, looking forward to it

Sure, count me in as well :).

@chrischdi (Member)

First of all, thanks for the discussion in the office hours.

As a follow-up, the next steps would be:

@tobiasgiese (Member) commented Mar 17, 2022

We created a first draft of the Cluster API State Metrics proposal.
Thanks @chrischdi and @johannesfrey 🙂🎉

@killianmuldoon (Contributor)

If there are no objections, I think we should close this issue now that #6404 is open. For future conversations around metrics, it's better IMO to have issues with a tighter scope.

@fabriziopandini (Member)

+1 to closing this as soon as #6404 merges

@sbueringer (Member)

I think #6404 only addresses state metrics; that still leaves the metrics exposed by our controllers. Did I understand correctly that we would open a new, tightly scoped issue for those before closing this one?

@killianmuldoon (Contributor)

@sbueringer I think this issue is very old and has gone through too many transformations to be really useful anymore, so any future work should go into newer, more tightly scoped issues.

I'm not sure, though, exactly which specific controller-exposed metrics we'd be missing by closing this issue now.

@fabriziopandini (Member)

/close
we can open a new issue as soon as we have actionable items for metrics exposed by our controllers

@k8s-ci-robot (Contributor)

@fabriziopandini: Closing this issue.

In response to this:

/close

