
Metrics #1477

Closed
randomvariable opened this issue Oct 2, 2019 · 64 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature. priority/backlog Higher priority than priority/awaiting-more-evidence.

Comments

@randomvariable (Member)

User Story

(copied from kubernetes-sigs/cluster-api-provider-aws#78 )

A cluster operator may want to make metrics from the controllers available in their monitoring system of choice, e.g. Prometheus. They would be interested in metrics regarding the creation, deletion, mutation, and failure rates of AWS API calls, and in the state of their objects, at least to help diagnose when limits have been reached.

Goals

  • Produce sets of metrics around operations

Non-Goals

  • N/A

Anything else you would like to add:

The current version of controller-runtime should have OpenCensus support.

/kind feature
/priority important/long-term
/milestone next

@k8s-ci-robot (Contributor)

@randomvariable: You must be a member of the kubernetes-sigs/cluster-api-maintainers GitHub team to set the milestone. If you believe you should be able to issue the /milestone command, please contact your Cluster API Maintainers and have them propose you as an additional delegate for this responsibility.

In response to this:

/milestone next

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the kind/feature Categorizes issue or PR as related to a new feature. label Oct 2, 2019
@k8s-ci-robot (Contributor)

@randomvariable: The label(s) priority/important/long-term cannot be applied. These labels are supported: api-review, community/discussion, community/maintenance, community/question, cuj/build-train-deploy, cuj/multi-user, platform/aws, platform/azure, platform/gcp, platform/minikube, platform/other

In response to this:

/priority important/long-term


@vincepri (Member) commented Oct 2, 2019

/priority important-long-term

@k8s-ci-robot (Contributor)

@vincepri: The label(s) priority/important-long-term cannot be applied. These labels are supported: api-review, community/discussion, community/maintenance, community/question, cuj/build-train-deploy, cuj/multi-user, platform/aws, platform/azure, platform/gcp, platform/minikube, platform/other

In response to this:

/priority important-long-term


@vincepri vincepri added the priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. label Oct 2, 2019
@vincepri vincepri added this to the Next milestone Oct 2, 2019
@wfernandes (Contributor)

Are we thinking of adding Prometheus-style metrics and an endpoint on the controller manager?

@detiber (Member) commented Oct 8, 2019

There should already be rudimentary support for the metrics endpoint through controller-runtime; we should just need to add additional metrics.

@ncdc (Contributor) commented Oct 8, 2019

Yes. We already have the metrics server (configured via `MetricsBindAddress: metricsAddr`). +1 to what Jason wrote.
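As a concrete (illustrative) sketch: with `MetricsBindAddress` set to e.g. `:8080`, Prometheus only needs a scrape job pointing at that port. The job name and target address below are assumptions, not the project's actual deployment values:

```yaml
# Hypothetical Prometheus scrape config for the manager's metrics
# endpoint exposed via MetricsBindAddress (assumed to be :8080).
scrape_configs:
  - job_name: capi-controller-manager                          # illustrative
    static_configs:
      - targets: ["capi-controller-manager.capi-system:8080"]  # assumed Service/namespace
```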

@wfernandes (Contributor)

Thanks for the tips. I can start to take a look into adding some metrics.

@wfernandes (Contributor)

/assign

@randomvariable (Member, Author)

Let me know when you start work on it @wfernandes and we'll mark it as lifecycle/active

@wfernandes (Contributor)

/lifecycle active
Wanted to start a discussion about which metrics we want our controllers to track.

This file (capi-metrics.txt) contains some of the current metrics provided by the controller-runtime framework.

I noticed we have an event recorder that is used by some of the controllers to track important events. For example, in the machine controller we have FailedDrainNode and FailedDeleteNode events. We could start there for now.

I would like the additional metrics we expose to be useful to an operator, such that they might, for example, want to set up an alerting rule based on them.
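As one hedged example of such an alerting rule: controller-runtime already exports a reconcile-error counter (`controller_runtime_reconcile_errors_total`). The rule below is a sketch only; the controller label value, threshold, and durations are assumptions to be tuned per environment:

```yaml
# Sketch of an alerting rule over controller-runtime's built-in
# reconcile-error counter; threshold and durations are illustrative.
groups:
  - name: cluster-api
    rules:
      - alert: MachineReconcileErrorsElevated
        expr: rate(controller_runtime_reconcile_errors_total{controller="machine"}[5m]) > 0.1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: Machine controller reconcile error rate is elevated
```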

@k8s-ci-robot k8s-ci-robot added the lifecycle/active Indicates that an issue or PR is actively being worked on by a contributor. label Oct 10, 2019
@wfernandes (Contributor)

Adding these for reference:
Kubernetes Instrumentation Guidelines: https://github.com/kubernetes/community/blob/master/contributors/devel/sig-instrumentation/instrumentation.md

@detiber (Member) commented Oct 11, 2019

We've talked about adding additional events to the Machine. For example, I think it would make sense to emit an event when each of the following occurs:

  • Status.BootstrapReady is set
  • Status.InfrastructureReady is set
  • Status.NodeRef is set
  • ErrorReason/ErrorMessage is set

@randomvariable (Member, Author)

We've also just added https://github.com/kubernetes-sigs/cluster-api-provider-aws/blob/master/docs/observability.md to the AWS provider, a similar doc here would also be useful to consumers.

@harishspqr

It would be great if the cluster/machine deployment create/delete/resize latencies could be included in the CRD status field once they are created.

@ncdc (Contributor) commented Oct 14, 2019

@harishspqr if we record them, they wouldn't be included in the status field. They would be exposed as metrics accessible from the metrics server port.

@detiber (Member) commented Oct 14, 2019

@harishspqr What type of calculation are you looking for out of these latencies?

  • The duration from request to completion?
  • The duration from request to start of provisioning?
  • The duration from start of provisioning to finished provisioning?
  • All of the above?
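None of these latency metrics exist yet; but assuming a hypothetical histogram such as `capi_machine_provisioning_duration_seconds` with a `phase` label covering the stages above (e.g. request-to-start, start-to-finished), the durations could be surfaced with recording rules like:

```yaml
# Hypothetical recording rule: capi_machine_provisioning_duration_seconds
# does not exist today and is assumed here to be a histogram with a
# "phase" label distinguishing the stages discussed above.
groups:
  - name: capi-provisioning-latency
    rules:
      - record: capi:machine_provisioning_duration_seconds:p95
        expr: |
          histogram_quantile(0.95,
            sum(rate(capi_machine_provisioning_duration_seconds_bucket[10m])) by (le, phase))
```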

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. and removed lifecycle/active Indicates that an issue or PR is actively being worked on by a contributor. labels Jan 12, 2020
@vincepri (Member)

/lifecycle frozen

@sbueringer (Member)

/reopen
/remove-lifecycle rotten

@k8s-ci-robot (Contributor)

@sbueringer: Reopened this issue.

In response to this:

/reopen
/remove-lifecycle rotten


@k8s-ci-robot k8s-ci-robot reopened this Jan 7, 2022
@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Jan 7, 2022
@fabriziopandini fabriziopandini modified the milestones: v0.4, Next Jan 26, 2022
@vincepri (Member)

@fabriziopandini to split this issue into multiple ones and provide additional context / resolution

@fabriziopandini (Member)

I think the best way forward for the metrics effort is to join forces with https://github.com/mercedes-benz/cluster-api-state-metrics in order to provide a first set of "core metrics" about the Cluster API objects.
What I really like about this approach is that metrics are a pluggable component, running outside of the CAPI/provider process, providing not only valuable insights but also a first validation that our APIs provide all the info an operator is looking for.

WRT implementing metrics inside CAPI, I would limit this to a small number of use cases (TBD) focused on the internal functioning of the controllers, e.g.:

  • counter of errors while connecting to remote clusters
  • average duration of runtime extension calls
  • etc.

I will raise the point at the next office hours to see if there is agreement on this course of action.
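In a real implementation, counters like the first bullet above would be prometheus collectors registered with controller-runtime's metrics.Registry. As a dependency-free sketch, the stdlib-only program below only illustrates the shape such a counter would take in Prometheus text exposition format; the metric name `capi_remote_cluster_connection_errors_total` is a made-up example, not an existing metric:

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// formatCounter renders one counter sample in the Prometheus text
// exposition format served by a metrics endpoint. Label pairs are
// sorted so the output is deterministic.
func formatCounter(name string, labels map[string]string, value float64) string {
	keys := make([]string, 0, len(labels))
	for k := range labels {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	pairs := make([]string, 0, len(keys))
	for _, k := range keys {
		pairs = append(pairs, fmt.Sprintf("%s=%q", k, labels[k]))
	}
	return fmt.Sprintf("# TYPE %s counter\n%s{%s} %g",
		name, name, strings.Join(pairs, ","), value)
}

func main() {
	// Hypothetical metric for the "errors while connecting to remote
	// clusters" use case discussed above.
	fmt.Println(formatCounter(
		"capi_remote_cluster_connection_errors_total",
		map[string]string{"cluster": "test-cluster"},
		3,
	))
}
```

In practice the prometheus client library produces this output; the point here is only the metric's shape (name, labels, monotonically increasing value).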

@enxebre (Member) commented Feb 16, 2022

WRT to implementing metrics inside CAPI, I will limit this to a limited number of use cases TBD and focused on internal functioning of the controllers, e.g. :
counter of errors while connecting remote Clusters
average duration of runtime extension calls
etc.

This makes sense to me; these can serve as basic SLIs that e.g. CI systems consume to ensure SLOs are met.

Regarding https://github.com/mercedes-benz/cluster-api-state-metrics, which looks awesome: in order to move it towards a community pluggable-component solution, I think it'd be good to have a common doc/issue to discuss and agree on all the use cases/goals we want to achieve with those metrics.

@vincepri (Member)

We could also ask the maintainers of cluster-api-state-metrics whether they might be interested in donating the project to the CNCF. We could then incorporate it within Cluster API and ship it as an extra component.

@sbueringer (Member) commented Feb 17, 2022

If I understood correctly, the maintainers have already offered that :) (@chrischdi @tobiasgiese please confirm :))

@vincepri What would be the road towards donation?

@chrischdi (Member)

Yes, that was one possible goal, and a reason for us to publish it to GitHub in the first place.

I think the donation would help the project get more contributions and would increase acceptance to use it.

We will talk internally to the relevant people to clarify whether proceeding is okay for them (to settle the formalities) 👍

@chrischdi (Member)

So after having a first talk, we would be willing to donate the project :-)

Would a donation to kubernetes-sigs be a better fit than donating it to the CNCF?

If contributing to kubernetes-sigs, the requirements are documented here: source

This would require us to:

  • relicense the code to Apache 2.0 (also okay from our side)
  • ensure all current contributors have signed the CNCF CLA (currently the case)
  • ensure the licenses of dependencies are acceptable (should be the case)
  • adopt the CNCF CLA bot, merge bot, and Kubernetes PR commands/bots
  • ensure all OWNERS of the project are also active SIG members (currently me and @tobiasgiese, both active as cluster-api-provider-openstack maintainers)
  • have the move approved by the process spelled out in the SIG's charter, with a publicly linkable written decision available for it
  • ensure the SIG has already identified all of its existing subprojects and code, with valid OWNERS files, in sigs.yaml

@fabriziopandini (Member) commented Feb 24, 2022

@chrischdi that's great news!
Let's discuss the next steps during office hours!

Personally, I'm also intrigued by Vince's suggestion:

We could then incorporate it within Cluster API and ship it as an extra component

This could give us a quicker start (it is just a PR), and most importantly it allows closer collaboration with the CAPI core team while migrating the project to v1beta1, consolidating the first set of metrics, etc. But I would be happy with either option.

@sbueringer (Member) commented Feb 24, 2022

I agree, it's an interesting option to integrate it into the core CAPI repo. It would definitely make it a lot easier to integrate with Tilt, the release process, etc.

I think in that case we would create a new (probably top-level) folder in the CAPI repo with an OWNERS file and add the current maintainers as approvers there.

From a high-level perspective, it would essentially become a core component of Cluster API.

@chrischdi (Member)

That sounds good. I will attend next week 👍 .

@tobiasgiese or @johannesfrey : would be great if one of you could also free up some time :-)

@sbueringer (Member) commented Feb 24, 2022

One additional point:

I think it also helps us close the feedback loop on metrics for CAPI developers, just like we currently try to make logging part of our development workflow.

If we're able to include metrics and logging in our development workflows (especially for debugging), it will help us continuously improve both, and we will slowly come to consider them part of the Cluster API "deliverable": taking on responsibility not only for the behavior/implementation of our controllers but also for how observable, operable, and debuggable they are.

I think this gets us closer to the principle that things usually improve once the folks responsible for development use and rely on them themselves (in its most extreme form when they are on call for it, but in our case by integrating them into dev workflows).

@tobiasgiese (Member)

@tobiasgiese or @johannesfrey : would be great if one of you could also free up some time :-)

Sure, I'll attend also, looking forward to it

@johannesfrey (Contributor)

@tobiasgiese or @johannesfrey : would be great if one of you could also free up some time :-)

Sure, I'll attend also, looking forward to it

Sure, count me in as well :).

@chrischdi (Member)

First of all, thanks for the discussion in the office hours.

As a follow-up, the next steps would be:

@tobiasgiese (Member) commented Mar 17, 2022

We created a first draft of the Cluster API State Metrics proposal.
Thanks @chrischdi and @johannesfrey 🙂🎉

@killianmuldoon (Contributor)

If there are no objections, I think we should close this issue now that #6404 is open. For future conversations around metrics, it's better IMO to have issues with a tighter scope.

@fabriziopandini (Member)

+1 to closing this as soon as #6404 merges

@sbueringer (Member)

I think #6404 only addresses state metrics; that still leaves the metrics exposed by our controllers. Did I understand correctly that we would open a new, tightly scoped issue for those before closing this one?

@killianmuldoon (Contributor)

@sbueringer I think this issue is very old and has gone through too many transformations to be really useful anymore, so any future work should go into newer, more tightly scoped issues.

I'm not sure, though, exactly which specific controller-exposed metrics we'd be missing by closing this issue now.

@fabriziopandini (Member)

/close
we can open a new issue as soon as we have actionable items for metrics exposed by our controllers

@k8s-ci-robot (Contributor)

@fabriziopandini: Closing this issue.

In response to this:

/close

