Add documentation about controller metrics #4138
ea94f6d to 4bd17b2
Prometheus configuration-wise, this looks OK. I'm not sure what the completeness goal for this PR is. I just did a quick query to figure out which metrics my currently running 1.6.4 kube-controller-manager exposes. This is my result.
I agree with @piosz that we may not want to list all metrics, because maintaining such a list is hard and metric names or types can change. I would limit the scope of this document to a sneak preview of what a cluster admin can expect from these metrics rather than making the list exhaustive.
How about writing a doc that is not specific to one component (controller-manager) and one collection pipeline (Prometheus) instead?
@piosz I do not disagree with you there, but I do not think such a document should be owned by #sig-storage. Also, Prometheus's configuration of Kubernetes service discovery has been a moving target between Prometheus versions. In fact, I found most docs on the internet about Prometheus Kubernetes SD configuration to be subtly wrong. There is stuff like changing values from arrays to strings, renaming
Also, there is a bit of opaqueness that needs documenting in some form or other. Most Kubernetes users I spoke to are unaware that controller metrics are available at So I wanted to make that information part of the end-user documentation. I think most users can configure the metrics pipeline from that point onwards. It is tricky to pick a particular version of Prometheus and document it (at the risk of repeating myself).
protocol: TCP
```
After that prometheus's service discovery mechanism can automatically discover controller metrics and scrap them periodically as per configuration. Please refer to [Prometheus Configuration](https://prometheus.io/docs/operating/configuration/) for
scrap -> scrape
{% endcapture %}
{% capture body %}
## What are controller metrics
What are controller manager metrics
{% capture body %}
## What are controller metrics
Controller manager metrics provide important insight into performance and health of controller manager.
... health of the controller manager
These metrics include common Go language runtime metrics such as go_routine count and controller specif
ic metrics such as
specif\nic
unintentional newline?
etcd request latencies or cloudprovider (AWS, GCE, Openstack) api latencies that can be used
api -> API
to gauge health of cluster.
Starting from Kubernetes 1.7, detailed Cloudprovider metrics are available for storage operations for GCE, AWS, Vsphere and Openstack.
These metrics can be used to monitor health of persistent volume operations.
The metrics cover all cloud operations, not just PV?
Only for GCE are metrics available for all operations, not for the other cloud providers.
## Configuration
Typically in a cluster, controller metrics are available at - `http://localhost:10252/metrics` assuming
Seems simpler to just say:
Controller metrics are available from `http://localhost:10252/metrics` from the host where the controller manager is running.
metrics are being retrieved locally from host where controller manager is running.
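A minimal Python sketch of reading that endpoint from the controller manager's host, roughly what `curl` against the URL does. It assumes the controller manager still serves metrics on the insecure port 10252 quoted in the diff; `fetch_controller_metrics` and `DEFAULT_METRICS_URL` are hypothetical names, not part of any Kubernetes API.

```python
import urllib.request

# Endpoint quoted in the diff; assumes the insecure port 10252 is enabled.
DEFAULT_METRICS_URL = "http://localhost:10252/metrics"


def fetch_controller_metrics(url: str = DEFAULT_METRICS_URL,
                             timeout: float = 5.0) -> str:
    """Return the raw body served at a metrics endpoint (hypothetical helper)."""
    # urlopen raises urllib.error.HTTPError for 4xx/5xx responses and
    # urllib.error.URLError if nothing is listening on the port.
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        # The Prometheus text format is UTF-8 encoded.
        return resp.read().decode("utf-8")
```

On the controller manager host, `print(fetch_controller_metrics()[:500])` would show the beginning of the same text `curl` prints.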
The metrics are emitted in prometheus format and are human readable (go ahead curl that url!).
Include a link to the Prometheus format. Remove "(go ahead curl that url!)".
In production environment though - you may want to configure prometheus or some other metrics scraper
In a production environment you may ...
protocol: TCP
```
After that prometheus's service discovery mechanism can automatically discover controller metrics and scrap them periodically as per configuration. Please refer to [Prometheus Configuration](https://prometheus.io/docs/operating/configuration/) for
Is it possible to give a snippet of Prometheus configuration for accessing the service endpoint? I remember it wasn't super obvious from reading the Prometheus docs.
The Prometheus service discovery mechanism can scrape the service endpoint defined above.
It is possible, yes, but it varies between versions. I have run prometheus-1.1 in a pod, and it was different from running prometheus-1.7 outside. It also depends on things like whether controller-manager is running in a pod or not. I am a bit hesitant to document something that will soon be out of date and doesn't necessarily align with Kubernetes release cycles.
ok
4bd17b2 to d997764
Some minor grammar nits.
{% capture overview %}
Controller manager metrics provide important insight into performance and health of
controller manager.
add "the": "...the controller manager."
---
{% capture overview %}
Controller manager metrics provide important insight into performance and health of
add "the": "...insight into the performance and health..."
{% capture body %}
## What are controller manager metrics
Controller manager metrics provide important insight into performance and health of the controller manager.
add "the": "...insight into the performance and health..."
These metrics include common Go language runtime metrics such as go_routine count and controller specific metrics such as
etcd request latencies or cloudprovider (AWS, GCE, Openstack) API latencies that can be used
to gauge health of cluster.
add "the" and "a": "...to gauge the health of a cluster."
etcd request latencies or cloudprovider (AWS, GCE, Openstack) API latencies that can be used
For consistency, Cloudprovider should be capitalized here.
Starting from Kubernetes 1.7, detailed Cloudprovider metrics are available for storage operations for GCE, AWS, Vsphere and Openstack.
These metrics can be used to monitor health of persistent volume operations.
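A rough sketch of pulling only the cloud provider series out of the scraped text. The `cloudprovider_` name prefix is an assumption (the diff's own list of GCE metric names is not reproduced here), so check it against what your controller manager actually exposes; `cloudprovider_series` is a hypothetical helper.

```python
from typing import List


def cloudprovider_series(exposition_text: str,
                         prefix: str = "cloudprovider_") -> List[str]:
    """Return the sample lines whose metric name starts with `prefix`.

    The default prefix is an assumption; adjust it to the metric names
    your controller manager actually exposes.
    """
    selected = []
    for raw in exposition_text.splitlines():
        line = raw.strip()
        # Skip blank lines and "# HELP" / "# TYPE" metadata lines.
        if not line or line.startswith("#"):
            continue
        if line.startswith(prefix):
            selected.append(line)
    return selected
```

Filtering by prefix keeps the sketch independent of any particular provider, which matters here since only GCE reports metrics for all operations.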
For example for GCE these metrics are called:
add comma: "For example, for GCE..."
The metrics are emitted in [prometheus format](https://prometheus.io/docs/instrumenting/exposition_formats/) and are human readable.
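Because the format is plain text, a small parser is enough to turn it into (name, labels, value) tuples. This is a minimal sketch covering only sample lines of the text exposition format linked above. It skips `# HELP`/`# TYPE` comments, ignores the optional timestamp, and does not validate metric types. `parse_exposition` is a hypothetical helper, not a Prometheus client API.

```python
import re
from typing import Dict, List, Tuple

# A sample line: metric name, optional {label set}, value, optional timestamp.
_SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
    r'(?:\{(?P<labels>.*)\})?'
    r'\s+(?P<value>\S+)'
    r'(?:\s+(?P<timestamp>-?\d+))?\s*$'
)
# key="value" pairs; values may contain \\, \" and \n escapes.
_LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')
_ESCAPE_RE = re.compile(r'\\(.)')

Sample = Tuple[str, Dict[str, str], float]


def _unescape(value: str) -> str:
    # \n becomes a newline; \\ and \" become the escaped character.
    return _ESCAPE_RE.sub(lambda m: "\n" if m.group(1) == "n" else m.group(1), value)


def parse_exposition(text: str) -> List[Sample]:
    """Parse sample lines of the text format into (name, labels, value)."""
    samples: List[Sample] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank line or "# HELP" / "# TYPE" metadata
        match = _SAMPLE_RE.match(line)
        if match is None:
            raise ValueError(f"unparseable sample line: {line!r}")
        labels = {key: _unescape(val)
                  for key, val in _LABEL_RE.findall(match.group("labels") or "")}
        # float() accepts the "NaN", "+Inf" and "-Inf" spellings the format uses.
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples
```

For anything beyond a quick look, a maintained client library's parser is the safer choice. This sketch only shows how simple the line format is.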
In production environment you may want to configure prometheus or some other metrics scraper
add "a" and comma: "In a production environment, you may..."
to periodically gather these metrics and make them available in some kind of time series database.
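To make "periodically gather ... time series database" concrete, here is a toy sketch of what a scraper such as Prometheus does on each interval. The `fetch` callable and the in-memory dict standing in for the time series database are assumptions for illustration. It also assumes samples carry no explicit timestamp, so the last space-separated field is the value.

```python
import time
from collections import defaultdict
from typing import Callable, DefaultDict, List, Tuple

# In-memory stand-in for a time series database: series -> [(unix_ts, value)].
Store = DefaultDict[str, List[Tuple[float, float]]]


def new_store() -> Store:
    return defaultdict(list)


def scrape_loop(fetch: Callable[[], str], rounds: int, interval_s: float,
                store: Store) -> None:
    """Call `fetch` `rounds` times, `interval_s` seconds apart, and record
    every sample under its series key (metric name plus label set)."""
    for i in range(rounds):
        scraped_at = time.time()
        for raw in fetch().splitlines():
            line = raw.strip()
            if not line or line.startswith("#"):
                continue  # skip blank lines and HELP/TYPE metadata
            # Assumes no explicit timestamp, so the value is the last field.
            series, _, value = line.rpartition(" ")
            store[series.strip()].append((scraped_at, float(value)))
        if i + 1 < rounds:
            time.sleep(interval_s)
```

In practice `fetch` would wrap an HTTP GET of the metrics endpoint. A real scraper adds durable storage, retention, and failure handling on top of this loop.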
Prometheus itself can gather controller metrics via built-in service discovery mechanism provided
add "its", "that the", and a comma: "...controller metrics via its built-in service discovery mechanism, provided that the controller's metrics..."
protocol: TCP
```
After that prometheus's service discovery mechanism can automatically discover controller metrics and scrape them periodically as per configuration. Please refer to [Prometheus Configuration](https://prometheus.io/docs/operating/configuration/) for
add comma and "the": "After that, prometheus's service discovery mechanism can automatically discover controller metrics and scrape them periodically as per the configuration."
d997764 to cc40119
@chenopis I've sorted out the requested changes. Thanks for pointing them out!
No worries. Ping me when we're good on the tech side, and I'll merge this.
As I mentioned, I'm not a big fan of documenting a subset of metrics from one component on a separate page, but I can't find a better spot for now. Long term, we should introduce a page about monitoring Kubernetes components.
Please address the issue I raised below. Otherwise, LGTM.
to periodically gather these metrics and make them available in some kind of time series database.
Prometheus itself can gather controller metrics via its built-in service discovery mechanism, provided
Agreed. I think it would be good to have a document describing metric collection for different mechanisms, but I don't think this document is the place for it.
okay, I dropped that whole section.
@gnufied FYI, all feedback must be addressed and LGTMs given by EOD Tue, June 27th so that this can be merged for the 1.7 release on June 28th.
As strange as it may sound, GitHub just ate a commit I pushed to fix the problem. Trying to figure out what's going on.
5676dfd to 99ff364
Oh, GitHub. :(
@gnufied I see the commit now, so is this ready to be merged then?
@chenopis Yes, I think GitHub was having issues. It looks all good now.
/lgtm
Add documentation for controller metrics.
cc @chenopis @saad-ali