Add documentation about controller metrics #4138

gnufied · 2017-06-19T22:56:18Z

Add documentation for controller metrics.


gnufied · 2017-06-20T11:35:28Z

brancz · 2017-06-20T12:04:38Z

Prometheus configuration-wise, this looks OK.

I'm not sure what the completeness goal for this PR is. I just did a quick query to figure out the metrics my currently running 1.6.4 kube-controller-manager exposes. This is my result.

gnufied · 2017-06-20T12:13:53Z

I agree with @piosz that we may not want to list all metrics, because maintaining such a list is hard and metric names or types can change. I would limit the scope of this document to a sneak preview of what a cluster admin can expect from these metrics rather than making the list exhaustive.

piosz · 2017-06-20T13:12:53Z

How about writing a doc that is not relevant to only one component (controller-manager) and one collecting pipeline (Prometheus) instead?

gnufied · 2017-06-20T13:32:34Z

@piosz I do not disagree with you there, but I do not think such a document should be owned by #sig-storage. Also, Prometheus's configuration of Kubernetes service discovery has been a moving target between Prometheus versions. In fact, I found most internet docs on Prometheus kube SD configuration to be subtly wrong. There are changes such as values going from arrays to strings, the api_servers key being renamed to api_server, etc. So, if we are going to document the collection pipeline, we will have to pick a version of Prometheus and a version of Kubernetes and stick with them. I feel #sig-instrumentation should decide which Prometheus version gets documented.

gnufied · 2017-06-20T13:41:18Z

Also, there is a bit of opaqueness that needs documenting in some form or other. Most Kubernetes users I have spoken to are unaware that controller metrics are available at http://localhost:10252/metrics in the default configuration. AFAIK we don't document this anywhere... :(

So I wanted to make that information part of the end-user documentation. I think most users can configure the metrics pipeline from that point onwards. It is tricky to pick a particular version of Prometheus to document (at the risk of repeating myself).
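To illustrate that starting point, here is a minimal Python sketch of reading the endpoint without Prometheus. It is an illustration only, not part of the proposed doc. The URL is the default mentioned above; the helper names (`parse_metrics`, `fetch_metrics`) and the sample text are hypothetical, and the parser is deliberately simplified (it ignores timestamps and assumes label values contain no spaces).

```python
import urllib.request

# Default address quoted in this thread; adjust if your setup differs.
DEFAULT_URL = "http://localhost:10252/metrics"


def parse_metrics(text):
    """Parse Prometheus text-format lines into {sample_name_with_labels: float}.

    Simplified on purpose: skips comment lines (# HELP / # TYPE) and blanks,
    ignores optional timestamps, and assumes label values contain no spaces.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split()
        if len(parts) < 2:
            continue  # not a "name value" sample line
        name, value = parts[0], parts[1]
        try:
            samples[name] = float(value)
        except ValueError:
            continue  # skip anything that is not a numeric sample
    return samples


def fetch_metrics(url=DEFAULT_URL, timeout=5):
    """Fetch and parse the endpoint; meant to run on the controller manager host."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return parse_metrics(resp.read().decode("utf-8"))


# Made-up sample for illustration only (not real output from a cluster).
SAMPLE = """\
# HELP go_goroutines Number of goroutines that currently exist.
# TYPE go_goroutines gauge
go_goroutines 42
example_request_latency_sum{verb="GET"} 1.5
"""

if __name__ == "__main__":
    print(parse_metrics(SAMPLE))
```

On a host where the controller manager listens on the default address, `fetch_metrics()` would return the live samples in the same dictionary shape; a real pipeline would still use a proper scraper.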

saad-ali · 2017-06-20T21:00:51Z

CC @bowei @msau42

bowei · 2017-06-20T21:05:05Z

docs/concepts/cluster-administration/controller-metrics.md

+    protocol: TCP
+```
+
+After that prometheus's service discovery mechanism can automatically discover controller metrics and scrap them periodically as per configuration. Please refer to [Prometheus Configuration](https://prometheus.io/docs/operating/configuration/) for


scrap -> scrape

bowei · 2017-06-20T21:05:37Z

docs/concepts/cluster-administration/controller-metrics.md

+{% endcapture %}
+
+{% capture body %}
+## What are controller metrics


What are controller manager metrics

bowei · 2017-06-20T21:05:57Z

docs/concepts/cluster-administration/controller-metrics.md

+{% capture body %}
+## What are controller metrics
+
+Controller manager metrics provide important insight into performance and health of controller manager.


... health of the controller manager

bowei · 2017-06-20T21:06:21Z

docs/concepts/cluster-administration/controller-metrics.md

+
+Controller manager metrics provide important insight into performance and health of controller manager.
+These metrics include common Go language runtime metrics such as go_routine count and controller specif
+ic metrics such as


specif\nic unintentional newline?

bowei · 2017-06-20T21:06:36Z

docs/concepts/cluster-administration/controller-metrics.md

+Controller manager metrics provide important insight into performance and health of controller manager.
+These metrics include common Go language runtime metrics such as go_routine count and controller specif
+ic metrics such as
+etcd request latencies or cloudprovider (AWS, GCE, Openstack) api latencies that can be used


bowei · 2017-06-20T21:07:13Z

docs/concepts/cluster-administration/controller-metrics.md

+to gauge health of cluster.
+
+Starting from Kubernetes 1.7, detailed Cloudprovider metrics are available for storage operations for GCE, AWS, Vsphere and Openstack.
+These metrics can be used to monitor health of persistent volume operations.


The metrics cover all cloud operations, not just PV?

Metrics are available for all operations only for GCE, not for the other cloud providers.

bowei · 2017-06-20T21:08:17Z

docs/concepts/cluster-administration/controller-metrics.md

+
+## Configuration
+
+Typically in a cluster, controller metrics are available at - `http://localhost:10252/metrics` assuming


Seems simpler to just say:

Controller metrics are available from `http://localhost:10252/metrics` from the host where the controller manager is running.

bowei · 2017-06-20T21:09:04Z

docs/concepts/cluster-administration/controller-metrics.md

+Typically in a cluster, controller metrics are available at - `http://localhost:10252/metrics` assuming
+metrics are being retrieved locally from host where controller manager is running.
+
+The metrics are emitted in prometheus format and are human readable (go ahead curl that url!).


Include a link to the Prometheus format. Remove "(go ahead curl that url!)".

bowei · 2017-06-20T21:09:35Z

docs/concepts/cluster-administration/controller-metrics.md

+
+The metrics are emitted in prometheus format and are human readable (go ahead curl that url!).
+
+In production environment though - you may want to configure prometheus or some other metrics scraper


In a production environment you may ...

bowei · 2017-06-20T21:11:27Z

docs/concepts/cluster-administration/controller-metrics.md

+    protocol: TCP
+```
+
+After that prometheus's service discovery mechanism can automatically discover controller metrics and scrap them periodically as per configuration. Please refer to [Prometheus Configuration](https://prometheus.io/docs/operating/configuration/) for


Is it possible to give a snippet of Prometheus configuration for accessing the service endpoint? I remember it wasn't super obvious from reading the Prometheus docs.

The Prometheus service discovery mechanism can scrape the service endpoint defined above.

It is possible, yes, but it varies between versions. I have run prometheus-1.1 in a pod and it was different from running prometheus-1.7 outside. It also depends on things like whether controller-manager is running in a pod or not. I am a bit hesitant to document something which will soon be out of date and doesn't necessarily align with the release cycles of Kubernetes.

chenopis

Some minor grammar nits.

chenopis · 2017-06-21T16:05:59Z

docs/concepts/cluster-administration/controller-metrics.md

+
+{% capture overview %}
+Controller manager metrics provide important insight into performance and health of
+controller manager.


add "the": "...the controller manager."

chenopis · 2017-06-21T16:07:32Z

docs/concepts/cluster-administration/controller-metrics.md

+---
+
+{% capture overview %}
+Controller manager metrics provide important insight into performance and health of


add "the": "...insight into the performance and health..."

chenopis · 2017-06-21T16:07:54Z

docs/concepts/cluster-administration/controller-metrics.md

+{% capture body %}
+## What are controller manager metrics
+
+Controller manager metrics provide important insight into performance and health of the controller manager.


add "the": "...insight into the performance and health..."

chenopis · 2017-06-21T16:10:04Z

docs/concepts/cluster-administration/controller-metrics.md

+Controller manager metrics provide important insight into performance and health of the controller manager.
+These metrics include common Go language runtime metrics such as go_routine count and controller specific metrics such as
+etcd request latencies or cloudprovider (AWS, GCE, Openstack) API latencies that can be used
+to gauge health of cluster.


add "the" and "a": "...to gauge the health of a cluster."

chenopis · 2017-06-21T16:14:59Z

docs/concepts/cluster-administration/controller-metrics.md

+
+Controller manager metrics provide important insight into performance and health of the controller manager.
+These metrics include common Go language runtime metrics such as go_routine count and controller specific metrics such as
+etcd request latencies or cloudprovider (AWS, GCE, Openstack) API latencies that can be used


For consistency, Cloudprovider should be capitalized here.

chenopis · 2017-06-21T16:15:42Z

docs/concepts/cluster-administration/controller-metrics.md

+Starting from Kubernetes 1.7, detailed Cloudprovider metrics are available for storage operations for GCE, AWS, Vsphere and Openstack.
+These metrics can be used to monitor health of persistent volume operations.
+
+For example for GCE these metrics are called:


add comma: "For example, for GCE..."

chenopis · 2017-06-21T16:16:46Z

docs/concepts/cluster-administration/controller-metrics.md

+
+The metrics are emitted in [prometheus format](https://prometheus.io/docs/instrumenting/exposition_formats/) and are human readable.
+
+In production environment you may want to configure prometheus or some other metrics scraper


add "a" and comma: "In a production environment, you may..."

chenopis · 2017-06-21T16:19:17Z

docs/concepts/cluster-administration/controller-metrics.md

+to periodically gather these metrics and make them available in some kind of time series database.
+
+
+Prometheus itself can gather controller metrics via built-in service discovery mechanism provided


add "its", "that the", and a comma: "...controller metrics via its built-in service discovery mechanism, provided that the controller's metrics..."

chenopis · 2017-06-21T16:21:04Z

docs/concepts/cluster-administration/controller-metrics.md

+    protocol: TCP
+```
+
+After that prometheus's service discovery mechanism can automatically discover controller metrics and scrape them periodically as per configuration. Please refer to [Prometheus Configuration](https://prometheus.io/docs/operating/configuration/) for


add comma and "the": "After that, prometheus's service discovery mechanism can automatically discover controller metrics and scrape them periodically as per the configuration."

gnufied · 2017-06-21T16:38:37Z

@chenopis sorted the requested changes. Thanks for pointing them out!

chenopis · 2017-06-21T22:03:03Z

No worries. Ping me when we're good on the tech side and I'll merge this.

gnufied · 2017-06-22T16:22:00Z

I am hoping we get lgtm from someone in #sig-instrumentation. cc @piosz @brancz

piosz

As I mentioned, I'm not a big fan of documenting a subset of metrics from one component on a separate page, but I can't find a better spot for now. Long term, we should introduce a page about monitoring Kubernetes components.

Please address the issue I raised below. Otherwise LGTM

piosz · 2017-06-26T13:17:32Z

docs/concepts/cluster-administration/controller-metrics.md

+to periodically gather these metrics and make them available in some kind of time series database.
+
+
+Prometheus itself can gather controller metrics via its built-in service discovery mechanism, provided


I'd prefer to keep this monitoring-solution agnostic, so how about skipping the rest of this doc (starting with this line)?

cc @brancz @fabxc

Agreed. I think it would be good to have a document describing metric collection for different mechanisms, but I don't think this document is the place for it.

okay, I dropped that whole section.

chenopis · 2017-06-26T16:20:07Z

@gnufied FYI, all feedback must be addressed and LGTMs given by EOD Tue, June 27th so that this can be merged for the 1.7 release on June 28th.

gnufied · 2017-06-26T17:07:07Z

@chenopis I believe I have addressed comments by @piosz . PTAL

gnufied · 2017-06-26T17:16:11Z

Strange as it may sound, GitHub just ate a commit I pushed to fix the problem. Trying to figure out what's going on.

chenopis · 2017-06-26T18:05:12Z

Oh, GitHub. :(

chenopis · 2017-06-26T18:07:03Z

@gnufied I see the commit now, so is this ready to be merged then?

gnufied · 2017-06-26T18:10:39Z

@chenopis yes, I think GitHub was having issues. It all looks good now.

piosz · 2017-06-26T18:11:23Z

/lgtm

k8s-ci-robot added the cncf-cla: yes label Jun 19, 2017

gnufied force-pushed the controller-metrics branch from ea94f6d to 4bd17b2 Compare June 19, 2017 22:58

k8s-github-robot assigned kelseyhightower and smarterclayton Jun 19, 2017

chenopis assigned chenopis and unassigned kelseyhightower Jun 19, 2017

chenopis requested a review from saad-ali June 19, 2017 23:17

chenopis added Needs Docs Review labels Jun 19, 2017

chenopis added this to the 1.7 milestone Jun 19, 2017

bowei reviewed Jun 20, 2017


gnufied force-pushed the controller-metrics branch from 4bd17b2 to d997764 Compare June 21, 2017 01:36

chenopis added Tech Review: Open Issues and removed Needs Tech Review labels Jun 21, 2017

chenopis suggested changes Jun 21, 2017


chenopis added Docs Review: Open Issues and removed Needs Docs Review labels Jun 21, 2017

gnufied force-pushed the controller-metrics branch from d997764 to cc40119 Compare June 21, 2017 16:38

chenopis approved these changes Jun 21, 2017


chenopis added Docs LGTM and removed Docs Review: Open Issues labels Jun 21, 2017

chenopis requested a review from piosz June 22, 2017 17:27

chenopis added Needs Tech Review and removed Tech Review: Open Issues labels Jun 22, 2017

piosz reviewed Jun 26, 2017


piosz assigned piosz and unassigned smarterclayton Jun 26, 2017

chenopis added Tech Review: Open Issues and removed Needs Tech Review labels Jun 26, 2017

Add documentation about controller metrics

99ff364

gnufied force-pushed the controller-metrics branch 2 times, most recently from 5676dfd to 99ff364 Compare June 26, 2017 17:44

chenopis added Tech Review LGTM and removed Tech Review: Open Issues labels Jun 26, 2017

k8s-ci-robot added the lgtm label Jun 26, 2017

chenopis merged commit 3d41757 into kubernetes:release-1.7 Jun 26, 2017

