
Scrape etcd targets in Prometheus addon #114

Closed
dghubble opened this issue Jan 28, 2018 · 1 comment

dghubble commented Jan 28, 2018

Feature

Configure the prometheus addon to scrape Typhoon etcd targets on controller nodes. Then, metrics from etcd will be available in Prometheus, alert rules for etcd will fire during incidents, and the etcd dashboard provided with the grafana addon will be populated.

Invariants:

  • Users still only need to kubectl apply the addon manifests. Nothing more.
  • Users never need to fiddle with listing etcd nodes on any platform.

Background

The prometheus addon manifests set up Prometheus 2.1 (#113) to scrape apiservers, kubelets, services, endpoints, cAdvisor, and exporters (kube-state-metrics and node_exporter). Alerting rules and Grafana graphs in the addons correspond to these metrics. However, the etcd rules and graphs currently aren't active/populated.

Situation

Prometheus can be configured (via its ConfigMap) to scrape the secured :2379/metrics endpoints of etcd nodes just like any other target. The etcd cluster runs on-host, across controllers, under systemd; it is a lower-level component on which Kubernetes relies (not something running atop Kubernetes), and it already handles its own client authentication.

  • Typhoon runs etcd on-host, across controllers, on all platforms
  • Typhoon requires etcd be set up with TLS on all platforms
  • Typhoon creates etcd client certs, but only places them on controller nodes

To perform the scrapes, Prometheus needs the etcd client certificates in order to write the tls_config section of a new scrape job. A sketch of such a job is below.
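
For illustration only, a minimal sketch of what such a scrape job could look like, assuming the client certs end up mounted at /etc/prometheus/secrets/etcd inside the pod. The cert paths and job name are placeholders, not the actual addon config:

```yaml
# Hypothetical scrape job sketch; cert paths below are assumptions.
- job_name: etcd
  scheme: https
  kubernetes_sd_configs:
    - role: node
  tls_config:
    ca_file: /etc/prometheus/secrets/etcd/ca.crt
    cert_file: /etc/prometheus/secrets/etcd/client.crt
    key_file: /etc/prometheus/secrets/etcd/client.key
  relabel_configs:
    # Scrape the etcd client port instead of the kubelet port.
    - source_labels: [__meta_kubernetes_node_address_InternalIP]
      target_label: __address__
      replacement: '${1}:2379'
```

Node discovery will also return workers, which don't run etcd; filtering those out is the relabel problem discussed in the comment below.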

Options

  • Add etcd client materials in a kube-system secret. We did this back when self-hosted etcd was explored.
    • Pro: Allows prometheus pod to be scheduled on any node
    • Con: Opens up the possibility of escalation attacks (i.e. read kube-system secrets == read everything)
  • Mount the etcd client materials from a controller host. (currently the most viable; see the sketch after this list)
    • Pro: Avoid keeping etcd client materials in a Kubernetes secret
    • Con: Restricts prometheus pod itself to run on controller nodes
  • Explore whether it's possible to create (or invent) "metrics-only" etcd certificates
    • Likely not on the roadmap
  • Metrics whitelist proxy
    • Con: I use some whitelist proxies for some internal things. They're gross though.
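
For the hostPath option, a rough sketch of the relevant parts of the prometheus Deployment pod spec might look like the following. The volume name, mount path, host path, and toleration are assumptions, not the actual addon manifests; the nodeSelector uses the master label mentioned below:

```yaml
# Hypothetical excerpt of the prometheus Deployment pod spec.
spec:
  # Restrict scheduling to controller nodes, where the etcd certs live on-host.
  nodeSelector:
    node-role.kubernetes.io/master: ""
  tolerations:
    # Tolerate the controller taint, if one is applied.
    - key: node-role.kubernetes.io/master
      operator: Exists
      effect: NoSchedule
  containers:
    - name: prometheus
      image: quay.io/prometheus/prometheus:v2.1.0
      volumeMounts:
        - name: etcd-tls
          mountPath: /etc/prometheus/secrets/etcd
          readOnly: true
  volumes:
    - name: etcd-tls
      hostPath:
        # Assumed location of the etcd client certs on controller hosts.
        path: /etc/ssl/etcd
```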

dghubble commented Feb 1, 2018

I have an example Prometheus config for scraping etcd that works alright, but needs some modifications:

  • etcd TLS certs have user:group etcd:etcd on the host, so they aren't readable within the prometheus pod (which runs as nobody) without modification. fsGroup doesn't apply to hostPath volumes.
  • Filtering out workers (etcd always runs on Typhoon controllers) in Prometheus relabel_config is troublesome because Kubernetes controllers and workers are labeled:
    • node-role.kubernetes.io/master=""
    • node-role.kubernetes.io/node=""

Relabel matching doesn't seem able to distinguish between a label being present and absent, since an empty value "" looks the same as the label not being set.

The first issue can be addressed by adapting the etcd TLS ownership. The second can be addressed by adding an additional controller label that has both a key and a non-empty value, or by finding a better Prometheus relabel trick. A sketch of the label-based approach follows.
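
For the second issue, a sketch of relabel rules that would work, assuming controllers gain a label with a non-empty value. The label name here is hypothetical, not something Typhoon sets today:

```yaml
# Hypothetical relabel rules; assumes controllers are labeled
# node.kubernetes.io/controller="true" (an assumed label, not a Typhoon default).
relabel_configs:
  # Keep only nodes carrying the controller label with value "true".
  - source_labels: [__meta_kubernetes_node_label_node_kubernetes_io_controller]
    action: keep
    regex: 'true'
  # Point the scrape address at the etcd client port.
  - source_labels: [__meta_kubernetes_node_address_InternalIP]
    target_label: __address__
    replacement: '${1}:2379'
```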
