
Scrape etcd targets in Prometheus addon #114

Closed
dghubble opened this issue Jan 28, 2018 · 1 comment

dghubble commented Jan 28, 2018

Feature

Configure the prometheus addon to scrape Typhoon etcd targets on controller nodes. Then, metrics from etcd will be available in Prometheus, alert rules for etcd will fire during incidents, and the etcd dashboard provided with the grafana addon will be populated.

Invariants:

  • Users still only need to kubectl apply the addon manifests. Nothing more.
  • Users never need to fiddle with listing etcd nodes on any platform.

Background

The prometheus addon manifests set up Prometheus 2.1 (#113) to scrape apiservers, kubelets, services, endpoints, cAdvisor, and exporters (kube-state-metrics and node_exporter). Alerting rules and Grafana graphs in the addons correspond to these metrics. However, the etcd rules and graphs currently aren't active/populated.

Situation

Prometheus can be configured (via its ConfigMap) to scrape the secured :2379/metrics endpoints of etcd nodes just like any other target. The etcd cluster runs on-host, across controllers, under systemd; it is a lower-level component on which Kubernetes relies (not something running atop Kubernetes), and it already handles its own client authentication.

  • Typhoon runs etcd on-host, across controllers, on all platforms
  • Typhoon requires etcd be set up with TLS on all platforms
  • Typhoon creates etcd client certs, but only places them on controller nodes

To perform the scrapes, Prometheus needs the etcd client certificates in order to write the tls_config section of a new scrape job. A sketch of such a job is below.
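
For illustration only, a minimal sketch of what such a scrape job could look like, assuming the client certs end up mounted at /etc/prometheus/secrets/etcd inside the pod. The cert paths and job name are placeholders, not the actual addon config:

```yaml
# Hypothetical scrape job sketch; cert paths below are assumptions.
- job_name: etcd
  scheme: https
  kubernetes_sd_configs:
    - role: node
  tls_config:
    ca_file: /etc/prometheus/secrets/etcd/ca.crt
    cert_file: /etc/prometheus/secrets/etcd/client.crt
    key_file: /etc/prometheus/secrets/etcd/client.key
  relabel_configs:
    # Scrape the etcd client port instead of the kubelet port.
    - source_labels: [__meta_kubernetes_node_address_InternalIP]
      target_label: __address__
      replacement: '${1}:2379'
```

Node discovery will also return workers, which don't run etcd; filtering those out is the relabel problem discussed in the comment below.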

Options

  • Add etcd client materials in a kube-system secret. We did this back when self-hosted etcd was explored.
    • Pro: Allows prometheus pod to be scheduled on any node
    • Con: Opens up the possibility of escalation attacks (i.e. read kube-system secrets == read everything)
  • Mount the etcd client materials from a controller host. (currently the most viable; see the sketch after this list)
    • Pro: Avoid keeping etcd client materials in a Kubernetes secret
    • Con: Restricts prometheus pod itself to run on controller nodes
  • Explore whether it's possible to create (or invent) "metrics-only" etcd certificates
    • Likely not on the roadmap
  • Metrics whitelist proxy
    • Con: I use some whitelist proxies for some internal things. They're gross though.
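
For the hostPath option, a rough sketch of the relevant parts of the prometheus Deployment pod spec might look like the following. The volume name, mount path, host path, and toleration are assumptions, not the actual addon manifests; the nodeSelector uses the master label mentioned below:

```yaml
# Hypothetical excerpt of the prometheus Deployment pod spec.
spec:
  # Restrict scheduling to controller nodes, where the etcd certs live on-host.
  nodeSelector:
    node-role.kubernetes.io/master: ""
  tolerations:
    # Tolerate the controller taint, if one is applied.
    - key: node-role.kubernetes.io/master
      operator: Exists
      effect: NoSchedule
  containers:
    - name: prometheus
      image: quay.io/prometheus/prometheus:v2.1.0
      volumeMounts:
        - name: etcd-tls
          mountPath: /etc/prometheus/secrets/etcd
          readOnly: true
  volumes:
    - name: etcd-tls
      hostPath:
        # Assumed location of the etcd client certs on controller hosts.
        path: /etc/ssl/etcd
```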

dghubble commented Feb 1, 2018

I have an example Prometheus config for scraping etcd that works alright, but needs some modifications:

  • etcd TLS certs have user:group etcd:etcd on the host, so they aren't readable within the prometheus pod (which runs as nobody) without modification. fsGroup doesn't apply to hostPath volumes.
  • Filtering out workers (etcd always runs on Typhoon controllers) in Prometheus relabel_config is troublesome because Kubernetes controllers and workers are labeled:
    • node-role.kubernetes.io/master=""
    • node-role.kubernetes.io/node=""

Relabel matching doesn't seem able to distinguish between a label being present and absent, since an empty value "" looks the same as the label not being set.

The first issue can be addressed by adapting the etcd TLS ownership. The second can be addressed by adding an additional controller label that has both a key and a non-empty value, or by finding a better Prometheus relabel trick. A sketch of the label-based approach follows.
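
For the second issue, a sketch of relabel rules that would work, assuming controllers gain a label with a non-empty value. The label name here is hypothetical, not something Typhoon sets today:

```yaml
# Hypothetical relabel rules; assumes controllers are labeled
# node.kubernetes.io/controller="true" (an assumed label, not a Typhoon default).
relabel_configs:
  # Keep only nodes carrying the controller label with value "true".
  - source_labels: [__meta_kubernetes_node_label_node_kubernetes_io_controller]
    action: keep
    regex: 'true'
  # Point the scrape address at the etcd client port.
  - source_labels: [__meta_kubernetes_node_address_InternalIP]
    target_label: __address__
    replacement: '${1}:2379'
```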
