Secure diagnostics (metrics, pprof, log level changes) #9289

sbueringer · 2023-08-23T13:16:34Z

What would you like to be added (User Story)?

As an operator I would like to be able to safely scrape metrics from Cluster API controllers with minimal effort.

Detailed Description

Today, Cluster API only provides the --metrics-bind-addr flag to configure the metrics endpoint. The endpoint always uses http and has no authentication or authorization. Because of security concerns, the flag currently defaults to localhost:8080. This means that the metrics are only reachable from localhost, which makes them hard to scrape, e.g. via Prometheus. Folks can set the flag to 0.0.0.0:8080, but then anyone who can reach the endpoint can read the metrics. We don't have any secrets in our metrics, but exposing them was still considered too unsafe to be the default.

Controller-runtime implemented a new feature with v0.16 which makes it easy to provide a secure endpoint for metrics which uses https and provides authentication and authorization (kubernetes-sigs/controller-runtime#2407).

At a high level, we can now expose metrics the same way as core Kubernetes controllers (https://kubernetes.io/docs/concepts/cluster-administration/system-metrics/#metrics-in-kubernetes).

To scrape metrics from the secured endpoint, the scraper (e.g. Prometheus) would now need a ClusterRole like the following:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus
rules:
  - nonResourceURLs:
      - "/metrics"
    verbs:
      - get

Note: A ClusterRole like this is already deployed by default in the Prometheus Helm chart, so that Prometheus is able to scrape metrics from core Kubernetes components. The only thing folks should need on their side when scraping metrics from Cluster API controllers is this config: https://github.com/sbueringer/cluster-api/blob/8a2de8c0060d2dc5169d3ebb86dc5605bc856492/hack/observability/prometheus/values.yaml#L31-L33. Everything else should work out of the box.
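
For anyone scraping without Prometheus, here is a minimal Go sketch of what a client request against the secured endpoint could look like. It assumes the endpoint accepts a ServiceAccount bearer token in the Authorization header, as core Kubernetes component metrics endpoints do. The function name newScrapeRequest and the inputs baseURL and token are hypothetical and not part of Cluster API.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// newScrapeRequest builds a GET request for the /metrics path of a secured
// metrics endpoint. baseURL and token are hypothetical inputs: inside a Pod,
// the token would typically come from the mounted ServiceAccount token file.
func newScrapeRequest(baseURL, token string) (*http.Request, error) {
	// The secured endpoint is served via https only.
	if !strings.HasPrefix(baseURL, "https://") {
		return nil, fmt.Errorf("secured endpoint must be reached via https, got %q", baseURL)
	}
	// Without a token the request cannot be authenticated.
	if token == "" {
		return nil, fmt.Errorf("a bearer token is required")
	}
	req, err := http.NewRequest(http.MethodGet, strings.TrimSuffix(baseURL, "/")+"/metrics", nil)
	if err != nil {
		return nil, err
	}
	// Assumption: the server authenticates this token and then checks that the
	// caller is allowed to "get" the non-resource URL /metrics.
	req.Header.Set("Authorization", "Bearer "+token)
	return req, nil
}
```

The returned request can be sent with any http.Client that trusts the serving certificate of the controller. The identity behind the token needs to be bound to a ClusterRole like the one above.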

For folks who still remember, this is basically a subset of the functionality of kube-rbac-proxy that we used in the past.

Notes:

- It will be possible to disable this new behavior, but I would like to make it the default (flag details TBD).
- A very first implementation can be seen here: ⚠️ Implement secure diagnostics (metrics, pprof, log level changes) #9264. It doesn't contain the flags yet, and it also contains a way to dynamically change log levels, which I'll remove from that PR and address in a follow-up shortly.
- We should document in the book how to scrape metrics from CAPI controllers.
- We should also recommend that providers implement this as well.

Anything else you would like to add?

No response

Label(s) to be applied

/kind feature
One or more /area label. See https://github.com/kubernetes-sigs/cluster-api/labels?q=area for the list of labels.


sbueringer · 2023-08-23T13:23:23Z

cc @chrischdi @killianmuldoon @fabriziopandini

I'll bring this up today in the office hours

/triage accepted

fabriziopandini · 2023-09-06T14:41:27Z

huge +1 from me, thanks for bringing this up

sbueringer · 2023-09-08T06:27:45Z

After some experimentation I would propose the following:

- deprecate the --metrics-bind-addr flag
- add a new flag --diagnostics-address
- add a new flag --insecure-diagnostics

With the following behavior:

If --metrics-bind-addr is set:
- metrics are served on http without authentication/authorization (as today)
If --diagnostics-address=<addr> and --insecure-diagnostics are set:
- same behavior as with --metrics-bind-addr
If only --diagnostics-address is set:
- metrics are served on https with authentication/authorization
- in addition, pprof endpoints and an endpoint to change the log level are served (also protected)

This should allow a smooth transition, and it makes it possible to easily and securely expose metrics in production. Additionally, the pprof endpoints can now always be enabled, and log levels can be changed dynamically.

This should make it a lot easier to debug Cluster API in production.
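
The flag behavior above can be sketched as a small decision function. This is only an illustration, not the implementation in #9264: the types diagnosticsFlags and servingConfig and the function resolveServing are hypothetical names. Checking the flags in the order listed above, and returning an error when no address is set at all, are assumptions on my side.

```go
package main

import "errors"

// diagnosticsFlags mirrors the three flags from the proposal.
type diagnosticsFlags struct {
	MetricsBindAddr     string // deprecated --metrics-bind-addr
	DiagnosticsAddress  string // --diagnostics-address
	InsecureDiagnostics bool   // --insecure-diagnostics
}

// servingConfig is a hypothetical summary of how the endpoint is served.
type servingConfig struct {
	BindAddress   string
	Secure        bool // https with authentication/authorization
	ExtraHandlers bool // pprof and log level endpoints
}

// resolveServing applies the three cases from the proposal in the order
// in which they are listed (that order is an assumption of this sketch).
func resolveServing(f diagnosticsFlags) (servingConfig, error) {
	switch {
	case f.MetricsBindAddr != "":
		// Deprecated flag: plain http, no authentication/authorization.
		return servingConfig{BindAddress: f.MetricsBindAddr}, nil
	case f.DiagnosticsAddress == "":
		// Assumption: with no address at all there is nothing to serve.
		return servingConfig{}, errors.New("no diagnostics address configured")
	case f.InsecureDiagnostics:
		// Same behavior as --metrics-bind-addr, on the new address.
		return servingConfig{BindAddress: f.DiagnosticsAddress}, nil
	default:
		// Secure serving, plus the protected pprof and log level endpoints.
		return servingConfig{
			BindAddress:   f.DiagnosticsAddress,
			Secure:        true,
			ExtraHandlers: true,
		}, nil
	}
}
```

For example, passing only DiagnosticsAddress yields a config with Secure and ExtraHandlers set, while additionally setting InsecureDiagnostics yields the same plain http config that the deprecated flag produces.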

k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Aug 23, 2023

sbueringer added the area/metrics Issues or PRs related to metrics label Aug 23, 2023

sbueringer self-assigned this Aug 23, 2023

k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Aug 23, 2023

sbueringer mentioned this issue Aug 25, 2023

🐛 Adding metrics container port in tilt-prepare only if it's missing #9308

Merged

sbueringer mentioned this issue Sep 8, 2023

⚠️ Implement secure diagnostics (metrics, pprof, log level changes) #9264

Merged

3 tasks

sbueringer changed the title ~~Secure metrics serving~~ Secure diagnostics (metrics, pprof, log level changes) Sep 8, 2023

k8s-ci-robot closed this as completed in #9264 Sep 15, 2023
