[VC-34401] Add Prometheus metrics endpoint #271

wallrj · 2024-06-25T12:54:38Z

Added a Prometheus metrics server to the csi-driver Pods.
Updated the Helm chart to enable the metrics server by default, with a switch that allows it to be turned off.
Updated the Helm chart to include an optional PodMonitor.

Manual Testing

Install csi-driver and cert-manager in a kind cluster

make test-e2e

Fetch the metrics

POD_NAME=$(kubectl get pod -n cert-manager -l app=cert-manager-csi-driver -o jsonpath='{ .items[0].metadata.name }')
kubectl get --raw "/api/v1/namespaces/cert-manager/pods/${POD_NAME}:9402/proxy/metrics"
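
For scripting the same check, the API server proxy path used above can be assembled programmatically. This is only a sketch: the helper name `metricsProxyPath` and the Pod name in `main` are hypothetical and not part of this PR.

```go
package main

import "fmt"

// metricsProxyPath builds the API server proxy path that reaches the
// /metrics endpoint of a Pod on the given container port. The helper is
// hypothetical; it only mirrors the path used in the kubectl command above.
func metricsProxyPath(namespace, pod string, port int) string {
	return fmt.Sprintf("/api/v1/namespaces/%s/pods/%s:%d/proxy/metrics", namespace, pod, port)
}

func main() {
	// The Pod name here is a made-up placeholder.
	fmt.Println(metricsProxyPath("cert-manager", "cert-manager-csi-driver-abcde", 9402))
}
```

The resulting string can be passed to `kubectl get --raw` exactly as in the manual step.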

Install Prometheus and Grafana

# values.kube-prometheus-stack.yaml
alertmanager:
  enabled: false

grafana:
  enabled: true

nodeExporter:
  enabled: false

# Enable discovery of all ServiceMonitor and PodMonitor resources
# https://github.com/prometheus-community/helm-charts/issues/1911#issuecomment-1106559031
prometheus:
  prometheusSpec:
    serviceMonitorSelectorNilUsesHelmValues: false
    podMonitorSelectorNilUsesHelmValues: false

helm upgrade default kube-prometheus-stack \
      --repo https://prometheus-community.github.io/helm-charts \
      --install \
      --namespace prometheus \
      --create-namespace \
      --values values.kube-prometheus-stack.yaml \
      --wait

Redeploy csi-driver and cert-manager with PodMonitor resources

diff --git a/make/test-e2e.mk b/make/test-e2e.mk
index 00657c3..c6d4dce 100644
--- a/make/test-e2e.mk
+++ b/make/test-e2e.mk
@@ -34,6 +34,7 @@ e2e-setup-cert-manager: | kind-cluster $(NEEDS_HELM) $(NEEDS_KUBECTL)
                --set startupapicheck.image.repository=$(quay.io/jetstack/cert-manager-startupapicheck.REPO) \
                --set startupapicheck.image.tag=$(quay.io/jetstack/cert-manager-startupapicheck.TAG) \
                --set startupapicheck.image.pullPolicy=Never \
+               --set prometheus.podmonitor.enabled=true \
                cert-manager cert-manager >/dev/null

 # The "install" target can be run on its own with any currently active cluster,
@@ -46,7 +47,7 @@ endif

 test-e2e-deps: INSTALL_OPTIONS :=
 test-e2e-deps: INSTALL_OPTIONS += --set image.repository=$(oci_manager_image_name_development)
-# test-e2e-deps: INSTALL_OPTIONS += --set metrics.enabled=true
+test-e2e-deps: INSTALL_OPTIONS += --set metrics.podmonitor.enabled=true
 test-e2e-deps: e2e-setup-cert-manager
 test-e2e-deps: install

make test-e2e

Connect to Grafana and import dashboards

kubectl port-forward -n prometheus deployments/default-grafana 3000

http://localhost:3000/d/ypFZFgvmz/go-processes

Example Dashboards

https://raw.githubusercontent.com/kubernetes-sigs/cluster-api/main/hack/observability/grafana/dashboards/controller-runtime.json

https://grafana.com/grafana/dashboards/6671-go-processes/

cert-manager-prow · 2024-06-25T12:54:41Z

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

wallrj · 2024-06-28T06:28:51Z

deploy/charts/csi-driver/README.md

Generated by make generate-helm-docs

wallrj · 2024-06-28T06:29:13Z

deploy/charts/csi-driver/values.schema.json

Generated by make generate-helm-schema

wallrj · 2024-06-28T08:34:53Z

cmd/app/app.go

Introduced errgroup to handle the starting and stopping of the driver and health server in separate goroutines.

I was curious to see how other projects coordinate starting and stopping groups of services,
and spotted that kube-state-metrics uses a module called github.com/oklog/run.

I initially used the metrics server from cert-manager (990886f), but decided to switch to the controller-runtime version, for consistency with the approver-policy project (and presumably others of our controller-runtime based controllers).
In addition to supporting an HTTPS metrics server (which we can introduce in another PR), it also supports built-in (kube-rbac-proxy style) authorization, which might also be useful to users in future.
That authorization feature seems to have been sponsored by the cluster-api developers:

https://cluster-api.sigs.k8s.io/tasks/diagnostics#scraping-metrics

Add authorization for metrics endpoint kubernetes-sigs/controller-runtime#2073

See:

controller-runtime MetricsServer implementation: https://github.com/kubernetes-sigs/controller-runtime/blob/main/pkg/metrics/server/server.go#L116

controller-runtime default metrics collectors: https://github.com/kubernetes-sigs/controller-runtime/blob/700befecdffa803d19830a6a43adc5779ed01e26/pkg/internal/controller/metrics/metrics.go#L73-L86

Create a separate registry for work-queue metrics kubernetes-sigs/controller-runtime#2670

The metrics server log message looks like this:

$ kubectl logs -n cert-manager daemonsets/cert-manager-csi-driver --follow
I0628 09:02:23.915961 1 app.go:68] "Starting driver" logger="main" version={"appVersion":"v0.8.1-31-g9f4f02edd3093b","gitCommit":"9f4f02edd3093b7916cafdc9bf98ab6142d00cf7","goVersion":"go1.22.3","compiler":"gc","platform":"linux/amd64"}
I0628 09:02:24.018050 1 manager.go:291] "Registering existing data directory for management" logger="manager" volume_id="csi-50e9aa907b9e40c8fc241e51f4e3b4428578eeaa09919fc3b3f645b7ade9f8e6" volume="csi-50e9aa907b9e40c8fc241e51f4e3b4428578eeaa09919fc3b3f645b7ade9f8e6"
I0628 09:02:24.018629 1 manager.go:291] "Registering existing data directory for management" logger="manager" volume_id="csi-7cbf3b51fd83b210e352b52ea6f3dd1106214e3743fc34eee04bab4f21363f5e" volume="csi-7cbf3b51fd83b210e352b52ea6f3dd1106214e3743fc34eee04bab4f21363f5e"
I0628 09:02:24.018731 1 manager.go:291] "Registering existing data directory for management" logger="manager" volume_id="csi-c7747fc06bd6935902793638ab8ae8bb326d8b534b6bc71aff18c3e224393e94" volume="csi-c7747fc06bd6935902793638ab8ae8bb326d8b534b6bc71aff18c3e224393e94"
I0628 09:02:24.018804 1 manager.go:291] "Registering existing data directory for management" logger="manager" volume_id="csi-ddda9c94fc93801ddf78d94358448af2e2a96a6df5ff836c19f428a6b3336a7f" volume="csi-ddda9c94fc93801ddf78d94358448af2e2a96a6df5ff836c19f428a6b3336a7f"
I0628 09:02:24.019225 1 app.go:115] "running driver" logger="main"
I0628 09:02:24.019228 1 server.go:208] "Starting metrics server" logger="main.controller-runtime.metrics"
I0628 09:02:24.019705 1 server.go:247] "Serving metrics server" logger="main.controller-runtime.metrics" bindAddress=":9402" secure=false
I0628 09:02:50.168149 1 server.go:254] "Shutting down metrics server with timeout of 1 minute" logger="main.controller-runtime.metrics"
I0628 09:02:50.168161 1 app.go:109] "shutting down driver" logger="main" context="context canceled"

wallrj · 2024-06-28T08:43:29Z

cmd/app/options/options.go

Setting --metrics-bind-address=0 disables the metrics server.
This is consistent with other kubebuilder projects and with approver-policy:

https://book.kubebuilder.io/reference/metrics#enabling-the-metrics

https://github.com/cert-manager/approver-policy/blob/1cbe92ce00f4584cd159ca52b1b183bf064e6da3/pkg/internal/cmd/options/options.go#L197-L199

$ go run ./cmd/ --help
...
App flags:
...
      --metrics-bind-address string   TCP address for exposing HTTP Prometheus metrics which will be served on the HTTP path '/metrics'. The value "0" will disable exposing metrics. (default "0")
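
The "0 means disabled" convention can be illustrated with a small sketch. It uses the standard library `flag` package purely for illustration (the real options code may use a different flag library), and `parseMetricsBindAddress` and `metricsEnabled` are hypothetical helper names.

```go
package main

import (
	"flag"
	"fmt"
)

// metricsEnabled reports whether a metrics server should be started for the
// given bind address. Following the convention above, "0" disables it.
func metricsEnabled(bindAddress string) bool {
	return bindAddress != "0"
}

// parseMetricsBindAddress parses args using a throwaway FlagSet that defines
// only the --metrics-bind-address flag, defaulting to "0" as in the help text.
func parseMetricsBindAddress(args []string) (string, error) {
	fs := flag.NewFlagSet("csi-driver-sketch", flag.ContinueOnError)
	addr := fs.String("metrics-bind-address", "0",
		`TCP address for exposing Prometheus metrics; "0" disables the server`)
	if err := fs.Parse(args); err != nil {
		return "", err
	}
	return *addr, nil
}

func main() {
	cases := [][]string{
		{},
		{"--metrics-bind-address=:9402"},
		{"--metrics-bind-address=0"},
	}
	for _, args := range cases {
		addr, err := parseMetricsBindAddress(args)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("address=%q enabled=%t\n", addr, metricsEnabled(addr))
	}
}
```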

wallrj · 2024-06-28T08:44:43Z

deploy/charts/csi-driver/templates/daemonset.yaml

Naming the port is not strictly necessary, but adding it allows the PodMonitor (if enabled) to use the named port "http-metrics" rather than the port number.

misleading comments in container.Ports kubernetes/kubernetes#108255

https://kubernetes.io/docs/reference/kubernetes-api/workload-resources/pod-v1/#ports

wallrj · 2024-06-28T08:48:09Z

deploy/charts/csi-driver/templates/podmonitor.yaml

The latest thinking is that we only need to provide a PodMonitor, not a ServiceMonitor.

Other cert-manager projects also provide a ServiceMonitor, but we now consider that legacy.
The disadvantage of a ServiceMonitor is that it requires a Service, which adds unnecessary complication to the chart.
And as we understand it, the Service endpoints are unused. PrometheusOperator simply uses the selector of the service to choose which Pods to monitor and then connects to them directly.

The template is copied and adapted from cert-manager:

https://github.com/cert-manager/cert-manager/blob/master/deploy/charts/cert-manager/templates/podmonitor.yaml

IIRC with a ServiceMonitor PrometheusOperator uses the Endpoints object created by the Service to discover the targets.

I agree that PodMonitor is less effort as we don't have to create an extra Service

wallrj · 2024-06-28T09:08:48Z

deploy/charts/csi-driver/values.yaml

approver-policy doesn't provide a metrics.enabled flag, but I added one here because it seems more intuitive than setting the port to 0.

https://github.com/cert-manager/approver-policy/blob/1cbe92ce00f4584cd159ca52b1b183bf064e6da3/deploy/charts/approver-policy/values.yaml#L78-L80

Here are the differences this flag brings to the chart:

$ diff -u <(helm template $chart) <(helm template $chart --set metrics.enabled=false)
--- /dev/fd/63 2024-06-28 10:17:20.876514437 +0100
+++ /dev/fd/62 2024-06-28 10:17:20.876514437 +0100
@@ -130,7 +130,7 @@
             - --endpoint=$(CSI_ENDPOINT)
             - --data-root=csi-data-dir
             - --use-token-request=false
-            - --metrics-bind-address=:9402
+            - --metrics-bind-address=0
           env:
           - name: NODE_ID
             valueFrom:
@@ -150,8 +150,6 @@
           ports:
           - containerPort: 9809
             name: healthz
-          - containerPort: 9402
-            name: http-metrics
           livenessProbe:
             httpGet:
               path: /healthz

I noticed that cluster-api chose to (effectively) disable the metrics service by default, when they removed kube-rbac-proxy authorization, because they considered the metrics to be too security sensitive to make openly available.

⚠️ Remove kube-rbac-proxy and expose metrics on localhost:8080 kubernetes-sigs/cluster-api#4640

They disabled it by binding the metrics server to a loopback address, so that it was only available to other containers in the Pod.
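
To make the loopback distinction concrete, here is a minimal sketch that classifies a bind address as loopback-only or not. The helper `isLoopbackOnly` is hypothetical and is not part of csi-driver or cluster-api.

```go
package main

import (
	"fmt"
	"net"
)

// isLoopbackOnly reports whether a bind address such as "127.0.0.1:8080"
// restricts the listener to the loopback interface, which makes it reachable
// only from containers sharing the Pod's network namespace. An empty host,
// as in ":9402", listens on all interfaces and is therefore not loopback-only.
func isLoopbackOnly(bindAddress string) (bool, error) {
	host, _, err := net.SplitHostPort(bindAddress)
	if err != nil {
		return false, err
	}
	if host == "localhost" {
		return true, nil
	}
	ip := net.ParseIP(host)
	return ip != nil && ip.IsLoopback(), nil
}

func main() {
	for _, addr := range []string{"localhost:8080", "127.0.0.1:8080", ":9402", "0.0.0.0:9402"} {
		ok, err := isLoopbackOnly(addr)
		fmt.Println(addr, ok, err)
	}
}
```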

I will write a release note to say that we are enabling the metrics server by default, and explain to users how they can disable it if they are concerned about making metrics available without authorization.

Signed-off-by: Richard Wall <richard.wall@venafi.com>

wallrj · 2024-06-28T11:12:13Z

test/e2e/suite/cases/metrics.go

I'm not sure this is a very good test, but it helped me verify what the endpoint was serving, so I propose leaving it here.
It might prove useful for catching the case where controller-runtime changes its default metrics and removes the Go and process metrics, which we specifically want to provide.
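
The kind of check described above can be sketched as follows. This is not the code in test/e2e/suite/cases/metrics.go: the helper `hasMetricWithPrefix` and the sample payload are hypothetical, and the sketch only assumes that the endpoint serves the Prometheus text exposition format.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// hasMetricWithPrefix scans Prometheus text exposition output and reports
// whether any sample line has a metric name starting with prefix.
// Comment lines (# HELP, # TYPE) and blank lines are skipped.
func hasMetricWithPrefix(exposition, prefix string) bool {
	scanner := bufio.NewScanner(strings.NewReader(exposition))
	for scanner.Scan() {
		line := strings.TrimSpace(scanner.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		if strings.HasPrefix(line, prefix) {
			return true
		}
	}
	return false
}

func main() {
	// A made-up payload standing in for the response from /metrics.
	sample := strings.Join([]string{
		"# TYPE go_goroutines gauge",
		"go_goroutines 12",
		"# TYPE process_cpu_seconds_total counter",
		"process_cpu_seconds_total 0.5",
	}, "\n")
	for _, prefix := range []string{"go_", "process_"} {
		fmt.Println(prefix, hasMetricWithPrefix(sample, prefix))
	}
}
```

A test built this way fails if either family of metrics disappears from the endpoint.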

wallrj · 2024-06-28T11:29:11Z

deploy/charts/csi-driver/templates/daemonset.yaml

Should I add prometheus.io/scrape annotations to the Pods?
Do people still discover metrics that way? I couldn't find any good documentation about it.

I don't think that is necessary, we already have the podAnnotations variable if people want to add it themselves.

ThatsMrTalbot · 2024-06-28T12:37:48Z

Very well thought out and written PR.

I don't think it should be covered in this PR, but we should look at adding additional metrics to the endpoint for things like successful/failed mounts, successful/failed certificatesigningrequests, etc. Anything a platform team may find value in alerting on.

/lgtm

wallrj · 2024-06-28T13:43:16Z

Thanks @ThatsMrTalbot

I'll create some followup PRs with those suggested metrics.

/approve

cert-manager-prow · 2024-06-28T13:43:21Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ThatsMrTalbot, wallrj

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [wallrj]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

cert-manager-prow bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 25, 2024

cert-manager-prow bot added dco-signoff: yes Indicates that all commits in the pull request have the valid DCO sign-off message. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jun 25, 2024

wallrj force-pushed the prometheus-metrics branch from ade19a2 to d1975e9 Compare June 25, 2024 17:04

cert-manager-prow bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jun 25, 2024

wallrj force-pushed the prometheus-metrics branch 2 times, most recently from bfdfa3d to 5fa0ccb Compare June 27, 2024 15:51

cert-manager-prow bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 27, 2024

wallrj force-pushed the prometheus-metrics branch 2 times, most recently from 2e78578 to e818759 Compare June 27, 2024 16:50

wallrj changed the title ~~WIP: [VC-34401] Add Prometheus metrics endpoint~~ [VC-34401] Add Prometheus metrics endpoint Jun 28, 2024

wallrj added 6 commits June 28, 2024 10:14

E2E test for go_collector and process_collector metrics

9ac6b38

Signed-off-by: Richard Wall <richard.wall@venafi.com>

Add Go and Process metrics

1f6d79f

Signed-off-by: Richard Wall <richard.wall@venafi.com>

Update the Helm chart with metrics settings

273dd1e

Signed-off-by: Richard Wall <richard.wall@venafi.com>

Use controller-runtime metrics server

5413205

Signed-off-by: Richard Wall <richard.wall@venafi.com>

Enable podmonitor

5b7d752

Signed-off-by: Richard Wall <richard.wall@venafi.com>

Self review changes

ab3a39d

Signed-off-by: Richard Wall <richard.wall@venafi.com>

wallrj force-pushed the prometheus-metrics branch from 9a66116 to ab3a39d Compare June 28, 2024 09:15

wallrj marked this pull request as ready for review June 28, 2024 09:15

cert-manager-prow bot removed do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Jun 28, 2024

wallrj added 2 commits June 28, 2024 11:23

Fix import order

3f82ded

Signed-off-by: Richard Wall <richard.wall@venafi.com>

Simplify tests

7e0684c

Signed-off-by: Richard Wall <richard.wall@venafi.com>

wallrj requested a review from ThatsMrTalbot June 28, 2024 11:12

cert-manager-prow bot added the lgtm Indicates that a PR is ready to be merged. label Jun 28, 2024

ThatsMrTalbot approved these changes Jun 28, 2024

cert-manager-prow bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 28, 2024

cert-manager-prow bot merged commit 81e52c2 into cert-manager:main Jun 28, 2024
5 checks passed

wallrj mentioned this pull request Jun 28, 2024

[VC-34401] Add metrics settings to the Helm chart jetstack/jetstack-secure#544

Merged

wallrj deleted the prometheus-metrics branch June 30, 2024 11:18
