Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[VC-34401] Add Prometheus metrics endpoint #271

Merged
merged 8 commits into from
Jun 28, 2024

Conversation

wallrj
Copy link
Member

@wallrj wallrj commented Jun 25, 2024

  • Added Prometheus metrics server to the csi-driver Pods.
  • Updated Helm chart to enable the metrics server by default, but with switch to allow it to be turned off.
  • Updated Helm chart to include optional PodMonitor.

Manual Testing

  • Install csi-driver and cert-manager in a kind cluster
make test-e2e
  • Fetch the metrics
POD_NAME=$(kubectl get pod -n cert-manager -l app=cert-manager-csi-driver -o jsonpath='{ .items[0].metadata.name }')
kubectl get --raw "/api/v1/namespaces/cert-manager/pods/${POD_NAME}:9402/proxy/metrics"
  • Install Prometheus and Grafana
# values.kube-prometheus-stack.yaml
alertmanager:
  enabled: false

grafana:
  enabled: true

nodeExporter:
  enabled: false

# Enable discovery of all ServiceMonitor and PodMonitor resources
# https://github.com/prometheus-community/helm-charts/issues/1911#issuecomment-1106559031
prometheus:
  prometheusSpec:
    serviceMonitorSelectorNilUsesHelmValues: false
    podMonitorSelectorNilUsesHelmValues: false
helm upgrade default kube-prometheus-stack \
      --repo https://prometheus-community.github.io/helm-charts \
      --install \
      --namespace prometheus \
      --create-namespace \
      --values values.kube-prometheus-stack.yaml \
      --wait
  • Redeploy csi-driver and cert-manager with PodMonitor resources
diff --git a/make/test-e2e.mk b/make/test-e2e.mk
index 00657c3..c6d4dce 100644
--- a/make/test-e2e.mk
+++ b/make/test-e2e.mk
@@ -34,6 +34,7 @@ e2e-setup-cert-manager: | kind-cluster $(NEEDS_HELM) $(NEEDS_KUBECTL)
                --set startupapicheck.image.repository=$(quay.io/jetstack/cert-manager-startupapicheck.REPO) \
                --set startupapicheck.image.tag=$(quay.io/jetstack/cert-manager-startupapicheck.TAG) \
                --set startupapicheck.image.pullPolicy=Never \
+               --set prometheus.podmonitor.enabled=true \
                cert-manager cert-manager >/dev/null

 # The "install" target can be run on its own with any currently active cluster,
@@ -46,7 +47,7 @@ endif

 test-e2e-deps: INSTALL_OPTIONS :=
 test-e2e-deps: INSTALL_OPTIONS += --set image.repository=$(oci_manager_image_name_development)
-# test-e2e-deps: INSTALL_OPTIONS += --set metrics.enabled=true
+test-e2e-deps: INSTALL_OPTIONS += --set metrics.podmonitor.enabled=true
 test-e2e-deps: e2e-setup-cert-manager
 test-e2e-deps: install
make test-e2e
  • Connect to Grafana and import dashboards
kubectl port-forward -n prometheus deployments/default-grafana 3000

http://localhost:3000/d/ypFZFgvmz/go-processes

Example Dashboards

image image

@cert-manager-prow cert-manager-prow bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 25, 2024
@cert-manager-prow
Copy link
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@cert-manager-prow cert-manager-prow bot added dco-signoff: yes Indicates that all commits in the pull request have the valid DCO sign-off message. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jun 25, 2024
@cert-manager-prow cert-manager-prow bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jun 25, 2024
@wallrj wallrj force-pushed the prometheus-metrics branch 2 times, most recently from bfdfa3d to 5fa0ccb Compare June 27, 2024 15:51
@cert-manager-prow cert-manager-prow bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 27, 2024
@wallrj wallrj force-pushed the prometheus-metrics branch 2 times, most recently from 2e78578 to e818759 Compare June 27, 2024 16:50
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generated by make generate-helm-docs

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generated by make generate-helm-schema

Copy link
Member Author

@wallrj wallrj Jun 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Introduced errgroup to handle the starting and stopping of the driver and health server in separate go routines.

I was curious to see how other projects coordinate starting and stopping groups of services,
and spotted that kube-state-metrics uses a module called github.com/oklog/run.

I initially used the metrics server from cert-manager (990886f), but decided to switch to use the controller-runtime version, for consistency with the approver-policy project (and presumably other of our controller-runtime based controllers).
In addition to supporting a HTTPS metrics server (which we can introduce in another PR) it also supports built-in (kube-rbac-proxy style) authorization which might also be useful to users in future.
That authorization feature seems to have been sponsored by the cluster-api developers:

See:

The metrics server log message looks like this:

$ kubectl logs -n cert-manager daemonsets/cert-manager-csi-driver  --follow
I0628 09:02:23.915961       1 app.go:68] "Starting driver" logger="main" version={"appVersion":"v0.8.1-31-g9f4f02edd3093b","gitCommit":"9f4f02edd3093b7916cafdc9bf98ab6142d00cf7","goVersion":"go1.22.3","compiler":"gc","platform":"linux/amd64"}
I0628 09:02:24.018050       1 manager.go:291] "Registering existing data directory for management" logger="manager" volume_id="csi-50e9aa907b9e40c8fc241e51f4e3b4428578eeaa09919fc3b3f645b7ade9f8e6" volume="csi-50e9aa907b9e40c8fc241e51f4e3b4428578eeaa09919fc3b3f645b7ade9f8e6"
I0628 09:02:24.018629       1 manager.go:291] "Registering existing data directory for management" logger="manager" volume_id="csi-7cbf3b51fd83b210e352b52ea6f3dd1106214e3743fc34eee04bab4f21363f5e" volume="csi-7cbf3b51fd83b210e352b52ea6f3dd1106214e3743fc34eee04bab4f21363f5e"
I0628 09:02:24.018731       1 manager.go:291] "Registering existing data directory for management" logger="manager" volume_id="csi-c7747fc06bd6935902793638ab8ae8bb326d8b534b6bc71aff18c3e224393e94" volume="csi-c7747fc06bd6935902793638ab8ae8bb326d8b534b6bc71aff18c3e224393e94"
I0628 09:02:24.018804       1 manager.go:291] "Registering existing data directory for management" logger="manager" volume_id="csi-ddda9c94fc93801ddf78d94358448af2e2a96a6df5ff836c19f428a6b3336a7f" volume="csi-ddda9c94fc93801ddf78d94358448af2e2a96a6df5ff836c19f428a6b3336a7f"
I0628 09:02:24.019225       1 app.go:115] "running driver" logger="main"
I0628 09:02:24.019228       1 server.go:208] "Starting metrics server" logger="main.controller-runtime.metrics"
I0628 09:02:24.019705       1 server.go:247] "Serving metrics server" logger="main.controller-runtime.metrics" bindAddress=":9402" secure=false
I0628 09:02:50.168149       1 server.go:254] "Shutting down metrics server with timeout of 1 minute" logger="main.controller-runtime.metrics"
I0628 09:02:50.168161       1 app.go:109] "shutting down driver" logger="main" context="context canceled"

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Setting --metrics-bind-address=0, disables the metrics server.
This is consistent with other kubebuilder projects and with approver-policy:

$ go run ./cmd/ --help
...
App flags:
...
      --metrics-bind-address string   TCP address for exposing HTTP Prometheus metrics which will be served on the HTTP path '/metrics'. The value "0" will disable exposing metrics. (default "0")

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Naming the port is not strictly necessary, but adding it allows the PodMonitor (if enabled) to use the named port "http-metrics" rather than the port number.

Copy link
Member Author

@wallrj wallrj Jun 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The latest thinking is that we only need to provide a PodMonitor, not a ServiceMonitor.

Other cert-manager projects also provide a ServiceMonitor, but we now consider that a legacy.
Disadvantage of ServiceMonitor is that it requires a Service, which adds unnecessary complication to the chart.
And as we understand it, the Service endpoints are unused. PrometheusOperator simply uses the selector of the service to choose which Pods to monitor and then connects to them directly.

The template is copied and adapted from cert-manager:

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIRC with a ServiceMonitor PrometheusOperator uses the Endpoints object created by the Service to discover the targets.

I agree that PodMonitor is less effort as we don't have to create an extra Service

Copy link
Member Author

@wallrj wallrj Jun 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

approver-policy doesn't provide a metrics.enabled flag, but I added on here because it seems more intuitive than setting the port to 0.

Here are the differences this flag brings to the chart:

$ diff -u <(helm template $chart) <(helm template $chart --set metrics.enabled=false)
--- /dev/fd/63  2024-06-28 10:17:20.876514437 +0100
+++ /dev/fd/62  2024-06-28 10:17:20.876514437 +0100
@@ -130,7 +130,7 @@
             - --endpoint=$(CSI_ENDPOINT)
             - --data-root=csi-data-dir
             - --use-token-request=false
-            - --metrics-bind-address=:9402
+            - --metrics-bind-address=0
           env:
             - name: NODE_ID
               valueFrom:
@@ -150,8 +150,6 @@
           ports:
             - containerPort: 9809
               name: healthz
-            - containerPort: 9402
-              name: http-metrics
           livenessProbe:
             httpGet:
               path: /healthz

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I noticed that cluster-api chose to (effectively) disable the metrics service by default, when they removed kube-rbac-proxy authorization, because they considered the metrics to be too security sensitive to make openly available.

They disabled it by binding the metrics server to a loopback address, so that it was only available to other containers in the Pod.

I will write a release note to say that we are enabling the metrics server by default, and explain to users how they can disable it if they are concerned about making metrics available without authorization.

@wallrj wallrj changed the title WIP: [VC-34401] Add Prometheus metrics endpoint [VC-34401] Add Prometheus metrics endpoint Jun 28, 2024
Signed-off-by: Richard Wall <richard.wall@venafi.com>
Signed-off-by: Richard Wall <richard.wall@venafi.com>
Signed-off-by: Richard Wall <richard.wall@venafi.com>
Signed-off-by: Richard Wall <richard.wall@venafi.com>
Signed-off-by: Richard Wall <richard.wall@venafi.com>
Signed-off-by: Richard Wall <richard.wall@venafi.com>
@wallrj wallrj marked this pull request as ready for review June 28, 2024 09:15
@cert-manager-prow cert-manager-prow bot removed do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Jun 28, 2024
Signed-off-by: Richard Wall <richard.wall@venafi.com>
Signed-off-by: Richard Wall <richard.wall@venafi.com>
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if this is a very good test but it helped me verify what the endpoint was serving so I propose leaving it here.
Might prove to be useful for catching if controller-runtime ever change their default metrics and remove the Go and process metrics, which we specifically want to provide.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should I add prometheus.io/scrape annotations to the pods?
So people still discover metrics that way? I couldn't find any good documentation about it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that is necessary, we already have the podAnnotations variable if people want to add it themselves.

@ThatsMrTalbot
Copy link
Contributor

Very well thought out and written PR.

I don't think it should be covered in this PR, but we should look at adding additional metrics to the endpoint for things successful/failed mounts, successful/failed certificatesigningrequests etc. Anything a platform team may find value in alerting on.

/lgtm

@cert-manager-prow cert-manager-prow bot added the lgtm Indicates that a PR is ready to be merged. label Jun 28, 2024
@wallrj
Copy link
Member Author

wallrj commented Jun 28, 2024

Thanks @ThatsMrTalbot

I'll create some followup PRs with those suggested metrics.

/approve

@cert-manager-prow
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ThatsMrTalbot, wallrj

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@cert-manager-prow cert-manager-prow bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 28, 2024
@cert-manager-prow cert-manager-prow bot merged commit 81e52c2 into cert-manager:main Jun 28, 2024
5 checks passed
@wallrj wallrj deleted the prometheus-metrics branch June 30, 2024 11:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. dco-signoff: yes Indicates that all commits in the pull request have the valid DCO sign-off message. lgtm Indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants