[VC-34401] Add Prometheus metrics endpoint #271
Conversation
Generated by make generate-helm-docs
Generated by make generate-helm-schema
Introduced errgroup to handle starting and stopping the driver and health server in separate goroutines.
I was curious to see how other projects coordinate starting and stopping groups of services,
and spotted that kube-state-metrics uses a module called github.com/oklog/run.
I initially used the metrics server from cert-manager (990886f), but decided to switch to the controller-runtime version, for consistency with the approver-policy project (and presumably our other controller-runtime-based controllers).
In addition to supporting an HTTPS metrics server (which we can introduce in another PR), it also supports built-in (kube-rbac-proxy style) authorization, which might also be useful to users in the future.
That authorization feature seems to have been sponsored by the cluster-api developers:
- https://cluster-api.sigs.k8s.io/tasks/diagnostics#scraping-metrics
- Add authorization for metrics endpoint kubernetes-sigs/controller-runtime#2073
See:
- controller-runtime MetricsServer implementation: https://github.com/kubernetes-sigs/controller-runtime/blob/main/pkg/metrics/server/server.go#L116
- controller-runtime default metrics collectors: https://github.com/kubernetes-sigs/controller-runtime/blob/700befecdffa803d19830a6a43adc5779ed01e26/pkg/internal/controller/metrics/metrics.go#L73-L86
- Create a separate registry for work-queue metrics kubernetes-sigs/controller-runtime#2670
The metrics server log message looks like this:
$ kubectl logs -n cert-manager daemonsets/cert-manager-csi-driver --follow
I0628 09:02:23.915961 1 app.go:68] "Starting driver" logger="main" version={"appVersion":"v0.8.1-31-g9f4f02edd3093b","gitCommit":"9f4f02edd3093b7916cafdc9bf98ab6142d00cf7","goVersion":"go1.22.3","compiler":"gc","platform":"linux/amd64"}
I0628 09:02:24.018050 1 manager.go:291] "Registering existing data directory for management" logger="manager" volume_id="csi-50e9aa907b9e40c8fc241e51f4e3b4428578eeaa09919fc3b3f645b7ade9f8e6" volume="csi-50e9aa907b9e40c8fc241e51f4e3b4428578eeaa09919fc3b3f645b7ade9f8e6"
I0628 09:02:24.018629 1 manager.go:291] "Registering existing data directory for management" logger="manager" volume_id="csi-7cbf3b51fd83b210e352b52ea6f3dd1106214e3743fc34eee04bab4f21363f5e" volume="csi-7cbf3b51fd83b210e352b52ea6f3dd1106214e3743fc34eee04bab4f21363f5e"
I0628 09:02:24.018731 1 manager.go:291] "Registering existing data directory for management" logger="manager" volume_id="csi-c7747fc06bd6935902793638ab8ae8bb326d8b534b6bc71aff18c3e224393e94" volume="csi-c7747fc06bd6935902793638ab8ae8bb326d8b534b6bc71aff18c3e224393e94"
I0628 09:02:24.018804 1 manager.go:291] "Registering existing data directory for management" logger="manager" volume_id="csi-ddda9c94fc93801ddf78d94358448af2e2a96a6df5ff836c19f428a6b3336a7f" volume="csi-ddda9c94fc93801ddf78d94358448af2e2a96a6df5ff836c19f428a6b3336a7f"
I0628 09:02:24.019225 1 app.go:115] "running driver" logger="main"
I0628 09:02:24.019228 1 server.go:208] "Starting metrics server" logger="main.controller-runtime.metrics"
I0628 09:02:24.019705 1 server.go:247] "Serving metrics server" logger="main.controller-runtime.metrics" bindAddress=":9402" secure=false
I0628 09:02:50.168149 1 server.go:254] "Shutting down metrics server with timeout of 1 minute" logger="main.controller-runtime.metrics"
I0628 09:02:50.168161 1 app.go:109] "shutting down driver" logger="main" context="context canceled"
Setting --metrics-bind-address=0 disables the metrics server.
This is consistent with other kubebuilder projects and with approver-policy:
- https://book.kubebuilder.io/reference/metrics#enabling-the-metrics
- https://github.com/cert-manager/approver-policy/blob/1cbe92ce00f4584cd159ca52b1b183bf064e6da3/pkg/internal/cmd/options/options.go#L197-L199
$ go run ./cmd/ --help
...
App flags:
...
--metrics-bind-address string TCP address for exposing HTTP Prometheus metrics which will be served on the HTTP path '/metrics'. The value "0" will disable exposing metrics. (default "0")
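The flag convention can be sketched with the standard flag package. The flag name and default mirror the help text above; `metricsEnabled` is a hypothetical helper, not the PR's actual code:

```go
package main

import (
	"flag"
	"fmt"
)

// metricsEnabled implements the convention described above:
// the flag defaults to "0", and "0" means the metrics server is disabled.
func metricsEnabled(bindAddress string) bool {
	return bindAddress != "0"
}

func main() {
	fs := flag.NewFlagSet("app", flag.ExitOnError)
	addr := fs.String("metrics-bind-address", "0",
		"TCP address for exposing HTTP Prometheus metrics on '/metrics'. \"0\" disables metrics.")
	fs.Parse([]string{"--metrics-bind-address=:9402"})
	fmt.Println(metricsEnabled(*addr)) // true: ":9402" enables the server
}
```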
Naming the port is not strictly necessary, but adding it allows the PodMonitor (if enabled) to use the named port "http-metrics" rather than the port number.
The latest thinking is that we only need to provide a PodMonitor, not a ServiceMonitor.
Other cert-manager projects also provide a ServiceMonitor, but we now consider that a legacy approach.
The disadvantage of a ServiceMonitor is that it requires a Service, which adds unnecessary complication to the chart.
And as we understand it, the Service endpoints are unused: Prometheus Operator simply uses the selector of the Service to choose which Pods to monitor and then connects to them directly.
The template is copied and adapted from cert-manager:
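For illustration, a PodMonitor along those lines might look roughly like the sketch below. The names and labels are hypothetical, not the chart's actual template; the named port matches the "http-metrics" container port discussed above:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: cert-manager-csi-driver    # hypothetical name
  labels:
    app: cert-manager-csi-driver   # hypothetical label
spec:
  selector:
    matchLabels:
      app: cert-manager-csi-driver
  podMetricsEndpoints:
    - port: http-metrics           # the named container port, no Service needed
      path: /metrics
```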
IIRC, with a ServiceMonitor, Prometheus Operator uses the Endpoints object created by the Service to discover the targets.
I agree that a PodMonitor is less effort, as we don't have to create an extra Service.
approver-policy doesn't provide a metrics.enabled flag, but I added one here because it seems more intuitive than setting the port to 0.
Here are the differences this flag brings to the chart:
$ diff -u <(helm template $chart) <(helm template $chart --set metrics.enabled=false)
--- /dev/fd/63 2024-06-28 10:17:20.876514437 +0100
+++ /dev/fd/62 2024-06-28 10:17:20.876514437 +0100
@@ -130,7 +130,7 @@
- --endpoint=$(CSI_ENDPOINT)
- --data-root=csi-data-dir
- --use-token-request=false
- - --metrics-bind-address=:9402
+ - --metrics-bind-address=0
env:
- name: NODE_ID
valueFrom:
@@ -150,8 +150,6 @@
ports:
- containerPort: 9809
name: healthz
- - containerPort: 9402
- name: http-metrics
livenessProbe:
httpGet:
path: /healthz
I noticed that cluster-api chose to (effectively) disable the metrics service by default when they removed kube-rbac-proxy authorization, because they considered the metrics too security-sensitive to make openly available.
They disabled it by binding the metrics server to a loopback address, so that it was only available to other containers in the Pod.
I will write a release note to say that we are enabling the metrics server by default, and explain how users can disable it if they are concerned about making metrics available without authorization.
Signed-off-by: Richard Wall <richard.wall@venafi.com>
Not sure if this is a very good test, but it helped me verify what the endpoint was serving, so I propose leaving it here.
It might prove useful for catching the case where controller-runtime ever changes its default metrics and removes the Go and process metrics, which we specifically want to provide.
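A simplified version of that check is sketched below. The metric names `go_goroutines` and `process_cpu_seconds_total` are standard metrics from the Prometheus client's Go and process collectors; `containsDefaultMetrics` is a hypothetical helper, not the PR's actual test:

```go
package main

import (
	"fmt"
	"strings"
)

// containsDefaultMetrics reports whether the body of a /metrics response
// still includes the Go runtime and process collector metrics.
func containsDefaultMetrics(body string) bool {
	return strings.Contains(body, "go_goroutines") &&
		strings.Contains(body, "process_cpu_seconds_total")
}

func main() {
	sample := "go_goroutines 8\nprocess_cpu_seconds_total 0.1\n"
	fmt.Println(containsDefaultMetrics(sample)) // true
}
```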
Should I add prometheus.io/scrape annotations to the Pods, so that people can still discover metrics that way? I couldn't find any good documentation about it.
I don't think that is necessary; we already have the podAnnotations variable if people want to add it themselves.
Very well thought out and written PR. I don't think it should be covered in this PR, but we should look at adding additional metrics to the endpoint for things like successful/failed mounts, successful/failed certificatesigningrequests, etc. Anything a platform team may find value in alerting on. /lgtm
Thanks @ThatsMrTalbot I'll create some follow-up PRs with those suggested metrics. /approve
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: ThatsMrTalbot, wallrj
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Manual Testing
Example Dashboards