Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Disabling webhook breaks metrics- and health-endpoints #1376

Closed
muffl0n opened this issue Apr 11, 2022 · 8 comments
Closed

[BUG] Disabling webhook breaks metrics- and health-endpoints #1376

muffl0n opened this issue Apr 11, 2022 · 8 comments
Assignees
Labels

Comments

@muffl0n
Copy link
Contributor

muffl0n commented Apr 11, 2022

When setting

bpValidatingWebhook:
  enabled: false

the webhook server is not started. Instead the "normal" server is started, serving these endpoints

  • /metrics
  • /v0/healthz

See

if isCACertMounted() {
go func(config *rest.Config) {
err := handler.RunWebhookServer(config)
if err != nil {
log.WithError(err).Print("Failed to start validating webhook server")
return
}
}(config)
} else {
s := handler.NewServer()
defer func() {
if err := s.Shutdown(ctx); err != nil {
log.WithError(err).Print("Failed to shutdown health check server")
}
}()
go func() {
if err := s.ListenAndServe(); err != nil {
log.WithError(err).Print("Failed to start health check server")
}
}()
}

Unfortunately, disabling the bpValidatingWebhook does not adjust the service created in https://github.com/kanisterio/kanister/blob/master/helm/kanister-operator/templates/service.yaml
So disabling the ValidatingWebhook actually breaks accessing /metrics via the service.
It also would break the health-checks (if there were any) cause the port has changed:

  • webhook server listens on port 9443
  • "normal" server listens on port 8000

To Reproduce

  1. Install kanister with bpValidatingWebhook enabled
  2. Access /metrics via https://kanister-kanister-operator/metrics -> metrics are served
  3. Reinstall kanister with bpValidatingWebhook disabled
  4. Access /metrics via https://kanister-kanister-operator/metrics -> error Connection refused

Expected behavior
Enabling/disabling the webhook should allow the metrics to be accessed. Also, it should not change the URL they can be reached at.

I would propose to decouple serving the webhook and the other endpoints. On the one hand, there is no need to serve the other endpoints via https. Also, as the certificate is self-signed, one would have to allow this self-signed certificate to be accepted when accessing the metrics. When accessing health-checks via https, kubernetes automatically accepts all certificates.

@muffl0n muffl0n added the bug label Apr 11, 2022
@ihcsim
Copy link
Contributor

ihcsim commented Apr 11, 2022

Sounds like the Service resource needs to expose port 8000 when the webhook is disabled. Yeah, I would prefer the health check and metrics endpoints be decoupled from the admission endpoint. How would Prometheus even scrape the metrics tls endpoint if the server is using a self-signed cert?

@muffl0n
Copy link
Contributor Author

muffl0n commented Apr 11, 2022

One can set insecure_skip_verify in tls_config. But I'm not sure about how to do that with the preferred way with ServiceMonitor. Speaking of: we should add one for Kanister ;)

@ihcsim ihcsim removed the triage label Apr 13, 2022
@github-actions
Copy link
Contributor

This issue is marked as stale due to inactivity. Add a new comment to reactivate it.

@github-actions github-actions bot added the stale label Jun 13, 2022
@ihcsim
Copy link
Contributor

ihcsim commented Jun 13, 2022

Still relevant.

@ihcsim
Copy link
Contributor

ihcsim commented Jun 16, 2022

Fixed by #1476 and #1488.

@ihcsim ihcsim closed this as completed Jun 16, 2022
@muffl0n
Copy link
Contributor Author

muffl0n commented Jun 21, 2022

Thank you for fixing this!

I took a deeper look and am having these concerns:
The endpoint (protocol and port) of the health and metrics service change when the webhook is enabled/disabled. For the user this is pretty much unexpected behavior.
Let’s say, the user configured a prometheus ServiceMonitor to scrape the operator. After enabling/disabling the webhook, some labels of the resulting metrics change unexpectedly. This could lead to confusion.

These are the solutions I am thinking of:

  1. We enable the http server (for metrics and health) as a default and start the second https server for the webhook when it is configured to be enabled. We would need a second service for the webhook, but the metrics and health endpoint would never change.

  2. We enable the https server as a default and just add the handlers we want dynamically: metrics and health as a default and webhook handler when enabled.

I would prefer solution 1, cause we really don’t need https for metrics or health, removing the need to ignore the „invalid“ certificate when scraping the metrics with prometheus.

What do you think?

@ihcsim
Copy link
Contributor

ihcsim commented Jun 21, 2022

I think ultimately, the metrics/health endpoints should be separated from the webhook. In addition,

i. the webhook should be a required component - what's the point of having it if I can just disable it
i. all traffic should be TLS'd - if someone can MITM your metrics data, they can get a very good sense of all the namespaces, workloads and policies in your cluster. The need to protect /metrics is why things like kube-rbac-proxy exists.

@ihcsim
Copy link
Contributor

ihcsim commented Jun 21, 2022

Created #1500 to track webhook hardening improvements.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants