
Add health and readiness checks to hnc-controller-manager #68

Closed
hferentschik opened this issue Aug 13, 2021 · 9 comments · Fixed by #139
Labels
lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale.

Comments

@hferentschik

I am working on a project where we use HNC as a sub-component. The project ships and installs via Helm. For now, we create our own HNC Helm chart from the provided resources and use this Helm chart as a dependency for our project.

The validating webhook causes us some grief, especially since version 0.8.0 (perhaps some code changes increased the startup time of the webhook service in the manager). The problem is described in Helm issue helm/helm#10023. The chart resources install fine, and both HNC and our project work as expected; however, the Helm release status is stuck in pending-install. The reason is that directly after installing all resources, Helm tries to update the Secret containing the Helm release information. At that point the validating webhook is not yet responding to requests, and the release update operation fails.

The error looks like this:

install.go:387: [debug] failed to record the release: update: failed to update: Internal error occurred: failed calling webhook "objects.hnc.x-k8s.io": Post "https://hnc-webhook-service.hnc-system.svc:443/validate-objects?timeout=2s": dial tcp 10.96.16.139:443: connect: connection refused

I think Helm should retry the update operation for some time, but I also think that the window in which this problem can occur could be minimized if there were a readiness probe for the manager.

I had a quick look at the code, and it seems controller-runtime is used for implementing the webhook. There seems to be no easy way to expose a health/readiness check for it in the library. I've seen that there is an issue for that in controller-runtime - kubernetes-sigs/controller-runtime#723. There is also a PR already merged - kubernetes-sigs/controller-runtime#1588. The feature is scheduled for the 0.10.x release of controller-runtime. Once this release is out, it might make sense to upgrade controller-runtime and expose the check.
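
For reference, a rough sketch of what exposing the check could look like after such an upgrade (this is not HNC's actual code; it only illustrates the manager's health-check API together with the webhook readiness checker added in kubernetes-sigs/controller-runtime#1588, and the probe port is a placeholder):

```go
package main

import (
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/healthz"
)

func main() {
	// Expose /healthz and /readyz on a dedicated probe port (placeholder value).
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		HealthProbeBindAddress: ":8081",
	})
	if err != nil {
		panic(err)
	}

	// Liveness: a static "always OK" check.
	if err := mgr.AddHealthzCheck("healthz", healthz.Ping); err != nil {
		panic(err)
	}

	// Readiness: only report ready once the webhook server is actually serving.
	// StartedChecker is the piece added by controller-runtime#1588 (v0.10+).
	if err := mgr.AddReadyzCheck("readyz", mgr.GetWebhookServer().StartedChecker()); err != nil {
		panic(err)
	}

	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		panic(err)
	}
}
```

With something like that in place, the Deployment could point its probes at /healthz and /readyz on the probe port.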

@adrianludwin
Contributor

adrianludwin commented Aug 16, 2021 via email

@yashvardhan-kukreja

yashvardhan-kukreja commented Sep 21, 2021

Hi @adrianludwin and @hferentschik 👋🏼
I guess we need not necessarily wait for upstream controller-runtime in order to expose a healthz endpoint and configure liveness and readiness probes for HNC.

I just skimmed the code and found that it would be a pretty small addition, somewhere along the lines of the following (a rough sketch follows the list):

  • In cmd/manager/main.go, I can get the webhook server via mgr.GetWebhookServer().
  • Then I can register an HTTP handler on it at the /healthz path, just like the admission HTTP handlers are registered here.
  • This ensures that on port 9443 (the webhook server's port) a /healthz endpoint is available to serve a static health check, returning a 200 every time it is reached.
  • Finally, this can be utilised to write the corresponding liveness and readiness probes.
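
A minimal sketch of that registration (a hypothetical fragment for cmd/manager/main.go; it assumes controller-runtime's webhook.Server.Register and a purely static handler):

```go
// Requires: import "net/http"
// After the manager has been created:
hookServer := mgr.GetWebhookServer()

// Serve a static health check at <pod>:9443/healthz; it returns 200 whenever
// it is reached, regardless of what the admission webhooks themselves do.
hookServer.Register("/healthz", http.HandlerFunc(
	func(w http.ResponseWriter, _ *http.Request) {
		w.WriteHeader(http.StatusOK)
		_, _ = w.Write([]byte("ok"))
	}))
```

One caveat: the webhook server serves TLS on 9443, so an httpGet probe against this endpoint would need scheme: HTTPS (the kubelet skips certificate verification for probes, so the self-signed webhook certificate should not be a problem).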

Seems pretty simple and straightforward but I'd love to hear your views :)

PS: if you think that the above idea is valid and this issue can be resolved, I'd love to pick this issue up and quickly sort it out :)

@hferentschik
Author

I guess we need not necessarily wait for upstream controller-runtime in order to expose a healthz endpoint and configure liveness and readiness probes for HNC.

Sure, this was just a suggestion. I am also not that familiar with the codebase, so I was not aware whether there are other options. I only knew that the Operator SDK did not offer this (a shortcoming I ran into myself before) and that there is an open issue for it on the Operator SDK side.

What you are describing makes sense; it probably is a step forward either way. What I am wondering about is the semantics: will this healthz/readiness check, for example, really be coupled to whether the actual webhook is accepting requests?

But as said, better than nothing IMO.

@yashvardhan-kukreja

Will this healthz/readiness check, for example, really be coupled to whether the actual webhook is accepting requests?

That's a good question. I feel that the check you are describing is more of a synthetic monitoring check than a static health check. So I don't see a straightforward way to have such a health check ensure both that the webhook server is running AND that it is successfully accepting the admission requests it is supposed to accept. Nonetheless, I am not at all denying its usefulness :)
Still, I believe that the static health check I described would suffice for setting up liveness and readiness probes, because it would at least confirm the "existence" of the webhook server, in the sense that there is something running at 9443 and that something really is the webhook server. And that, as you mentioned, is at least better than nothing for the probes.
Many of the tools I have worked with use exactly this kind of static health check to back the liveness and readiness probes of their services.

So, for now, I would suggest moving ahead with the static health check I described and establishing readiness and liveness probes on top of it :)

What do you think @adrianludwin :)

@adrianludwin
Contributor

So it turns out that controller-runtime 0.10.1 is already out, so I'd be in favour of just updating and getting this for free :) Why don't I quickly do that?

Well, I'm not actually sure we get it entirely for free, e.g. I don't know if there's something we need to do to take advantage of this new feature. But that still seems easier than building it from scratch.

wdyt?

@adrianludwin
Contributor

I'm working on #84 to upgrade to controller-runtime 0.10, but it's blocked on another change. Hopefully it'll be unblocked soon.

@adrianludwin
Contributor

Ok #84 is merged, so we're on controller-runtime 0.10.1 now. What else do we need to do to make this work?

@mgoltzsche

mgoltzsche commented Nov 24, 2021

After enabling the readiness/liveness probe endpoint within cmd/manager/main.go, you could declare the corresponding readiness/liveness probes on the Deployment in config/manager/manager.yaml.
However, please note that this might conflict with the leader election mechanism during updates: if the readiness probe only succeeds once the Pod has been elected leader, a rolling update will never complete while the previous Pod still exists and holds the leadership. To avoid this, I think you would need to set the update strategy of the manager Deployment to Recreate. However, this would mean the webhook becomes unavailable during updates, which might not be acceptable. Therefore, I think the webhook would need to be served by a separate Deployment, which would allow the operator (the asynchronous controllers) and the webhook/admission controller to be scaled independently and to use different update strategies.
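
For illustration, the probe declarations on the manager container in config/manager/manager.yaml might look roughly like this. This is a sketch only: the path, port and timings are placeholders and assume the manager exposes a dedicated health endpoint (e.g. via controller-runtime's HealthProbeBindAddress):

```yaml
# Illustrative excerpt from the manager container spec (all values are placeholders).
livenessProbe:
  httpGet:
    path: /healthz
    port: 8081
  initialDelaySeconds: 15
  periodSeconds: 20
readinessProbe:
  httpGet:
    path: /readyz
    port: 8081
  initialDelaySeconds: 5
  periodSeconds: 10
```

If the Recreate route described above were taken, that would additionally mean setting spec.strategy.type: Recreate on the Deployment itself.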

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 22, 2022