
Add health and readiness checks to hnc-controller-manager #68

Closed
hferentschik opened this issue Aug 13, 2021 · 9 comments · Fixed by #139
Labels
lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale.

Comments

@hferentschik

I am working on a project where we use HNC as a sub-component. The project ships and installs via Helm. For now, we create our own HNC Helm chart from the provided resources and use this Helm chart as a dependency for our project.

The validating webhook causes us some grief, especially since version 0.8.0 (perhaps some code changes increased the startup time of the webhook service in the manager). The problem is described in Helm issue helm/helm#10023. The chart resources install fine, and both HNC and our project work as expected; however, the Helm release status is stuck in pending-install. The reason is that directly after installing all resources, Helm tries to update the Secret containing the Helm release information. At that point the validating webhook is not yet responding to requests, and the release update operation fails.

The error looks like this:

install.go:387: [debug] failed to record the release: update: failed to update: Internal error occurred: failed calling webhook "objects.hnc.x-k8s.io": Post "https://hnc-webhook-service.hnc-system.svc:443/validate-objects?timeout=2s": dial tcp 10.96.16.139:443: connect: connection refused

I think Helm should retry the update operation for some time, but I also think that the window in which this problem can occur could be minimized if there were a readiness probe for the manager.

I had a quick look at the code, and it seems controller-runtime is used for implementing the webhook. There seems to be no easy way to expose a health/readiness check for it in the library. I've seen that there is an issue for that in controller-runtime - kubernetes-sigs/controller-runtime#723. There is also a PR already merged - kubernetes-sigs/controller-runtime#1588. The feature is scheduled for the 0.10.x release of controller-runtime. Once this release is out, it might make sense to upgrade controller-runtime and expose the check.
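
For reference, a rough sketch of what exposing the check could look like after such an upgrade (this is not HNC's actual code; it only illustrates the manager's health-check API together with the webhook readiness checker added in kubernetes-sigs/controller-runtime#1588, and the probe port is a placeholder):

```go
package main

import (
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/healthz"
)

func main() {
	// Expose /healthz and /readyz on a dedicated probe port (placeholder value).
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		HealthProbeBindAddress: ":8081",
	})
	if err != nil {
		panic(err)
	}

	// Liveness: a static "always OK" check.
	if err := mgr.AddHealthzCheck("healthz", healthz.Ping); err != nil {
		panic(err)
	}

	// Readiness: only report ready once the webhook server is actually serving.
	// StartedChecker is the piece added by controller-runtime#1588 (v0.10+).
	if err := mgr.AddReadyzCheck("readyz", mgr.GetWebhookServer().StartedChecker()); err != nil {
		panic(err)
	}

	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		panic(err)
	}
}
```

With something like that in place, the Deployment could point its probes at /healthz and /readyz on the probe port.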

@adrianludwin
Contributor

adrianludwin commented Aug 16, 2021 via email

@yashvardhan-kukreja

yashvardhan-kukreja commented Sep 21, 2021

Hi @adrianludwin and @hferentschik 👋🏼
I guess we need not necessarily wait for upstream controller-runtime in order to expose a healthz endpoint and configure liveness and readiness probes for HNC.

I just skimmed the code and found that it would be a pretty small addition, somewhere along the lines of the following (a rough sketch follows the list):

  • In cmd/manager/main.go, I can get the webhook server via mgr.GetWebhookServer().
  • Then I can register an HTTP handler on it at the /healthz path, just like the admission HTTP handlers are registered here.
  • This ensures that on port 9443 (the webhook server's port) a /healthz endpoint is available to serve a static health check, returning a 200 every time it is reached.
  • Finally, this can be utilised to write the corresponding liveness and readiness probes.
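
A minimal sketch of that registration (a hypothetical fragment for cmd/manager/main.go; it assumes controller-runtime's webhook.Server.Register and a purely static handler):

```go
// Requires: import "net/http"
// After the manager has been created:
hookServer := mgr.GetWebhookServer()

// Serve a static health check at <pod>:9443/healthz; it returns 200 whenever
// it is reached, regardless of what the admission webhooks themselves do.
hookServer.Register("/healthz", http.HandlerFunc(
	func(w http.ResponseWriter, _ *http.Request) {
		w.WriteHeader(http.StatusOK)
		_, _ = w.Write([]byte("ok"))
	}))
```

One caveat: the webhook server serves TLS on 9443, so an httpGet probe against this endpoint would need scheme: HTTPS (the kubelet skips certificate verification for probes, so the self-signed webhook certificate should not be a problem).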

Seems pretty simple and straightforward but I'd love to hear your views :)

PS: if you think that the above idea is valid and this issue can be resolved, I'd love to pick this issue up and quickly sort it out :)

@hferentschik
Author

I guess we need not necessarily wait for upstream controller-runtime in order to expose a healthz endpoint and configure liveness and readiness probes for HNC.

Sure, this was just a suggestion. I am also not that familiar with the codebase, so I was not aware whether there are other options. I only knew that the Operator SDK did not offer this (a shortcoming I ran into myself before) and that there is an open issue for it on the Operator SDK side.

What you are describing makes sense; it probably is a step forward either way. What I am wondering about is the semantics: will this healthz/readiness check, for example, really be coupled to whether the actual webhook is accepting requests?

But as said, better than nothing IMO.

@yashvardhan-kukreja

Will this healthz/readiness check, for example, really be coupled to whether the actual webhook is accepting requests?

That's a good question. I feel that the check you are describing is more of a synthetic monitoring check than a static health check. So I don't see a straightforward way to have such a health check ensure both that the webhook server is running AND that it is successfully accepting the admission requests it is supposed to accept. Nonetheless, I am not at all denying its usefulness :)
Still, I believe that the static health check I described would suffice for setting up liveness and readiness probes, because it would at least confirm the "existence" of the webhook server, in the sense that there is something running at 9443 and that something really is the webhook server. And that, as you mentioned, is at least better than nothing for the probes.
Many of the tools I have worked with use exactly this kind of static health check to back the liveness and readiness probes of their services.

So, for now, I would suggest moving ahead with the static health check I described and establishing readiness and liveness probes on top of it :)

What do you think @adrianludwin :)

@adrianludwin
Contributor

So it turns out that controller-runtime 0.10.1 is already out, so I'd be in favour of just updating and getting this for free :) Why don't I quickly do that?

Well, I'm not actually sure we get it entirely for free, e.g. I don't know if there's something we need to do to take advantage of this new feature. But that still seems easier than building it from scratch.

wdyt?

@adrianludwin
Contributor

I'm working on #84 to upgrade to controller-runtime 0.10, but it's blocked on another change. Hopefully it'll be unblocked soon.

@adrianludwin
Contributor

Ok #84 is merged, so we're on controller-runtime 0.10.1 now. What else do we need to do to make this work?

@mgoltzsche

mgoltzsche commented Nov 24, 2021

After enabling the readiness/liveness probe endpoint within cmd/manager/main.go, you could declare the corresponding readiness/liveness probes on the Deployment in config/manager/manager.yaml.
However, please note that this might conflict with the leader election mechanism during updates: if the readiness probe only succeeds once the Pod has been elected leader, a rolling update will never complete while the previous Pod still exists and holds the leadership. To avoid this, I think you would need to set the update strategy of the manager Deployment to Recreate. However, this would mean the webhook becomes unavailable during updates, which might not be acceptable. Therefore, I think the webhook would need to be served by a separate Deployment, which would allow the operator (the asynchronous controllers) and the webhook/admission controller to be scaled independently and to use different update strategies.
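
For illustration, the probe declarations on the manager container in config/manager/manager.yaml might look roughly like this. This is a sketch only: the path, port and timings are placeholders and assume the manager exposes a dedicated health endpoint (e.g. via controller-runtime's HealthProbeBindAddress):

```yaml
# Illustrative excerpt from the manager container spec (all values are placeholders).
livenessProbe:
  httpGet:
    path: /healthz
    port: 8081
  initialDelaySeconds: 15
  periodSeconds: 20
readinessProbe:
  httpGet:
    path: /readyz
    port: 8081
  initialDelaySeconds: 5
  periodSeconds: 10
```

If the Recreate route described above were taken, that would additionally mean setting spec.strategy.type: Recreate on the Deployment itself.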

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 22, 2022