Add health and readiness checks to hnc-controller-manager #68
Comments
This sounds like a useful and sensible change to make, thanks! Please ping
us when controller-runtime 0.10 is out if we forget.
On Fri, Aug 13, 2021 at 5:10 AM Hardy Ferentschik wrote:
I am working on a project where we use HNC as a sub-component. The project
ships and installs via Helm. For now, we create our own HNC Helm chart from
the provided resources and use this Helm chart as a dependency for our
project.
The validating webhook causes us some grief, especially since version
0.8.0 (maybe some code changes increased the startup time of the webhook
service in the manager). The problem is described in Helm issue
helm/helm#10023. The chart resources are installed fine, and HNC, as well
as our project, is working fine; however, the Helm release status is stuck
on *pending-install*. The reason is that directly after installing all
resources, Helm tries to update the Secret containing the Helm release
information. At this time the validating webhook is not responding to
requests yet, and the release update operation fails.
The error looks like this:
install.go:387: [debug] failed to record the release: update: failed to update: Internal error occurred: failed calling webhook "objects.hnc.x-k8s.io": Post "https://hnc-webhook-service.hnc-system.svc:443/validate-objects?timeout=2s": dial tcp 10.96.16.139:443: connect: connection refused
I think Helm should retry the update operation for some time, but I also
think that the window in which this problem can occur could be minimized if
there were a readiness probe for the manager.
I had a quick look at the code and it seems controller-runtime is used for
implementing the webhook. There seems to be no easy way to expose a
health/readiness check in the library. I've seen that there is an issue for
that in controller-runtime, kubernetes-sigs/controller-runtime#723. There
is also a PR already merged, kubernetes-sigs/controller-runtime#1588. The
feature is scheduled for release 0.10.x of controller-runtime. Once this
release is out, it might make sense to upgrade controller-runtime and
expose the check.
|
Hi @adrianludwin and @hferentschik 👋🏼 I just skimmed the code and found that it would be a pretty small addition, somewhere along the lines of:
Seems pretty simple and straightforward but I'd love to hear your views :) PS: if you think that the above idea is valid and this issue can be resolved, I'd love to pick this issue up and quickly sort it out :) |
Sure, this was just a suggestion. I am also not so familiar with the codebase, so I was not aware of whether there are other options. I only knew that the operator sdk did not offer this (a shortcoming I ran into myself before) and that there is an open issue for that on the operator sdk side. What you are describing makes sense. It is probably a step forward either way. What I am wondering about is the semantics. Will this healthz/ready check, for example, really be coupled to whether the actual webhook is accepting requests? But as said, better than nothing imo. |
That's a good question. I feel that the check you are describing is more of a synthetic monitoring check than a static health check. So, I don't see a straightforward way to have a health check that ensures both that the webhook server is running AND that it is successfully accepting the admission requests it is supposed to accept. Nonetheless, I am not at all denying the use of it :) So, I would suggest moving ahead with the static health check I described for now, and establishing a readiness and a liveness probe over it :) What do you think, @adrianludwin? :) |
So it turns out that controller-runtime 0.10.1 is already out, so I'd be in favour of just updating and getting it for free :) Why don't I quickly do that. Well, I'm not actually sure if you get it for free, e.g. I don't know if there's something we need to do to take advantage of this new feature. But that still seems easier than building it from scratch. wdyt? |
I'm working on #84 to upgrade to controller-runtime 0.10 but it's blocked on another change. Hopefully it'll be unblocked soon. |
Ok #84 is merged, so we're on controller-runtime 0.10.1 now. What else do we need to do to make this work? |
After enabling the readiness/liveness probe endpoint within the |
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale |