
Add liveness and startup/readiness probes #139

Merged: 1 commit into kubernetes-sigs:master on Mar 16, 2022

Conversation

@erikgb (Contributor) commented Jan 30, 2022

This adds liveness and startup/readiness probes to the HNC Deployment, using functionality recently added to controller-runtime. The default probe configuration (timeouts etc.) will probably need some adjustment to work without issues for a typical HNC deployment, and should be tested in a live cluster - which I do not have available.
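
For orientation, here is a minimal, self-contained sketch of the controller-runtime wiring this relies on; the port and check names are assumptions, and the actual change lives in cmd/manager/main.go:

package main

import (
	"os"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/healthz"
)

func main() {
	// HealthProbeBindAddress serves /healthz and /readyz; the Deployment's
	// liveness and startup/readiness probes point at this port (:8081 is an
	// assumption, not necessarily what HNC uses).
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		HealthProbeBindAddress: ":8081",
	})
	if err != nil {
		os.Exit(1)
	}
	// Liveness: healthz.Ping always returns nil, so it only proves the
	// probe endpoint itself is serving.
	if err := mgr.AddHealthzCheck("healthz", healthz.Ping); err != nil {
		os.Exit(1)
	}
	// Readiness: starts out as the same ping check; a deeper webhook-based
	// check is discussed further down in this thread.
	if err := mgr.AddReadyzCheck("readyz", healthz.Ping); err != nil {
		os.Exit(1)
	}
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		os.Exit(1)
	}
}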

Testing: I ran all tests successfully on my local workstation when using the simple ping probe. But when changing to the deeper webhook probe, the container never becomes ready. It seems to be an issue related to cert-rotator - which I do not know at all; we use cert-manager. HELP!

Fixes #68

/kind feature
/cc @adrianludwin

@k8s-ci-robot k8s-ci-robot added the kind/feature, cncf-cla: yes, and needs-ok-to-test labels Jan 30, 2022
@k8s-ci-robot (Contributor)

Hi @erikgb. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the size/M label Jan 30, 2022
@adrianludwin (Contributor) left a comment

/ok-to-test
/cc @rjbez17

Just to confirm - since these checks don't hook into any HNC logic (currently), what are they actually testing that K8s wasn't already able to detect? E.g. I would have assumed that if the pod crashed on startup we'd already know. So are we actually flagging any new failure modes here?

Or is this more to be compliant with clusters that might have a rule like "all pods should have readiness and liveness checks"?

Thanks!

cmd/manager/main.go (review thread: outdated, resolved)
@k8s-ci-robot k8s-ci-robot added the ok-to-test label and removed the needs-ok-to-test label Feb 7, 2022
@erikgb (Contributor, Author) commented Feb 7, 2022

> Just to confirm - since these checks don't hook into any HNC logic (currently), what are they actually testing that K8s wasn't already able to detect? E.g. I would have assumed that if the pod crashed on startup we'd already know. So are we actually flagging any new failure modes here?

As I tried to say in our call last Friday, we could probably do more to get a correct health assessment in HNC, since this does not touch the non-standard controller code in HNC. So consider this a light start. I am not exactly sure what the probe actually checks by default in controller-runtime, but I would expect it not to mark a container as ready before the cache is synced. And it seems to be working fine for all our controllers that have both reconcilers and admission webhooks (of all kinds).
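
(As an illustration of that expectation, a hypothetical custom readiness checker gated on cache sync could look like the sketch below; the helper name is made up and this is not part of the PR:)

package main

import (
	"context"
	"errors"
	"net/http"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/healthz"
)

// cacheSyncedChecker reports ready only once the manager's informer cache
// has synced (hypothetical; not part of this PR).
func cacheSyncedChecker(mgr ctrl.Manager) healthz.Checker {
	return func(req *http.Request) error {
		// Bound the wait so the probe handler never blocks for long.
		ctx, cancel := context.WithTimeout(req.Context(), time.Second)
		defer cancel()
		if !mgr.GetCache().WaitForCacheSync(ctx) {
			return errors.New("informer cache not synced yet")
		}
		return nil
	}
}

It would be registered with mgr.AddReadyzCheck("cache-synced", cacheSyncedChecker(mgr)).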

I am observing a bunch of API calls failing because the HNC webhook is configured in the API server, but no webhook endpoint is ready to handle the (totally unrelated) request. This is not optimal, and affects general cluster availability! I would like the webhook endpoint to be highly available, as it is critical for my cluster.

@adrianludwin (Contributor)

Yes, I know that hooking into the HNC logic would be a good next step, I'm just curious as to what this PR does in isolation. I forgot that the cache sync was a possibility.

Does it make sense for you to try this in a cluster and see if it improves the problems you're describing?

@erikgb (Contributor, Author) commented Feb 7, 2022

> Does it make sense for you to try this in a cluster and see if it improves the problems you're describing?

Are PR images available in any public registry? We run multi-tenant clusters, so it is not always that easy to test work-in-progress like this - since I am not personally a cluster-admin. We practice GitOps, and I think creating a new pipeline just to test this can be a bit tricky. But I can try to see what can be done.

I would not expect this PR to fully fix our problem, since I believe HNC requires additional time to "warm up".

@adrianludwin (Contributor)

> Are PR images available in any public registry?

No, you'd have to build the image yourself and deploy it to a test cluster.

> We run multi-tenant clusters, so it is not always that easy to test work-in-progress like this - since I am not personally a cluster-admin. We practice GitOps, and I think creating a new pipeline just to test this can be a bit tricky. But I can try to see what can be done.

I wouldn't try this in prod :)

> I would not expect this PR to fully fix our problem, since I believe HNC requires additional time to "warm up".

Yeah agreed, I'd just be curious as to whether the probes gave any additional information at all.

@erikgb erikgb marked this pull request as draft February 11, 2022 16:08
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress label Feb 11, 2022
@erikgb (Contributor, Author) commented Feb 11, 2022

I am trying to base the health check on:

https://github.com/kubernetes-sigs/controller-runtime/blob/6c7c1d73fc5ed965c2962d6f4552b15f9841ce86/pkg/webhook/server.go#L293-L317
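
(For readers without the link handy, here is a simplified paraphrase of that StartedChecker, based on my reading of the controller-runtime source at that commit; the mutex and "started" bookkeeping are omitted and details may differ:)

package main

import (
	"crypto/tls"
	"fmt"
	"net"
	"net/http"
	"strconv"
	"time"

	"sigs.k8s.io/controller-runtime/pkg/healthz"
)

// startedChecker fails until a TLS connection to the webhook server's
// listen address succeeds, i.e. until the server is actually serving.
func startedChecker(host string, port int) healthz.Checker {
	cfg := &tls.Config{InsecureSkipVerify: true}
	return func(_ *http.Request) error {
		d := &net.Dialer{Timeout: 10 * time.Second}
		conn, err := tls.DialWithDialer(d, "tcp", net.JoinHostPort(host, strconv.Itoa(port)), cfg)
		if err != nil {
			return fmt.Errorf("webhook server is not reachable: %w", err)
		}
		return conn.Close()
	}
}

Since the check has to complete a TLS handshake, it presumably cannot succeed before the webhook server has started with its serving certs in place, which may be why it interacts badly with cert-rotator here.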

But HNC seems unable to start the cert-rotator successfully when enabling this. I am looking into it, but please let me know if you have any ideas.

@adrianludwin (Contributor)

Maybe paste more information here so that others can have a look at what you're seeing?

@erikgb (Contributor, Author) commented Feb 11, 2022

With the Ping health check all is good. But when I try to include the health check that is supposed to check whether the webhook endpoint is ready:

diff --git a/cmd/manager/main.go b/cmd/manager/main.go
index b4117677..90db78d6 100644
--- a/cmd/manager/main.go
+++ b/cmd/manager/main.go
@@ -236,7 +236,8 @@ func createManager() ctrl.Manager {
                setupLog.Error(err, "unable to set up health check")
                os.Exit(1)
        }
-       if err := mgr.AddReadyzCheck("readyz", healthz.Ping); err != nil {
+       readyChecker := mgr.GetWebhookServer().StartedChecker()
+       if err := mgr.AddReadyzCheck("readyz", readyChecker); err != nil {
                setupLog.Error(err, "unable to set up ready check")

Nothing works, and so far I haven't got anything interesting out of the logs:

{"level":"info","ts":1644597871.7143803,"logger":"setup","msg":"Starting main.go:init()"}
{"level":"info","ts":1644597871.7166357,"logger":"setup","msg":"Finished main.go:init()"}
{"level":"info","ts":1644597871.7166703,"logger":"setup","msg":"Parsing flags"}
{"level":"info","ts":1644597871.7167115,"logger":"setup","msg":"Creating OpenCensus->Stackdriver exporter"}
{"level":"error","ts":1644597871.7183151,"logger":"setup","msg":"Could not create Stackdriver exporter","error":"stackdriver: google: could not find default credentials. See https://developers.google.com/accounts/docs/application-default-credentials for more information.","stacktrace":"main.main\n\t/workspace/cmd/manager/main.go:91\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:255"}
{"level":"info","ts":1644597871.7183697,"logger":"setup","msg":"Creating Prometheus exporter"}
{"level":"info","ts":1644597871.7184758,"logger":"setup","msg":"Configuring controller-manager"}
{"level":"info","ts":1644597871.7393453,"logger":"controller-runtime.metrics","msg":"Metrics server is starting to listen","addr":":8080"}
{"level":"info","ts":1644597871.739669,"logger":"setup","msg":"Starting certificate generation"}
{"level":"info","ts":1644597871.73979,"logger":"setup","msg":"Starting manager"}
{"level":"info","ts":1644597871.739821,"logger":"setup","msg":"Waiting for certificate generation to complete"}
{"level":"info","ts":1644597871.7400708,"logger":"controller-runtime.webhook.webhooks","msg":"Starting webhook server"}
{"level":"info","ts":1644597871.7401395,"msg":"Starting server","path":"/metrics","kind":"metrics","addr":"[::]:8080"}
{"level":"info","ts":1644597871.740153,"msg":"Stopping and waiting for non leader election runnables"}
{"level":"info","ts":1644597871.7401528,"msg":"Starting server","kind":"health probe","addr":"[::]:8081"}
{"level":"info","ts":1644597871.7402065,"msg":"Stopping and waiting for leader election runnables"}
{"level":"info","ts":1644597871.7402062,"logger":"cert-rotation","msg":"starting cert rotator controller"}
{"level":"info","ts":1644597871.740309,"logger":"controller.cert-rotator","msg":"Starting EventSource","source":"&{{%!s(*v1.Secret=&{{ } {      0 {{0 0 <nil>}} <nil> <nil> map[] map[] [] []  []} <nil> map[] map[] }) %!s(*cache.informerCache=&{0xc0006fbdc0}) %!s(chan error=<nil>) %!s(func()=<nil>)}}"}
{"level":"info","ts":1644597871.7403488,"logger":"controller.cert-rotator","msg":"Starting EventSource","source":"&{{%!s(*unstructured.Unstructured=&{map[apiVersion:admissionregistration.k8s.io/v1 kind:ValidatingWebhookConfiguration]}) %!s(*cache.informerCache=&{0xc0006fbdc0}) %!s(chan error=<nil>) %!s(func()=<nil>)}}"}
{"level":"info","ts":1644597871.740357,"logger":"controller.cert-rotator","msg":"Starting EventSource","source":"&{{%!s(*unstructured.Unstructured=&{map[apiVersion:admissionregistration.k8s.io/v1 kind:MutatingWebhookConfiguration]}) %!s(*cache.informerCache=&{0xc0006fbdc0}) %!s(chan error=<nil>) %!s(func()=<nil>)}}"}
{"level":"info","ts":1644597871.7403603,"logger":"controller.cert-rotator","msg":"Starting Controller"}
{"level":"error","ts":1644597871.7403653,"logger":"controller.cert-rotator","msg":"Could not wait for Cache to sync","error":"failed to wait for cert-rotator caches to sync: timed out waiting for cache to be synced"}
{"level":"error","ts":1644597871.7403746,"msg":"error received after stop sequence was engaged","error":"failed to wait for cert-rotator caches to sync: timed out waiting for cache to be synced"}
{"level":"info","ts":1644597901.7410984,"msg":"Stopping and waiting for caches"}
{"level":"info","ts":1644597901.7412927,"msg":"Stopping and waiting for webhooks"}
{"level":"info","ts":1644597901.7413201,"msg":"Wait completed, proceeding to shutdown the manager"}
{"level":"error","ts":1644597901.7410095,"logger":"setup","msg":"problem running manager","error":"[open /tmp/k8s-webhook-server/serving-certs/tls.crt: no such file or directory, failed waiting for all runnables to end within grace period of 30s: context deadline exceeded]","errorCauses":[{"error":"open /tmp/k8s-webhook-server/serving-certs/tls.crt: no such file or directory"},{"error":"failed waiting for all runnables to end within grace period of 30s: context deadline exceeded"}],"stacktrace":"runtime.main\n\t/usr/local/go/src/runtime/proc.go:255"}

@erikgb erikgb force-pushed the feat/probes branch 2 times, most recently from b6c52fe to f9553f6, on February 19, 2022 12:18
@erikgb erikgb marked this pull request as ready for review February 19, 2022 12:20
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress label Feb 19, 2022
@erikgb erikgb changed the title Add liveness and readiness probes Add liveness and startup/readiness probes Feb 19, 2022
@erikgb (Contributor, Author) commented Feb 19, 2022

@adrianludwin Even though CI reports this PR as OK, I am unable to make it work in my local k3s cluster. I do not understand why, but it seems to be related to cert-rotator - which I don't know; we use cert-manager for all cert-related work. Could you please test this PR in some of your environments?

/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold label Feb 19, 2022
@adrianludwin (Contributor) commented Mar 4, 2022

As discussed earlier - this works for me, as is, on GKE 1.23 and Kind 1.21. So I'm happy to see this go in if you are.

@rjbez17 what do you think about this? Any concerns?

/lgtm
/approve
/hold
/assign @rjbez17

@adrianludwin (Contributor)

/lgtm cancel
/approve cancel

@k8s-ci-robot k8s-ci-robot added the lgtm label Mar 4, 2022
@adrianludwin (Contributor)

@erikgb Can you please fix the commit messages? You can say that you couldn't get it to work, but that I could.

@adrianludwin (Contributor)

/lgtm cancel
/approve cancel

@k8s-ci-robot k8s-ci-robot removed the lgtm label Mar 4, 2022
@k8s-ci-robot k8s-ci-robot added the do-not-merge/invalid-commit-message label Mar 6, 2022
@erikgb (Contributor, Author) commented Mar 6, 2022

> @erikgb Can you please fix the commit messages? You can say that you couldn't get it to work, but that I could.

Done!

This adds liveness and startup/readiness probes to the HNC Deployment, using functionality recently added to controller-runtime. The default probe configuration (timeouts etc.) will probably need some adjustment to work without issues for a typical HNC deployment, and should be tested in a live cluster - which I do not have available.

Testing: I ran all tests successfully on my local workstation when using the simple ping probe. But when changing to the deeper webhook probe, the container never becomes ready. It seems to be an issue related to cert-rotator - which I do not know at all; we use cert-manager. On request, Adrian Ludwin took a look at my issues, and somehow this seems to work for him.
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/invalid-commit-message label Mar 6, 2022
@adrianludwin (Contributor) left a comment

/lgtm
/approve

I'm comfortable with this given my testing results.

@adrianludwin (Contributor)

Checking with @rjbez17 to see if he wants to review, otherwise I'll remove the hold later today.

@k8s-ci-robot k8s-ci-robot added the lgtm label Mar 16, 2022
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: adrianludwin, erikgb

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved label Mar 16, 2022
@adrianludwin (Contributor)

Ryan appears to be away so let's just put this in.

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold label Mar 16, 2022
@k8s-ci-robot k8s-ci-robot merged commit 7a913f1 into kubernetes-sigs:master Mar 16, 2022
@adrianludwin adrianludwin added this to the release-v1.0 milestone Mar 29, 2022

Successfully merging this pull request may close these issues.

Add health and readiness checks to hnc-controller-manager
4 participants