
Add liveness and startup/readiness probes #139

Merged: 1 commit into kubernetes-sigs:master on Mar 16, 2022

Conversation

@erikgb (Contributor) commented Jan 30, 2022

This adds liveness and startup/readiness probes to the HNC Deployment, using functionality recently added to controller-runtime. The default probe configuration (timeouts etc.) will probably need some adjustment to work without issues for a typical HNC deployment, and should be tested in a live cluster - which I do not have available.
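
For orientation, here is a minimal, self-contained sketch of the controller-runtime wiring this relies on; the port and check names are assumptions, and the actual change lives in cmd/manager/main.go:

package main

import (
	"os"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/healthz"
)

func main() {
	// HealthProbeBindAddress serves /healthz and /readyz; the Deployment's
	// liveness and startup/readiness probes point at this port (:8081 is an
	// assumption, not necessarily what HNC uses).
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		HealthProbeBindAddress: ":8081",
	})
	if err != nil {
		os.Exit(1)
	}
	// Liveness: healthz.Ping always returns nil, so it only proves the
	// probe endpoint itself is serving.
	if err := mgr.AddHealthzCheck("healthz", healthz.Ping); err != nil {
		os.Exit(1)
	}
	// Readiness: starts out as the same ping check; a deeper webhook-based
	// check is discussed further down in this thread.
	if err := mgr.AddReadyzCheck("readyz", healthz.Ping); err != nil {
		os.Exit(1)
	}
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		os.Exit(1)
	}
}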

Testing: I ran all tests successfully on my local workstation when using the simple ping probe. But when changing to the deeper webhook probe, the container never becomes ready. It seems to be an issue related to cert-rotator - which I do not know at all; we use cert-manager. HELP!

Fixes #68

/kind feature
/cc @adrianludwin

@k8s-ci-robot k8s-ci-robot added the kind/feature, cncf-cla: yes, and needs-ok-to-test labels Jan 30, 2022
@k8s-ci-robot (Contributor)

Hi @erikgb. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the size/M label Jan 30, 2022
@adrianludwin (Contributor) left a comment

/ok-to-test
/cc @rjbez17

Just to confirm - since these checks don't hook into any HNC logic (currently), what are they actually testing that K8s wasn't already able to detect? E.g. I would have assumed that if the pod crashed on startup we'd already know. So are we actually flagging any new failure modes here?

Or is this more to be compliant with clusters that might have a rule like "all pods should have readiness and liveness checks"?

Thanks!

cmd/manager/main.go (review thread: outdated, resolved)
@k8s-ci-robot k8s-ci-robot added the ok-to-test label and removed the needs-ok-to-test label Feb 7, 2022
@erikgb (Contributor, Author) commented Feb 7, 2022

> Just to confirm - since these checks don't hook into any HNC logic (currently), what are they actually testing that K8s wasn't already able to detect? E.g. I would have assumed that if the pod crashed on startup we'd already know. So are we actually flagging any new failure modes here?

As I tried to say in our call last Friday, we could probably do more to get a correct health assessment in HNC, since this does not touch the non-standard controller code in HNC. So consider this a light start. I am not exactly sure what the probe actually checks by default in controller-runtime, but I would expect it not to mark a container as ready before the cache is synced. And it seems to be working fine for all our controllers that have both reconcilers and admission webhooks (of all kinds).
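
(As an illustration of that expectation, a hypothetical custom readiness checker gated on cache sync could look like the sketch below; the helper name is made up and this is not part of the PR:)

package main

import (
	"context"
	"errors"
	"net/http"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/healthz"
)

// cacheSyncedChecker reports ready only once the manager's informer cache
// has synced (hypothetical; not part of this PR).
func cacheSyncedChecker(mgr ctrl.Manager) healthz.Checker {
	return func(req *http.Request) error {
		// Bound the wait so the probe handler never blocks for long.
		ctx, cancel := context.WithTimeout(req.Context(), time.Second)
		defer cancel()
		if !mgr.GetCache().WaitForCacheSync(ctx) {
			return errors.New("informer cache not synced yet")
		}
		return nil
	}
}

It would be registered with mgr.AddReadyzCheck("cache-synced", cacheSyncedChecker(mgr)).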

I am observing a bunch of API calls failing because the HNC webhook is configured in the API server, but no webhook endpoint is ready to handle the (totally unrelated) request. This is not optimal, and affects general cluster availability! I would like the webhook endpoint to be highly available, as it is critical for my cluster.

@adrianludwin (Contributor)

Yes, I know that hooking into the HNC logic would be a good next step, I'm just curious as to what this PR does in isolation. I forgot that the cache sync was a possibility.

Does it make sense for you to try this in a cluster and see if it improves the problems you're describing?

@erikgb (Contributor, Author) commented Feb 7, 2022

> Does it make sense for you to try this in a cluster and see if it improves the problems you're describing?

Are PR images available in any public registry? We run multi-tenant clusters, so it is not always that easy to test work-in-progress like this - since I am not personally a cluster-admin. We practice GitOps, and I think creating a new pipeline just to test this can be a bit tricky. But I can try to see what can be done.

I would not expect this PR to fully fix our problem, since I believe HNC requires additional time to "warm up".

@adrianludwin (Contributor)

> Are PR images available in any public registry?

No, you'd have to build the image yourself and deploy it to a test cluster.

> We run multi-tenant clusters, so it is not always that easy to test work-in-progress like this - since I am not personally a cluster-admin. We practice GitOps, and I think creating a new pipeline just to test this can be a bit tricky. But I can try to see what can be done.

I wouldn't try this in prod :)

> I would not expect this PR to fully fix our problem, since I believe HNC requires additional time to "warm up".

Yeah agreed, I'd just be curious as to whether the probes gave any additional information at all.

@erikgb erikgb marked this pull request as draft February 11, 2022 16:08
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress label Feb 11, 2022
@erikgb (Contributor, Author) commented Feb 11, 2022

I am trying to base the health check on:

https://github.com/kubernetes-sigs/controller-runtime/blob/6c7c1d73fc5ed965c2962d6f4552b15f9841ce86/pkg/webhook/server.go#L293-L317
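
(For readers without the link handy, here is a simplified paraphrase of that StartedChecker, based on my reading of the controller-runtime source at that commit; the mutex and "started" bookkeeping are omitted and details may differ:)

package main

import (
	"crypto/tls"
	"fmt"
	"net"
	"net/http"
	"strconv"
	"time"

	"sigs.k8s.io/controller-runtime/pkg/healthz"
)

// startedChecker fails until a TLS connection to the webhook server's
// listen address succeeds, i.e. until the server is actually serving.
func startedChecker(host string, port int) healthz.Checker {
	cfg := &tls.Config{InsecureSkipVerify: true}
	return func(_ *http.Request) error {
		d := &net.Dialer{Timeout: 10 * time.Second}
		conn, err := tls.DialWithDialer(d, "tcp", net.JoinHostPort(host, strconv.Itoa(port)), cfg)
		if err != nil {
			return fmt.Errorf("webhook server is not reachable: %w", err)
		}
		return conn.Close()
	}
}

Since the check has to complete a TLS handshake, it presumably cannot succeed before the webhook server has started with its serving certs in place, which may be why it interacts badly with cert-rotator here.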

But HNC seems unable to start the cert-rotator successfully when enabling this. I am looking into it, but please let me know if you have any ideas.

@adrianludwin (Contributor)

Maybe paste more information here so that others can have a look at what you're seeing?

@erikgb (Contributor, Author) commented Feb 11, 2022

With the Ping health check all is good. But when I try to include the health check that is supposed to check whether the webhook endpoint is ready:

diff --git a/cmd/manager/main.go b/cmd/manager/main.go
index b4117677..90db78d6 100644
--- a/cmd/manager/main.go
+++ b/cmd/manager/main.go
@@ -236,7 +236,8 @@ func createManager() ctrl.Manager {
                setupLog.Error(err, "unable to set up health check")
                os.Exit(1)
        }
-       if err := mgr.AddReadyzCheck("readyz", healthz.Ping); err != nil {
+       readyChecker := mgr.GetWebhookServer().StartedChecker()
+       if err := mgr.AddReadyzCheck("readyz", readyChecker); err != nil {
                setupLog.Error(err, "unable to set up ready check")

Nothing works, and so far I haven't got anything interesting out of the logs:

{"level":"info","ts":1644597871.7143803,"logger":"setup","msg":"Starting main.go:init()"}
{"level":"info","ts":1644597871.7166357,"logger":"setup","msg":"Finished main.go:init()"}
{"level":"info","ts":1644597871.7166703,"logger":"setup","msg":"Parsing flags"}
{"level":"info","ts":1644597871.7167115,"logger":"setup","msg":"Creating OpenCensus->Stackdriver exporter"}
{"level":"error","ts":1644597871.7183151,"logger":"setup","msg":"Could not create Stackdriver exporter","error":"stackdriver: google: could not find default credentials. See https://developers.google.com/accounts/docs/application-default-credentials for more information.","stacktrace":"main.main\n\t/workspace/cmd/manager/main.go:91\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:255"}
{"level":"info","ts":1644597871.7183697,"logger":"setup","msg":"Creating Prometheus exporter"}
{"level":"info","ts":1644597871.7184758,"logger":"setup","msg":"Configuring controller-manager"}
{"level":"info","ts":1644597871.7393453,"logger":"controller-runtime.metrics","msg":"Metrics server is starting to listen","addr":":8080"}
{"level":"info","ts":1644597871.739669,"logger":"setup","msg":"Starting certificate generation"}
{"level":"info","ts":1644597871.73979,"logger":"setup","msg":"Starting manager"}
{"level":"info","ts":1644597871.739821,"logger":"setup","msg":"Waiting for certificate generation to complete"}
{"level":"info","ts":1644597871.7400708,"logger":"controller-runtime.webhook.webhooks","msg":"Starting webhook server"}
{"level":"info","ts":1644597871.7401395,"msg":"Starting server","path":"/metrics","kind":"metrics","addr":"[::]:8080"}
{"level":"info","ts":1644597871.740153,"msg":"Stopping and waiting for non leader election runnables"}
{"level":"info","ts":1644597871.7401528,"msg":"Starting server","kind":"health probe","addr":"[::]:8081"}
{"level":"info","ts":1644597871.7402065,"msg":"Stopping and waiting for leader election runnables"}
{"level":"info","ts":1644597871.7402062,"logger":"cert-rotation","msg":"starting cert rotator controller"}
{"level":"info","ts":1644597871.740309,"logger":"controller.cert-rotator","msg":"Starting EventSource","source":"&{{%!s(*v1.Secret=&{{ } {      0 {{0 0 <nil>}} <nil> <nil> map[] map[] [] []  []} <nil> map[] map[] }) %!s(*cache.informerCache=&{0xc0006fbdc0}) %!s(chan error=<nil>) %!s(func()=<nil>)}}"}
{"level":"info","ts":1644597871.7403488,"logger":"controller.cert-rotator","msg":"Starting EventSource","source":"&{{%!s(*unstructured.Unstructured=&{map[apiVersion:admissionregistration.k8s.io/v1 kind:ValidatingWebhookConfiguration]}) %!s(*cache.informerCache=&{0xc0006fbdc0}) %!s(chan error=<nil>) %!s(func()=<nil>)}}"}
{"level":"info","ts":1644597871.740357,"logger":"controller.cert-rotator","msg":"Starting EventSource","source":"&{{%!s(*unstructured.Unstructured=&{map[apiVersion:admissionregistration.k8s.io/v1 kind:MutatingWebhookConfiguration]}) %!s(*cache.informerCache=&{0xc0006fbdc0}) %!s(chan error=<nil>) %!s(func()=<nil>)}}"}
{"level":"info","ts":1644597871.7403603,"logger":"controller.cert-rotator","msg":"Starting Controller"}
{"level":"error","ts":1644597871.7403653,"logger":"controller.cert-rotator","msg":"Could not wait for Cache to sync","error":"failed to wait for cert-rotator caches to sync: timed out waiting for cache to be synced"}
{"level":"error","ts":1644597871.7403746,"msg":"error received after stop sequence was engaged","error":"failed to wait for cert-rotator caches to sync: timed out waiting for cache to be synced"}
{"level":"info","ts":1644597901.7410984,"msg":"Stopping and waiting for caches"}
{"level":"info","ts":1644597901.7412927,"msg":"Stopping and waiting for webhooks"}
{"level":"info","ts":1644597901.7413201,"msg":"Wait completed, proceeding to shutdown the manager"}
{"level":"error","ts":1644597901.7410095,"logger":"setup","msg":"problem running manager","error":"[open /tmp/k8s-webhook-server/serving-certs/tls.crt: no such file or directory, failed waiting for all runnables to end within grace period of 30s: context deadline exceeded]","errorCauses":[{"error":"open /tmp/k8s-webhook-server/serving-certs/tls.crt: no such file or directory"},{"error":"failed waiting for all runnables to end within grace period of 30s: context deadline exceeded"}],"stacktrace":"runtime.main\n\t/usr/local/go/src/runtime/proc.go:255"}

@erikgb erikgb force-pushed the feat/probes branch 2 times, most recently from b6c52fe to f9553f6, on February 19, 2022 12:18
@erikgb erikgb marked this pull request as ready for review February 19, 2022 12:20
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress label Feb 19, 2022
@erikgb erikgb changed the title Add liveness and readiness probes Add liveness and startup/readiness probes Feb 19, 2022
@erikgb (Contributor, Author) commented Feb 19, 2022

@adrianludwin Even though CI reports this PR as OK, I am unable to make it work in my local k3s cluster. I do not understand why, but it seems to be related to cert-rotator - which I don't know; we use cert-manager for all cert-related work. Could you please test this PR in some of your environments?

/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold label Feb 19, 2022
@adrianludwin (Contributor) commented Mar 4, 2022

As discussed earlier - this works for me, as is, on GKE 1.23 and Kind 1.21. So I'm happy to see this go in if you are.

@rjbez17 what do you think about this? Any concerns?

/lgtm
/approve
/hold
/assign @rjbez17

@adrianludwin (Contributor)

/lgtm cancel
/approve cancel

@k8s-ci-robot k8s-ci-robot added the lgtm label Mar 4, 2022
@adrianludwin (Contributor)

@erikgb Can you please fix the commit messages? You can say that you couldn't get it to work, but that I could.

@adrianludwin (Contributor)

/lgtm cancel
/approve cancel

@k8s-ci-robot k8s-ci-robot removed the lgtm label Mar 4, 2022
@k8s-ci-robot k8s-ci-robot added the do-not-merge/invalid-commit-message label Mar 6, 2022
@erikgb (Contributor, Author) commented Mar 6, 2022

> @erikgb Can you please fix the commit messages? You can say that you couldn't get it to work, but that I could.

Done!

This adds liveness and startup/readiness probes to the HNC Deployment, using functionality recently added to controller-runtime. The default probe configuration (timeouts etc.) will probably need some adjustment to work without issues for a typical HNC deployment, and should be tested in a live cluster - which I do not have available.

Testing: I ran all tests successfully on my local workstation when using the simple ping probe. But when changing to the deeper webhook probe, the container never becomes ready. It seems to be an issue related to cert-rotator - which I do not know at all; we use cert-manager. On request, Adrian Ludwin took a look at my issues, and somehow this seems to work for him.
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/invalid-commit-message label Mar 6, 2022
@adrianludwin (Contributor) left a comment

/lgtm
/approve

I'm comfortable with this given my testing results.

@adrianludwin (Contributor)

Checking with @rjbez17 to see if he wants to review, otherwise I'll remove the hold later today.

@k8s-ci-robot k8s-ci-robot added the lgtm label Mar 16, 2022
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: adrianludwin, erikgb

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved label Mar 16, 2022
@adrianludwin (Contributor)

Ryan appears to be away so let's just put this in.

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold label Mar 16, 2022
@k8s-ci-robot k8s-ci-robot merged commit 7a913f1 into kubernetes-sigs:master Mar 16, 2022
@adrianludwin adrianludwin added this to the release-v1.0 milestone Mar 29, 2022

Successfully merging this pull request may close these issues.

Add health and readiness checks to hnc-controller-manager
4 participants