
Webhook server could not answer in time #30

Closed
nyikesda opened this issue Aug 12, 2021 · 14 comments · Fixed by #165 or #280

Comments

@nyikesda
Contributor

nyikesda commented Aug 12, 2021

REPRODUCTION

  • cert-manager and its CRDs are deployed
  • the webhook and operator Helm charts are deployed
  • the webhook's and the operator's pods are in a ready state
  • the attached Vertica descriptor is applied with the kubectl command: kubectl -n <namespace> apply -f <path-to-the-attached-file>
  • the following error is raised by the Kubernetes API:
Error from server (InternalError): error when creating "test-upgrade-vertica/test_vertica_oper.yaml": Internal error occurred: failed calling webhook "vverticadb2.kb.io": Post "https://verticadb-webhook-webhook-service.analytical-processing-database-precodereview-1081.svc:443/validate-vertica-com-v1beta1-verticadb?timeout=10s": dial tcp 10.99.146.222:443: connect: connection refused
  • I tried calling the same command repeatedly, and it succeeded on the 3rd attempt (a retry sketch follows below)
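A minimal sketch of that retry loop, for anyone reproducing this (the file path comes from the error message above; the namespace value is a placeholder):

```bash
# Hypothetical retry wrapper around the apply from the repro steps.
NAMESPACE=vertica   # placeholder; use your own namespace
for attempt in 1 2 3 4 5; do
  if kubectl -n "$NAMESPACE" apply -f test-upgrade-vertica/test_vertica_oper.yaml; then
    echo "apply succeeded on attempt $attempt"
    break
  fi
  echo "attempt $attempt failed (webhook not answering yet?), retrying..." >&2
  sleep 2
done
```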

The strange thing was that the webhook received the Vertica descriptor and sent a response (see the logs below). I suspect there was a timeout between the webhook and the Kubernetes API server, because I had to wait at least 1.5 s to get any response, but I do not know how to verify that to give more details. If it was a timeout, it could be caused by wrong connection-pool handling in the webhook or a misconfiguration in the kube-rbac-proxy.

webhook manager container log:

2021-08-11T10:54:35.861Z        DEBUG   controller-runtime.webhook.webhooks     received request        {"webhook": "/mutate-vertica-com-v1beta1-verticadb", "UID": "896bae9c-6be2-4259-8064-174092b89ce3", "kind": "vertica.com/v1beta1, Kind=VerticaDB", "resource": {"group":"vertica.com","version":"v1beta1","resource":"verticadbs"}}
2021-08-11T10:54:35.862Z        INFO    verticadb-resource      default {"name": "verticadb-upgrade-test"}
2021-08-11T10:54:35.863Z        DEBUG   controller-runtime.webhook.webhooks     wrote response  {"webhook": "/mutate-vertica-com-v1beta1-verticadb", "code": 200, "reason": "", "UID": "896bae9c-6be2-4259-8064-174092b89ce3", "allowed": true}
@spilchen
Collaborator

There is a lag between deploying cert-manager and its ability to hand out certs. The cert-manager docs outline steps to verify that it is operational: https://cert-manager.io/docs/installation/verify/#manual-verification

Can you add this step to your deployment to see if that solves the issue?

We automated this wait in the following script: https://github.com/vertica/vertica-kubernetes/blob/main/scripts/wait-for-cert-manager-ready.sh
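For reference, here is a minimal sketch of the kind of wait such a script performs, based on the manual-verification steps in the linked cert-manager docs (resource names are illustrative; this is not the actual contents of wait-for-cert-manager-ready.sh):

```bash
#!/bin/bash
set -o errexit

# 1. Wait for the cert-manager deployments to become Available.
for deploy in cert-manager cert-manager-webhook cert-manager-cainjector; do
  kubectl wait --for=condition=Available "deployment/$deploy" \
    -n cert-manager --timeout=180s
done

# 2. Prove cert-manager can actually hand out certs by issuing a throwaway
#    self-signed certificate. The apply can itself fail while cert-manager's
#    own webhook is still coming up, so retry it.
until kubectl apply -f - <<EOF
apiVersion: cert-manager.io/v1
kind: Issuer
metadata:
  name: readiness-check-selfsigned
  namespace: default
spec:
  selfSigned: {}
---
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: readiness-check-cert
  namespace: default
spec:
  dnsNames: [readiness-check.example.com]
  secretName: readiness-check-cert-tls
  issuerRef:
    name: readiness-check-selfsigned
EOF
do
  sleep 2
done

kubectl wait --for=condition=Ready certificate/readiness-check-cert \
  -n default --timeout=120s

# 3. Clean up the probe resources.
kubectl delete certificate/readiness-check-cert \
  issuer/readiness-check-selfsigned -n default
kubectl delete secret/readiness-check-cert-tls -n default --ignore-not-found
```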

@nyikesda
Contributor Author

Hi @spilchen,
Sorry for the late response, I was on vacation.
I forgot to mention that I checked those pods as well; cert-manager was in the ready state.
I also checked the generated certificate, and it was injected into both the created MutatingWebhookConfiguration and the ValidatingWebhookConfiguration.
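For the record, here is a quick way to check that injection (the configuration names below are assumptions; list yours first with kubectl get validatingwebhookconfiguration,mutatingwebhookconfiguration):

```bash
# A non-empty caBundle means cert-manager's CA injector populated the config.
kubectl get validatingwebhookconfiguration \
  verticadb-operator-validating-webhook-configuration \
  -o jsonpath='{.webhooks[0].clientConfig.caBundle}' | head -c 40; echo

kubectl get mutatingwebhookconfiguration \
  verticadb-operator-mutating-webhook-configuration \
  -o jsonpath='{.webhooks[0].clientConfig.caBundle}' | head -c 40; echo
```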

@spilchen
Collaborator

How did you run these two steps in your repro?

  • the webhook and operator Helm charts are deployed
  • the webhook's and the operator's pods are in a ready state

@nyikesda
Contributor Author

Attached helm charts:
helm-charts.zip

Deploy steps:

  • helm install vertica-webhook <path-to-the-webhook-folder> --namespace vertica
  • helm install vertica-operator <path-to-the-operator-folder> --namespace vertica

Ready state check:

  • kubectl get pod -n vertica

Based on the output, both the operator and the webhook pods were ready (a scripted version of this check is sketched below).
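A minimal sketch of scripting that readiness check, assuming the charts label their pods (the label selectors below are guesses, not the charts' actual labels):

```bash
# Block until the webhook and operator pods report Ready.
# Adjust the -l selectors to whatever labels the charts actually set.
kubectl wait --for=condition=Ready pod \
  -l app=verticadb-webhook -n vertica --timeout=120s
kubectl wait --for=condition=Ready pod \
  -l app=verticadb-operator -n vertica --timeout=120s
```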

@spilchen
Collaborator

This is an issue with the operator-sdk framework we are using. A PR that went into controller-runtime helps alleviate this (kubernetes-sigs/controller-runtime#1588): it provides a true health check that makes sure the webhook server is up and running. This only went into controller-runtime in July, in the v0.9.3 release -- for comparison, the framework we currently use is on v0.7.2. And a minor fix for it went into the v0.9.6 release. So I'm a bit resistant to moving controller-runtime up to pick this up, as I don't want to destabilize things. Is this a super urgent problem that needs to be fixed?

@nyikesda
Contributor Author

Do you mean that /readyz gives a false-positive response? That could be a serious issue. I have to try some scenarios and I will get back to you.

@spilchen
Collaborator

The /readyz probe just tells you whether the pod is running. It doesn't tell you whether the webhook port is being listened on. The listener is set up shortly after the pod starts; until that happens, there is a timing window in which any incoming webhook request fails.
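One way to observe that window (a hedged sketch; the service name comes from the repro above and the namespace from the deploy steps, so adjust both to your environment):

```bash
# The pod can already report Running/Ready...
kubectl get pod -n vertica

# ...while the webhook port still refuses connections. Probe it directly
# from inside the cluster:
kubectl run readyz-vs-listener --rm -i --restart=Never \
  --image=curlimages/curl -- \
  curl -ksS --max-time 5 \
  https://verticadb-webhook-webhook-service.vertica.svc:443/validate-vertica-com-v1beta1-verticadb
# "connection refused" here, despite a Ready pod, is exactly the window
# that the controller-runtime health check (PR #1588) is meant to close.
```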

@nyikesda
Contributor Author

Hi @spilchen,
I have replayed my scenarios with the latest master version of the vertica operator, and the issue could not be reproduced.
That said, it would be great if the pod reported ready only once the webhook is actually listening.

@nyikesda
Contributor Author

So I would like to keep this issue open for the controller-runtime version bump. It is not urgent.

spilchen added a commit that referenced this issue Sep 27, 2021
The e2e tests hit a failure because we had tried to create a VerticaDB before
the webhook was fully up. This is a known issue (#30) with the webhook. We are
going to work around this for now by adding a wait script when the tests issue
make deploy.
@harisokanovic
Contributor

Hi @spilchen, @nyikesda,

Is it possible to work around this issue in helm?

My use case: Installing a VerticaDB resource in a helm chart. I'd like to install verticadb-operator via a dependency, but doing so seems to trigger this issue. A clean installation fails with the following error:

Error: failed to create resource: Internal error occurred: failed calling webhook "mverticadb.kb.io": Post "https://verticadb-operator-webhook-service.nopvertica.svc:443/mutate-vertica-com-v1beta1-verticadb?timeout=10s": no endpoints available for service "verticadb-operator-webhook-service"

@spilchen
Collaborator

spilchen commented Mar 4, 2022

It isn't Helm based, but we have a script that works around this issue, which we use in our development environment (scripts/wait-for-webhook.sh).

However, we are in the process of upgrading the Go packages in #165. This will bring in a new controller-runtime that properly implements a health check and should resolve this issue.
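For reference, a minimal sketch of the kind of wait loop such a script performs (this is not the actual contents of scripts/wait-for-webhook.sh; the service name and namespace are assumptions taken from the error messages in this thread):

```bash
#!/bin/bash
SVC=verticadb-operator-webhook-service
NS=vertica

# Phase 1: wait until the webhook service has endpoints at all
# (cures the "no endpoints available for service" error above).
until kubectl get endpoints "$SVC" -n "$NS" \
    -o jsonpath='{.subsets[*].addresses[*].ip}' 2>/dev/null | grep -q .; do
  sleep 1
done

# Phase 2: wait until the TLS listener actually answers; a pod can be
# Ready before the webhook port is listening (the timing window
# described earlier in this thread).
until kubectl run webhook-wait-probe -i --rm --restart=Never \
    --image=curlimages/curl -- \
    curl -ksS --max-time 2 -o /dev/null \
    "https://$SVC.$NS.svc:443/mutate-vertica-com-v1beta1-verticadb"; do
  sleep 1
done

echo "webhook is answering"
```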

@harisokanovic
Contributor

Thanks. I already tried polling the webhook in a pre-install job. However, the way Helm merges chart dependencies causes my job to run before any operator resources are deployed, so it stalls until the timeout.

@spilchen
Collaborator

spilchen commented Mar 9, 2022

We are still seeing cases where helm --wait returns but the webhook still isn't fully ready. I'm reopening this issue so that we can investigate further. The scripted wait was added back in #169.

@spilchen spilchen reopened this Mar 9, 2022
@harisokanovic
Contributor

I see that as well. Our current solution is to run the aforementioned pre-install job in our chart and to install both charts sequentially from Terraform instead of using Helm dependencies.
