Getting error "tls.crt: no such file or directory" when building HNC from main branch #158

jvaibhav123 · 2022-03-23T09:54:39Z

Hi,

Environment details:
Rancher: 2.6.2
k8s cluster: EKS
k8s version: v1.21.5

We built the manifests and image from source on the main branch, which succeeded. However, when we deployed the manifests, we got the following error:

"msg":"problem running manager","error":"open /tmp/k8s-webhook-server/serving-certs/tls.crt: no such file or directory","stacktrace":"runtime.main\n\t/usr/local/go/src/runtime/proc.go:255"

A similar issue was mentioned in #139 (comment).

We would like to use HNC's label and annotation propagation feature, so we need this error fixed. It would be great if you could explain how to set up the certificates using cert-manager. We do have cert-manager installed in our environment.

Thanks in advance.

erikgb · 2022-03-23T20:04:34Z

I was experiencing the same issue when working on #139, and I am now unable to get the HNC controller pod ready to run e2e-tests after it got merged. 😒 @adrianludwin, maybe #139 should be reverted? 🤔 At least until we are able to find the root cause for this issue. I find it hard to believe that the k8s distro is to blame for this, but I am also using a Rancher-based distro (k3s 1.23).

IMO we should have better support for cert-manager in HNC, preferably have cert-manager as the default - with an opt-in using cert-controller. It requires an extra component in your cluster, which cert-controller doesn't, but cert-manager is like "bread and butter" for k8s nowadays....

adrianludwin · 2022-03-24T01:08:03Z

Ok, I'm not certain what's going on but let's revert #139. And we can try again later.

cert-controller is easy to disable by removing --enable-internal-cert-management from the manifest. cert-manager used to be supported by default but it's now commented out here. Perhaps we could find some way to build both manifests as part of our regular release?

adrianludwin · 2022-03-24T01:37:10Z

I'm also seeing that the postsubmits and periodics have been failing since #139 was merged on March 16. So weird that I'm not seeing the problem on GKE :( https://testgrid.k8s.io/wg-multi-tenancy-hnc#periodic-e2e-tests&width=90

adrianludwin · 2022-03-24T01:48:32Z

Ok, I've finally reproduced this problem on GKE. It was pretty dumb of me not to realize: if HNC is fully uninstalled, I can recreate the problem, but if the Secret from a prior installation is still there, everything's able to start up successfully.

I'm going to see if I can fix this before reverting. I suspect it has something to do with the start order.

adrianludwin · 2022-03-24T03:34:39Z

/cc @erikgb

Ok, I finally understand the problem. The new setupChecks() function registers the health checks by calling mgr.GetWebhookServer().StartedChecker(), but unfortunately GetWebhookServer() actually starts a webhook server if one wasn't started already.

Before this change, HNC carefully did not create any webhooks until after the cert controller finished creating the secret that actually contains the certs, but this change breaks that assumption and tries to start the webhook server immediately simply to run the health probes. However, since the secret doesn't exist, the server fails to start, killing the entire process.

Currently thinking about how to solve this.
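
The failure mode described in this comment can be modeled with a small, self-contained Go sketch. The `fakeManager` and `fakeServer` types are hypothetical stand-ins invented for illustration and are not the controller-runtime types. The sketch assumes only what the comment states: asking for the webhook server brings one into existence, and that server cannot start without the certs.

```go
package main

import (
	"errors"
	"fmt"
)

// fakeServer is a hypothetical placeholder for a webhook server.
type fakeServer struct{}

// fakeManager is a hypothetical stand-in for the controller manager.
// It is NOT the controller-runtime type; it only models the side effect
// described in the comment above.
type fakeManager struct {
	certsExist bool        // true once the Secret holding tls.crt exists
	server     *fakeServer // nil until something asks for the webhook server
}

// GetWebhookServer looks like a plain getter, but it creates the server
// on first use. That hidden side effect is the root of the problem.
func (m *fakeManager) GetWebhookServer() *fakeServer {
	if m.server == nil {
		m.server = &fakeServer{}
	}
	return m.server
}

// Start fails if a webhook server was created but the certs are missing,
// mirroring the "tls.crt: no such file or directory" crash.
func (m *fakeManager) Start() error {
	if m.server != nil && !m.certsExist {
		return errors.New("open tls.crt: no such file or directory")
	}
	return nil
}

func main() {
	// Nothing touched the webhook server, so startup succeeds on a fresh cluster.
	untouched := &fakeManager{}
	fmt.Println("no getter call:", untouched.Start())

	// Registering probes calls the getter, so startup now needs the certs.
	probed := &fakeManager{}
	_ = probed.GetWebhookServer()
	fmt.Println("getter called for probes:", probed.Start())
}
```

In this model a manager with `certsExist: true` starts cleanly even after the getter is called, which matches the observation that a cluster with a leftover Secret did not show the problem.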

adrianludwin · 2022-03-24T03:52:52Z

This is easy to fix since we have access to certsReady which tells us if it's safe to start the webhook server. PR ready shortly.

See issue kubernetes-sigs#158. If the Secret with the certs doesn't exist, starting the webhook server will fail and HNC will exit. Before we configured the probes, we carefully didn't start the webhook server until after the certs were ready, but as a side effect of creating the probe functions, we accidentally started the webhook server before the certs were generated. This worked fine on a cluster that already had the certs (like the one I was testing on) but failed for everyone else (oops).

The fix is simply to use a non-default checker that knows whether the certs have been generated, and avoids accidentally starting the webhook server if the certs don't exist yet.

Tested: manually on a cluster without the Secret. Without this change, I can see the error from the webhook server complaining that the certs don't exist; with this change, I can see that in HNC's first invocation, we get some "healthz check failed" messages, but the cert controller runs, generates the certs, and restarts HNC. The second invocation works just fine.
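
The idea behind the fix can be sketched as a thin wrapper around the health check. This is a minimal illustration using only the standard library, not the code merged in #160. The `Checker` type is assumed to mirror the usual health-check signature (a nil error means healthy), and `certAwareChecker`, `okChecker`, and the `getChecker` callback are hypothetical names introduced here. In HNC the callback would correspond to obtaining `mgr.GetWebhookServer().StartedChecker()`, and `certsReady` is assumed to be a channel that is closed once the certs exist.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// Checker is assumed to mirror a health-check function: nil means healthy.
type Checker func(req *http.Request) error

// certAwareChecker returns a Checker that reports "not ready" until
// certsReady is closed. Only after that does it call getChecker, so the
// webhook server is never touched while the certs are still missing.
func certAwareChecker(certsReady <-chan struct{}, getChecker func() Checker) Checker {
	return func(req *http.Request) error {
		select {
		case <-certsReady:
			// Certs exist, so it is safe to ask for the real checker.
			return getChecker()(req)
		default:
			// Certs are not generated yet: fail the probe, but do not
			// start (or even look up) the webhook server.
			return errors.New("certs are not ready yet")
		}
	}
}

// okChecker is a hypothetical stand-in for the webhook server's own
// started-checker; it always reports healthy.
func okChecker(_ *http.Request) error { return nil }

func main() {
	certsReady := make(chan struct{})
	check := certAwareChecker(certsReady, func() Checker { return okChecker })

	// Before the cert controller finishes, the probe fails harmlessly.
	fmt.Println("before certs:", check(nil))

	// Closing the channel signals that the certs have been generated.
	close(certsReady)
	fmt.Println("after certs:", check(nil))
}
```

The non-blocking `select` is what keeps the probe cheap: it never waits for the certs, it just reports the current state on each call, which is consistent with the "healthz check failed" messages seen during the first invocation.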

adrianludwin mentioned this issue Mar 24, 2022

Fix probes when installing on a fresh cluster #160

Merged

k8s-ci-robot closed this as completed in #160 Mar 24, 2022
