Getting error "tls.crt: no such file or directory" when building HNC from main branch #158

Closed
jvaibhav123 opened this issue Mar 23, 2022 · 6 comments · Fixed by #160

Comments

@jvaibhav123

Hi,

Environment details:
Rancher: 2.6.2
k8s cluster: EKS
k8s version: v1.21.5

We tried to generate the manifests and image by building the source from the main branch, and that worked. However, when we deployed the manifests, we got this error:

"msg":"problem running manager","error":"open /tmp/k8s-webhook-server/serving-certs/tls.crt: no such file or directory","stacktrace":"runtime.main\n\t/usr/local/go/src/runtime/proc.go:255" 

A similar issue was mentioned in #139 (comment).

We would like to use HNC's label and annotation propagation feature, so we need to fix the above error. It would be great if you could explain how to set up the certificates using cert-manager. We do have cert-manager installed in our environment.

Thanks in advance.

@erikgb
Contributor

erikgb commented Mar 23, 2022

I was experiencing the same issue when working on #139, and now that it has been merged I am unable to get the HNC controller pod ready to run e2e-tests. 😒 @adrianludwin, maybe #139 should be reverted? 🤔 At least until we are able to find the root cause for this issue. I find it hard to believe that the k8s distro is to blame for this, but I am also using a Rancher-based distro (k3s 1.23).

IMO we should have better support for cert-manager in HNC, preferably with cert-manager as the default and cert-controller as an opt-in. It requires an extra component in your cluster, which cert-controller doesn't, but cert-manager is like "bread and butter" for k8s nowadays.

@adrianludwin
Contributor

Ok, I'm not certain what's going on but let's revert #139. And we can try again later.

cert-controller is easy to disable by removing --enable-internal-cert-management from the manifest. cert-manager used to be supported by default but it's now commented out here. Perhaps we could find some way to build both manifests as part of our regular release?

@adrianludwin
Contributor

I'm also seeing that the postsubmits and periodics have been failing since #139 was merged on March 16. So weird that I'm not seeing the problem on GKE :( https://testgrid.k8s.io/wg-multi-tenancy-hnc#periodic-e2e-tests&width=90

@adrianludwin
Contributor

Ok, I've finally reproduced this problem on GKE. It was pretty dumb of me not to realize: if HNC is fully uninstalled, I can recreate the problem, but if the Secret from a prior installation is still there, everything's able to start up successfully.

I'm going to see if I can fix this before reverting. I suspect it has something to do with the start order.

@adrianludwin
Contributor

/cc @erikgb

Ok, I finally understand the problem. The new setupChecks() function registers the health checks by calling mgr.GetWebhookServer().StartedChecker(), but unfortunately GetWebhookServer() actually starts a webhook server if one wasn't started already.

Before this change, HNC carefully did not create any webhooks until after the cert controller finished creating the secret that actually contains the certs, but this change breaks that assumption and tries to start the webhook server immediately simply to run the health probes. However, since the secret doesn't exist, the server fails to start, killing the entire process.
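
To make the failure mode concrete, here is a minimal, self-contained sketch of a lazily created server that gets started as a side effect of being looked up. Everything in it (`toyManager`, `toyServer`, `startManager`) is a hypothetical stand-in written for illustration using only the Go standard library; it is not the controller-runtime API or HNC's actual code, and it only models the behavior described above.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// toyServer is a hypothetical webhook server that needs tls.crt to start.
type toyServer struct {
	certPath string
}

// start fails if the serving certificate does not exist yet.
func (s *toyServer) start() error {
	if _, err := os.Stat(s.certPath); err != nil {
		return fmt.Errorf("open %s: %w", s.certPath, err)
	}
	return nil
}

// toyManager is a hypothetical stand-in for a controller manager.
type toyManager struct {
	certDir string
	server  *toyServer
}

// GetWebhookServer lazily creates the server on first access. The side effect
// is that merely asking for the server schedules it to be started.
func (m *toyManager) GetWebhookServer() *toyServer {
	if m.server == nil {
		m.server = &toyServer{certPath: filepath.Join(m.certDir, "tls.crt")}
	}
	return m.server
}

// Start starts the webhook server only if something has requested it.
func (m *toyManager) Start() error {
	if m.server == nil {
		return nil // nobody asked for a webhook server, so nothing can fail
	}
	return m.server.start()
}

// startManager models process startup. With registerProbes set, the server is
// looked up just to build a health probe, before any certs exist.
func startManager(certDir string, registerProbes bool) error {
	m := &toyManager{certDir: certDir}
	if registerProbes {
		_ = m.GetWebhookServer()
	}
	return m.Start()
}

func main() {
	// An empty directory plays the role of a cluster without the Secret.
	emptyDir, err := os.MkdirTemp("", "serving-certs")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(emptyDir)

	fmt.Println("without probes:", startManager(emptyDir, false))
	fmt.Println("with probes:   ", startManager(emptyDir, true))
}
```

In this toy model, the run without probes starts cleanly because the server is never created, while the run with probes fails with a missing `tls.crt` error, which mirrors why a cluster that still had the old Secret did not show the problem.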

Currently thinking about how to solve this.

@adrianludwin
Contributor

This is easy to fix since we have access to certsReady which tells us if it's safe to start the webhook server. PR ready shortly.
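
A rough sketch of that idea, for readers following along: a health checker that reports "not ready" until the certs exist, and only then touches the webhook server. This is an illustration under assumptions, not the code from the actual PR. It assumes `certsReady` behaves like a channel that is closed once the certs have been generated, and the names `checker`, `certsReadyChecker`, `alwaysHealthy`, and the `lookupServerCheck` callback are hypothetical. The `checker` type only mirrors the common `func(*http.Request) error` shape of a health check.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// checker mirrors the usual shape of a health check: nil means healthy.
type checker func(req *http.Request) error

// alwaysHealthy is a placeholder for the webhook server's own "started" check.
func alwaysHealthy(_ *http.Request) error { return nil }

// certsReadyChecker returns a checker that fails until certsReady is closed.
// lookupServerCheck is called only after the certs exist, so the webhook
// server is never looked up (and therefore never started) too early.
func certsReadyChecker(certsReady <-chan struct{}, lookupServerCheck func() checker) checker {
	return func(req *http.Request) error {
		select {
		case <-certsReady:
			// Certs exist, so it is now safe to consult the server's check.
			return lookupServerCheck()(req)
		default:
			return errors.New("certs are not yet ready")
		}
	}
}

func main() {
	certsReady := make(chan struct{})
	lookups := 0
	check := certsReadyChecker(certsReady, func() checker {
		lookups++ // counts how often the (hypothetical) server lookup happens
		return alwaysHealthy
	})

	fmt.Println("before certs:", check(nil), "lookups:", lookups)
	close(certsReady) // the cert controller has finished generating the certs
	fmt.Println("after certs: ", check(nil), "lookups:", lookups)
}
```

The non-blocking `select` keeps the probe cheap: before the channel is closed it returns an error immediately instead of waiting, which matches the "healthz check failed" messages expected during the first invocation.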

adrianludwin added a commit to adrianludwin/hierarchical-namespaces that referenced this issue Mar 24, 2022
See issue kubernetes-sigs#158. If the Secret with the certs doesn't exist, starting the
webhook server will fail and HNC will exit. Before we configured the
probes, we carefully didn't start the webhook server until after the
certs were ready, but as a side effect of creating the probe functions,
we accidentally started the webhook server before the certs were
generated. This worked fine on a cluster that already had the certs
(like the one I was testing on) but failed for everyone else (oops).

The fix is simply to use a non-default checker that knows whether the
certs have been generated, and avoids accidentally starting the webhook
server if the certs don't exist yet.

Tested: manually on a cluster without the Secret. Without this change, I
can see the error from the webhook server complaining that the certs
don't exist; with this change, I can see that in HNC's first
invocation, we get some "healthz check failed" messages, but the cert
controller runs, generates the certs, and restarts HNC. The second
invocation works just fine.

3 participants