Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix probes when installing on a fresh cluster #160

Merged
merged 1 commit into from
Mar 24, 2022

Conversation

adrianludwin
Copy link
Contributor

@adrianludwin adrianludwin commented Mar 24, 2022

Fixes #158

If the Secret with the certs doesn't exist, starting the
webhook server will fail and HNC will exit. Before we configured the
probes, we carefully didn't start the webhook server until after the
certs were ready, but as a side effect of creating the probe functions,
we accidentally started the webhook server before the certs were
generated. This worked fine on a cluster that already had the certs
(like the one I was testing on) but failed for everyone else (oops).

The fix is simply to use a non-default checker that knows whether the
certs have been generated, and avoids accidentally starting the webhook
server if the certs don't exist yet.

Tested: manually on a cluster without the Secret. Without this change, I
can see the error from the webhook server complaining that the certs
don't exist; with this change, I can see that that in HNC's first
invocation, we get some "healthz check failed" messages, but the cert
controller runs, generates the certs, and restarts HNC. The second
invocation works just fine.

See issue kubernetes-sigs#158. If the Secret with the certs doesn't exist, starting the
webhook server will fail and HNC will exit. Before we configured the
probes, we carefully didn't start the webhook server until after the
certs were ready, but as a side effect of creating the probe functions,
we accidentally started the webhook server before the certs were
generated. This worked fine on a cluster that already had the certs
(like the one I was testing on) but failed for everyone else (oops).

The fix is simply to use a non-default checker that knows whether the
certs have been generated, and avoids accidentally starting the webhook
server if the certs don't exist yet.

Tested: manually on a cluster without the Secret. Without this change, I
can see the error from the webhook server complaining that the certs
don't exist; with this change, I can see that that in HNC's first
invocation, we get some "healthz check failed" messages, but the cert
controller runs, generates the certs, and restarts HNC. The second
invocation works just fine.
@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Mar 24, 2022
@adrianludwin adrianludwin added this to the release-v1.0 milestone Mar 24, 2022
@k8s-ci-robot k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Mar 24, 2022
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: adrianludwin

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot requested review from srampal and tashimi March 24, 2022 03:59
@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 24, 2022
@adrianludwin
Copy link
Contributor Author

/assign @rjbez17
/cc @erikgb

@erikgb
Copy link
Contributor

erikgb commented Mar 24, 2022

While this solution seems ok, I think the canonical solution would be to start with a dummy certificate that later will be replaced by the certificate-provider. Just to solve the bootstrapping problem. That is how I would solve this using cert-manager. But I also think certificates should be provided to the application, and not provided by the application.

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 24, 2022
@k8s-ci-robot k8s-ci-robot merged commit 921b678 into kubernetes-sigs:master Mar 24, 2022
@erikgb
Copy link
Contributor

erikgb commented Mar 24, 2022

Oooops, I didn't think my lgtm would merge this. Sorry! I should have added a hold to get some discussions around my comment.

@rjbez17
Copy link
Contributor

rjbez17 commented Mar 24, 2022

I wonder if it may be worthwhile to reconsider removing the cert creation logic altogether from startup? Just have a dependency that certs are provided. We take in a secret and k8s will prevent us from starting until the secret exists. Then we can have a separate (optional) startup job that does any and all cert creation logic or leave it to users to not run the job if they have their own solution (cert manager). I know we discussed this in the past but I don't remember why we didn't like that approach? To handle cert rotations the job could be a cron job for self signed and cert manager handles for others. We just need a watcher on the mounted secret file to reload HNC on change.

This would guarantee we have a cert at startup.

@adrianludwin adrianludwin deleted the fix-probes branch March 24, 2022 14:25
@adrianludwin
Copy link
Contributor Author

adrianludwin commented Mar 24, 2022 via email

@erikgb
Copy link
Contributor

erikgb commented Mar 24, 2022

I agree with @rjbez17. The current approach will complicate the code more over time, and put HNC further away from controller-runtime and kubebuilder. I'll vote for cert-manager as the default - as suggested by kubebuilder in the skaffolding process. And eventually an opt-in for cert-rotator, that could be used in tests - with some test adjustments. Or just use a self-signed cert in the tests and drop the cert-rotator.

@adrianludwin
Copy link
Contributor Author

adrianludwin commented Mar 24, 2022 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Getting error "tls.crt: no such file or directory" when building HNC from main branch
4 participants