Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

🐛 Only refresh bootstrap token if needed, requeue in all cases where node hasn't joined yet #9229

Conversation

AndiDog
Copy link
Contributor

@AndiDog AndiDog commented Aug 17, 2023

What this PR does / why we need it:

Right now, the bootstrap token is refreshed every time until nodes have joined. This creates unnecessary etcd writes. Only extend the expiry timestamp (called "refresh" in code) if we're getting a bit closer to reaching it. Particularly if KubeadmConfig reconciliation is triggered from other sources than CAPA's RequeueAfter interval, this can pile up to quite noisy logs and API accesses.

Along the way, I did some other improvements and will add inline comments to explain.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):

This was a drive-by along the way of fixing giantswarm/roadmap#2217, but isn't really related.

/area provider/bootstrap-kubeadm
/area testing

@k8s-ci-robot k8s-ci-robot added area/provider/bootstrap-kubeadm Issues or PRs related to CAPBK area/testing Issues or PRs related to testing cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Aug 17, 2023
@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Aug 17, 2023
@k8s-ci-robot
Copy link
Contributor

Hi @AndiDog. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Aug 17, 2023
@sbueringer
Copy link
Member

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Aug 17, 2023
return ctrl.Result{}, errors.Wrapf(err, "can't parse expiration time of bootstrap token")
}

if expiration.After(time.Now().UTC().Add(r.TokenTTL * 5 / 6)) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Refresh if 1/6-th of the TTL has passed. We can still change this, but since RequeueAfter is now consistently r.TokenTTL / 3 (which I consider a sane value even for intermittent errors), the next regular reconciliation would refresh.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's add a code comment here that explains this

Since this value needs to be tightly coupled with tokenCheckRefreshOrRotationInterval, we might want to either define them in the same place in the code or make them depend on each other so we don't inadvertently change one but not the other in the future.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a helper function so this is found next to func tokenCheckRefreshOrRotationInterval, and added comments how both are related

return ctrl.Result{}, errors.Wrapf(err, "failed to refresh bootstrap token")
}
return ctrl.Result{
RequeueAfter: r.TokenTTL / 2,
RequeueAfter: r.TokenTTL / 3,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consistency. Also, if we only reconcile twice within the TTL, the first reconciliation at halftime (TTL / 2) must not fail. Better give more chances if there are intermittent errors.

@@ -554,6 +584,9 @@ func (r *KubeadmConfigReconciler) joinWorker(ctx context.Context, scope *Scope)
if res, err := r.reconcileDiscovery(ctx, scope.Cluster, scope.Config, certificates); err != nil {
return ctrl.Result{}, err
} else if !res.IsZero() {
if res.RequeueAfter == 0 {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This and the below code line change were IMO missing cases where the reconciler returned that no requeueing is needed, but we need that while nodes haven't joined since the token may expire. It's covered in a changed test, so please double-check if that makes sense.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this change is necessary.
reconcileDiscovery takes care of creating the token, it is not linked to the refresh token process

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It creates the bootstrap token (via the function createToken) and sets expiry like this:

bootstrapapi.BootstrapTokenExpirationKey:       []byte(time.Now().UTC().Add(ttl).Format(time.RFC3339)),

So as soon as that happened, we need to periodically refresh. But since we check for !res.IsZero() (= "does the result requeue?"), I also think this change doesn't really make sense at this location. Removing it.

@@ -634,7 +667,7 @@ func (r *KubeadmConfigReconciler) joinWorker(ctx context.Context, scope *Scope)
scope.Error(err, "Failed to store bootstrap data")
return ctrl.Result{}, err
}
return ctrl.Result{}, nil
return ctrl.Result{RequeueAfter: r.TokenTTL / 3}, nil
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we are changing this here, please add a comment explaining why we are requeing, eg.
// joinWorker completed, ensuring to reconcile this object again so we keep refreshing the bootstrap token until it is consumed

Also the same change should be applied to joinControlPlane

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both done 👌

@@ -739,7 +771,8 @@ func (r *KubeadmConfigReconciler) joinControlplane(ctx context.Context, scope *S
return ctrl.Result{}, err
}

return ctrl.Result{}, nil
// Ensure reconciling this object again so we keep refreshing the bootstrap token until it is consumed
return ctrl.Result{RequeueAfter: r.TokenTTL / 3}, nil
Copy link
Member

@enxebre enxebre Aug 22, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: given this is using r.TokenTTL / 3 in multiple places I'd put it in a var e.g tokenRequeueAfter that is invoked everywhere so there's no chance to skew in future changes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@AndiDog
Copy link
Contributor Author

AndiDog commented Aug 23, 2023

This API diff was detected as test error:

NewTestClusterCacheTracker: changed from func(github.com/go-logr/logr.Logger, sigs.k8s.io/controller-runtime/pkg/client.Client, *k8s.io/apimachinery/pkg/runtime.Scheme, k8s.io/apimachinery/pkg/types.NamespacedName, ...string) *ClusterCacheTracker to func(github.com/go-logr/logr.Logger, sigs.k8s.io/controller-runtime/pkg/client.Client, sigs.k8s.io/controller-runtime/pkg/client.Client, *k8s.io/apimachinery/pkg/runtime.Scheme, k8s.io/apimachinery/pkg/types.NamespacedName, ...string) *ClusterCacheTracker

What do I need to do here? I changed the function signature on purpose.

@AndiDog
Copy link
Contributor Author

AndiDog commented Sep 4, 2023

@enxebre @fabriziopandini The requested changes are included. We found in chat that the apidiff test error is not mandatory, so I think we should go ahead since only a test function signature was changed. If that change makes sense, please have a look since I think this PR is ready now.

@fabriziopandini
Copy link
Member

changes looks good to me, but it would be great if someone else give a look at this

/test pull-cluster-api-e2e-full-main

@AndiDog
Copy link
Contributor Author

AndiDog commented Sep 13, 2023

changes looks good to me, but it would be great if someone else give a look at this

@enxebre Would you like to have a closer look?

@AndiDog
Copy link
Contributor Author

AndiDog commented Sep 27, 2023

@enxebre @JoelSpeed could you review this one? Thank you.

@neolit123
Copy link
Member

neolit123 commented Oct 12, 2023

security people might not agree with keeping bootstrap tokens for longer periods of time
kubernetes/kubernetes#119639

if CAPI is keeping them as a maximum of 24h, sgtm.

Copy link
Contributor

@CecileRobertMichon CecileRobertMichon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm overall, a few minor comments


if expiration.After(time.Now().UTC().Add(r.TokenTTL * 5 / 6)) {
// We still have lots of time until it expires
log.Info("Token needs no refresh")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would it make sense to log how much time is left before expiration, maybe as a higher verbosity for debugging purposes?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to V(2) and added tokenExpiresInSeconds log field

return ctrl.Result{}, errors.Wrapf(err, "can't parse expiration time of bootstrap token")
}

if expiration.After(time.Now().UTC().Add(r.TokenTTL * 5 / 6)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's add a code comment here that explains this

Since this value needs to be tightly coupled with tokenCheckRefreshOrRotationInterval, we might want to either define them in the same place in the code or make them depend on each other so we don't inadvertently change one but not the other in the future.

@AndiDog
Copy link
Contributor Author

AndiDog commented Oct 16, 2023

security people might not agree with keeping bootstrap tokens for longer periods of time kubernetes/kubernetes#119639

if CAPI is keeping them as a maximum of 24h, sgtm.

We have refresh (token does not expire while the node didn't join) vs. rotation (machine pools only – keep creating a new token so that new nodes can start up at any time). There's no time limit for the refresh case, though. If we think this is problematic, we should tackle it in a separate issue. My PR only fixes the constant refreshing (= changing expiration value on the Secret) on every reconciliation, which is really just a minor bug and not changing the security stance, I think.

Copy link
Contributor Author

@AndiDog AndiDog left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I addressed the review comments and also tested the change manually.

return ctrl.Result{}, errors.Wrapf(err, "can't parse expiration time of bootstrap token")
}

if expiration.After(time.Now().UTC().Add(r.TokenTTL * 5 / 6)) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a helper function so this is found next to func tokenCheckRefreshOrRotationInterval, and added comments how both are related


if expiration.After(time.Now().UTC().Add(r.TokenTTL * 5 / 6)) {
// We still have lots of time until it expires
log.Info("Token needs no refresh")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to V(2) and added tokenExpiresInSeconds log field

Copy link
Contributor

@CecileRobertMichon CecileRobertMichon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
/assign @fabriziopandini @enxebre

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Dec 14, 2023
@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

Git tree hash: b8685357062becbf5762a0b9443cf49999eb7c98


secretExpiration := bootstrapsecretutil.GetData(secret, bootstrapapi.BootstrapTokenExpirationKey)
if secretExpiration == "" {
log.Info(fmt.Sprintf("Token has no valid value for %s, writing new expiration timestamp", bootstrapapi.BootstrapTokenExpirationKey))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In which case would the expiration be empty?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It shouldn't be, as of CAPA code. However external operators and humans interact with Kubernetes as well, so this is regular error handling code for the mere theoretical possibility of this happening.

now := time.Now().UTC()
skipTokenRefreshIfExpiringAfter := now.Add(r.skipTokenRefreshIfExpiringAfter())
if expiration.After(skipTokenRefreshIfExpiringAfter) {
// We still have lots of time until it expires
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment says we have "lots of time", although the check above only checks that the expiration is within the window

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"lots" seems misleading indeed. It's rather "enough time as of call to skipTokenRefreshIfExpiringAfter()". The log message provides enough code explanation here, so I removed the comment.

// skipTokenRefreshIfExpiringAfter returns a duration from "now". If the token's expiry timestamp is after
// `now + skipTokenRefreshIfExpiringAfter()`, it does not yet need a refresh.
func (r *KubeadmConfigReconciler) skipTokenRefreshIfExpiringAfter() time.Duration {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was the name/comment a leftover? The function just returns a duration, rather than a duration from now

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

@k8s-ci-robot k8s-ci-robot added needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. and removed lgtm "Looks good to me", indicates that a PR is ready to be merged. labels Jan 12, 2024
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 15, 2024
@AndiDog
Copy link
Contributor Author

AndiDog commented Jan 15, 2024

@vincepri Please have a look. I've pushed merge+changes so you can still see in GitHub what exactly has changed (hopefully). I can squash to a single commit once all looks okay.

@AndiDog AndiDog requested a review from vincepri January 15, 2024 09:10
@vincepri
Copy link
Member

ok good to squash for me!

@AndiDog AndiDog force-pushed the refresh-bootstrap-token-secret-only-if-needed branch from 563213a to 80cc163 Compare January 15, 2024 17:07
Copy link
Member

@vincepri vincepri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM
@sbueringer do you want to give it a pass?

@sbueringer
Copy link
Member

Yup would be good. I hope I get around to it in the next few days

Copy link
Member

@sbueringer sbueringer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me.

But not exactly an expert on this part of our code

/assign @fabriziopandini

@AndiDog AndiDog force-pushed the refresh-bootstrap-token-secret-only-if-needed branch from b620426 to 9529aea Compare January 31, 2024 14:06
@AndiDog
Copy link
Contributor Author

AndiDog commented Jan 31, 2024

@sbueringer I applied both suggestions.

@k8s-ci-robot
Copy link
Contributor

@AndiDog: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-cluster-api-apidiff-main 9529aea link false /test pull-cluster-api-apidiff-main

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@sbueringer
Copy link
Member

Thx!

/lgtm

/assign @fabriziopandini

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 31, 2024
@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

Git tree hash: 814237ec9de63ed8a38fa634944394d546e63898

@sbueringer
Copy link
Member

/test pull-cluster-api-e2e-main

Copy link
Member

@vincepri vincepri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/approve

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: vincepri

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 12, 2024
@k8s-ci-robot k8s-ci-robot merged commit e573819 into kubernetes-sigs:main Feb 12, 2024
19 of 21 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v1.7 milestone Feb 12, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/provider/bootstrap-kubeadm Issues or PRs related to CAPBK area/testing Issues or PRs related to testing cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

8 participants