🌱 Allow Machines in unreachable Clusters to do initial reconciliation #7719

Conversation

@killianmuldoon (Contributor) commented on Dec 9, 2022

Signed-off-by: killianmuldoon kmuldoon@vmware.com

This change ignores errors in Machine reconciliation when the ClusterCacheTracker is not able to provide a client. This allows additional parts of the reconciliation - setting ownerReferences and reconciling external Bootstrap and Infrastructure objects - to be completed before failing when the Machine controller cannot get a lock on the ClusterAccessor or the Cluster is otherwise unreachable.
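
Below is a minimal sketch of the kind of handling described, assuming the ErrClusterLocked sentinel and the GetClient method exposed by the controllers/remote package; it is an illustration of the pattern, not the merged diff:

package machine // hypothetical package name for illustration

import (
	"context"
	"errors"

	ctrl "sigs.k8s.io/controller-runtime"

	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/cluster-api/controllers/remote"
	"sigs.k8s.io/cluster-api/util"
)

// Reconciler is trimmed down to the single field this sketch needs.
type Reconciler struct {
	Tracker *remote.ClusterCacheTracker
}

func (r *Reconciler) reconcileNode(ctx context.Context, cluster *clusterv1.Cluster, machine *clusterv1.Machine) (ctrl.Result, error) {
	log := ctrl.LoggerFrom(ctx)

	remoteClient, err := r.Tracker.GetClient(ctx, util.ObjectKey(cluster))
	if err != nil {
		if errors.Is(err, remote.ErrClusterLocked) {
			// Another worker is still creating the clusterAccessor for this Cluster;
			// requeue instead of failing, so the parts of the reconcile that already
			// ran (ownerReferences, Bootstrap/Infrastructure objects) are not reported
			// as an error.
			log.V(5).Info("Requeueing because another worker has the lock on the ClusterCacheTracker")
			return ctrl.Result{Requeue: true}, nil
		}
		return ctrl.Result{}, err
	}

	// ... use remoteClient to look up the Node in the workload cluster ...
	_ = remoteClient
	return ctrl.Result{}, nil
}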

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Dec 9, 2022
@k8s-ci-robot k8s-ci-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Dec 9, 2022
@killianmuldoon killianmuldoon force-pushed the pr-reconcile-uncontactable-machines branch from cb7da5e to 99dad5f Compare December 9, 2022 13:39
@k8s-ci-robot k8s-ci-robot added size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Dec 9, 2022
@@ -44,6 +44,11 @@ var (
func (r *Reconciler) reconcileNode(ctx context.Context, cluster *clusterv1.Cluster, machine *clusterv1.Machine) (ctrl.Result, error) {
	log := ctrl.LoggerFrom(ctx)

	// Create a watch on the nodes in the Cluster.
	if err := r.watchClusterNodes(ctx, cluster); err != nil {
@killianmuldoon (Contributor Author) commented on this diff:

@fabriziopandini I think we should be alright to add the tracker where it's needed instead of ignoring the error earlier in the reconcile.

Alternatively we could ignore the error but check in this function if the watch is actually set and exit early if not.

@fabriziopandini (Member) replied:

I don't have strong objections to this specific change.

However, as a follow-up, I think we should think more holistically about how failures to get a client from the tracker are handled, because currently every reconciler that tries to get an accessor that is already locked goes into exponential backoff, and this has two downsides:

  • we have a spike of errors on the logs
  • exponential backoff grows fast, quickly going over the resync period.

I can see two options here:

  • a quick win, which is requeueing after the timeout for creating a new accessor
  • a better solution, where the cluster tracker enqueues events for the reconcilers when a new accessor is created.

@sbueringer opinions?

@sbueringer (Member) replied:

> we have a spike of errors on the logs

I think the exponential backoff leads to the reverse, fewer log messages. If we reconcile more often we get more log messages (btw, we log them as info today at log level 5, not as an error).

> exponential backoff grows fast, quickly going over the resync period.

But if the exponential backoff is higher than the resync period we still get a reconcile after the resync period, right?
Usually exponential backoffs have a maximum; what is the maximum in controller-runtime?

> a quick win, which is requeueing after the timeout for creating a new accessor

Just to clarify: instead of ctrl.Result{Requeue: true} (which leads to the exponential backoff) we would return ctrl.Result{Requeue: true, RequeueAfter: 10 * time.Second} (I think 10s is our timeout) when we get the ErrClusterLocked error?

Sounds fine to me. I'm personally not concerned about getting that info message every 10 seconds on log level 5 during the time when the accessor is (usually) initially created.

> a better solution, where the cluster tracker enqueues events for the reconcilers when a new accessor is created.

This means the cluster tracker has to keep track of which reconciler asked for an accessor for a specific cluster while reconciling a certain object, and then trigger reconciles for all those objects once the accessor has been created?

To be honest, I would go with reconciling more often. This seems too complex to me if the main benefit is fewer log messages at log level 5. (It might be that I'm missing something.)
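
To make the two Result shapes being compared concrete, here is a small hedged sketch (same ErrClusterLocked assumption as the sketch after the PR description; the helper name is hypothetical and this is not code from this PR):

package example // illustration only

import (
	"errors"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"

	"sigs.k8s.io/cluster-api/controllers/remote"
)

// resultForTrackerError shows the "quick win" option from the discussion:
// translate ErrClusterLocked into a fixed RequeueAfter delay instead of
// returning the error (or a bare Requeue: true), both of which hand the
// object back to the workqueue's exponential-backoff rate limiter.
func resultForTrackerError(err error) (ctrl.Result, error) {
	if errors.Is(err, remote.ErrClusterLocked) {
		// Roughly the timeout used when creating a new accessor.
		return ctrl.Result{RequeueAfter: 10 * time.Second}, nil
	}
	return ctrl.Result{}, err
}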

A Member replied:

The change in this PR looks good to me

@killianmuldoon (Contributor Author) replied:

The default max exponential backoff in controller-runtime is ~16 minutes 40 seconds (the 1000s cap in the default workqueue rate limiter), and we currently use that default across our controllers. I'm going to open an issue that points to this thread so we can discuss this in a more visible place.

I do think we should be concerned with how often we return errors/requeues and fall back to exponential backoff, weighing total time for operations to complete against stability. There may be cases where we want to err on the side of less or more frequent reconciles depending on the operation.
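
For reference, controller-runtime's default per-controller rate limiter is client-go's DefaultControllerRateLimiter, which combines per-item exponential backoff (5ms base, 1000s cap, the ~16m40s mentioned above) with an overall token bucket. The sketch below rebuilds it by hand only to show where those numbers come from; a controller can swap in something else via controller.Options{RateLimiter: ...} when it is constructed.

package example // illustration only

import (
	"time"

	"golang.org/x/time/rate"
	"k8s.io/client-go/util/workqueue"
)

// defaultLikeRateLimiter rebuilds the stock controller rate limiter so the
// backoff cap is visible: per-item exponential backoff from 5ms up to 1000s,
// combined (max-of) with a 10 qps / burst 100 bucket limiter.
func defaultLikeRateLimiter() workqueue.RateLimiter {
	return workqueue.NewMaxOfRateLimiter(
		workqueue.NewItemExponentialFailureRateLimiter(5*time.Millisecond, 1000*time.Second),
		&workqueue.BucketRateLimiter{Limiter: rate.NewLimiter(rate.Limit(10), 100)},
	)
}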

@fabriziopandini (Member) commented:

/hold

@aljoshare (Contributor) commented:

Hi @killianmuldoon, this PR could be related to the problem I discussed with you in Slack. I investigated the problem yesterday, and what we saw is that new Nodes don't get recognized by the capi-controller-manager after the external config becomes invalid (approx. after 15 minutes). The log shows many failed attempts to create a new cluster accessor:

E1213 13:18:19.054658       1 machineset_controller.go:751] "Unable to retrieve Node status" err="failed to create cluster accessor: failed to get lock for cluster: cluster is locked already" controller="machineset" controllerGroup="cluster.x-k8s.io" controllerKind="MachineSet" MachineSet="cluster-*****/workers-persistence-0-66c867b8df" namespace="cluster-*****" name="workers-persistence-0-66c867b8df" reconcileID=fbaacbbc-6657-48f8-8aa2-5199049696f8 MachineDeployment="cluster-******/workers-persistence-0" Cluster="*******" Machine="cluster-******/workers-persistence-0-66c867b8df-krb22" node=""

If we understand it correctly, a failed attempt gets ignored and the corresponding Node will not be reconciled, so maybe this PR will fix our issue as well. One thing we don't understand is the use of TryLock:

// clusterAccessor doesn't exist yet, we might have to initialize one.
// Lock on the cluster to ensure only one clusterAccessor is initialized
// for the cluster at the same time.
// Return an error if another go routine already tries to create a clusterAccessor.
if ok := t.clusterLock.TryLock(cluster); !ok {
	return nil, errors.Wrapf(ErrClusterLocked, "failed to create cluster accessor: failed to get lock for cluster")
}

Could you explain why you decided to try the lock and give up on failure instead of waiting for the lock? We ask ourselves what would happen if the first of multiple concurrent attempts takes longer: then all following attempts would fail, right? Is it possible that this happens on every reconciliation, so that Node reconciliation loops never get a cluster accessor? That would explain our observations, where Machines get stuck in the Provisioned state.

If you need more information (e.g. logs/stack trace), just contact me :-) Thank you very much!
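
For context on that question, the keyed lock in the tracker is a try-lock: it returns immediately instead of blocking a reconcile worker behind a potentially slow accessor initialization, and callers are expected to requeue and come back. A simplified, hypothetical stand-in (not the actual cluster-api implementation) looks like this:

package example // illustration only

import "sync"

// keyedTryLock is a simplified stand-in for the per-cluster lock used when
// creating a clusterAccessor. TryLock never waits: it reports false when another
// goroutine already holds the lock for the same key, which is why concurrent
// attempts fail fast with ErrClusterLocked instead of queueing behind the first.
type keyedTryLock struct {
	mu    sync.Mutex
	locks map[string]struct{}
}

func newKeyedTryLock() *keyedTryLock {
	return &keyedTryLock{locks: map[string]struct{}{}}
}

// TryLock acquires the lock for key if it is free and reports whether it did.
func (k *keyedTryLock) TryLock(key string) bool {
	k.mu.Lock()
	defer k.mu.Unlock()
	if _, held := k.locks[key]; held {
		return false
	}
	k.locks[key] = struct{}{}
	return true
}

// Unlock releases the lock for key.
func (k *keyedTryLock) Unlock(key string) {
	k.mu.Lock()
	defer k.mu.Unlock()
	delete(k.locks, key)
}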

Signed-off-by: killianmuldoon <kmuldoon@vmware.com>
@killianmuldoon killianmuldoon force-pushed the pr-reconcile-uncontactable-machines branch from 99dad5f to 68763b7 Compare January 25, 2023 12:56
@killianmuldoon killianmuldoon changed the title [WIP] 🌱 Allow Machines in unreachable Clusters to do initial reconciliation 🌱 Allow Machines in unreachable Clusters to do initial reconciliation Jan 25, 2023
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 25, 2023
@killianmuldoon (Contributor Author) commented:

/retest

@sbueringer (Member) commented:

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 25, 2023
@k8s-ci-robot (Contributor) commented:

LGTM label has been added.

Git tree hash: 0a22e2a987d8976a934a85fe491d102b7d98af94

@fabriziopandini (Member) commented:

/lgtm
/approve

+1 also to opening an issue about possible improvements in how we handle tracker lock errors

@k8s-ci-robot (Contributor) commented:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: fabriziopandini

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 25, 2023
@k8s-ci-robot k8s-ci-robot merged commit 610b27c into kubernetes-sigs:main Jan 25, 2023
@k8s-ci-robot k8s-ci-robot added this to the v1.4 milestone Jan 25, 2023