🌱 Allow Machines in unreachable Clusters to do initial reconciliation #7719

Conversation

@killianmuldoon (Contributor) commented on Dec 9, 2022

Signed-off-by: killianmuldoon kmuldoon@vmware.com

This change ignores errors in Machine reconciliation when the ClusterCacheTracker is not able to provide a client. This allows additional parts of the reconciliation - setting ownerReferences and reconciling external Bootstrap and Infrastructure objects - to be completed before failing when the Machine controller cannot get a lock on the ClusterAccessor or the Cluster is otherwise unreachable.
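
Below is a minimal sketch of the kind of handling described, assuming the ErrClusterLocked sentinel and the GetClient method exposed by the controllers/remote package; it is an illustration of the pattern, not the merged diff:

package machine // hypothetical package name for illustration

import (
	"context"
	"errors"

	ctrl "sigs.k8s.io/controller-runtime"

	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/cluster-api/controllers/remote"
	"sigs.k8s.io/cluster-api/util"
)

// Reconciler is trimmed down to the single field this sketch needs.
type Reconciler struct {
	Tracker *remote.ClusterCacheTracker
}

func (r *Reconciler) reconcileNode(ctx context.Context, cluster *clusterv1.Cluster, machine *clusterv1.Machine) (ctrl.Result, error) {
	log := ctrl.LoggerFrom(ctx)

	remoteClient, err := r.Tracker.GetClient(ctx, util.ObjectKey(cluster))
	if err != nil {
		if errors.Is(err, remote.ErrClusterLocked) {
			// Another worker is still creating the clusterAccessor for this Cluster;
			// requeue instead of failing, so the parts of the reconcile that already
			// ran (ownerReferences, Bootstrap/Infrastructure objects) are not reported
			// as an error.
			log.V(5).Info("Requeueing because another worker has the lock on the ClusterCacheTracker")
			return ctrl.Result{Requeue: true}, nil
		}
		return ctrl.Result{}, err
	}

	// ... use remoteClient to look up the Node in the workload cluster ...
	_ = remoteClient
	return ctrl.Result{}, nil
}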

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Dec 9, 2022
@k8s-ci-robot k8s-ci-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Dec 9, 2022
@killianmuldoon killianmuldoon force-pushed the pr-reconcile-uncontactable-machines branch from cb7da5e to 99dad5f Compare December 9, 2022 13:39
@k8s-ci-robot k8s-ci-robot added size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Dec 9, 2022
@@ -44,6 +44,11 @@ var (
func (r *Reconciler) reconcileNode(ctx context.Context, cluster *clusterv1.Cluster, machine *clusterv1.Machine) (ctrl.Result, error) {
	log := ctrl.LoggerFrom(ctx)

	// Create a watch on the nodes in the Cluster.
	if err := r.watchClusterNodes(ctx, cluster); err != nil {
@killianmuldoon (Contributor Author) commented on this diff:

@fabriziopandini I think we should be alright to add the tracker where it's needed instead of ignoring the error earlier in the reconcile.

Alternatively we could ignore the error but check in this function if the watch is actually set and exit early if not.

@fabriziopandini (Member) replied:

I don't have strong objections to this specific change.

However, as a follow-up, I think we should think more holistically about how failures to get a client from the tracker are handled, because currently every reconciler that tries to get an accessor that is already locked goes into exponential backoff, and this has two downsides:

  • we have a spike of errors on the logs
  • exponential backoff grows fast, quickly going over the resync period.

I can see two options here:

  • a quick win, which is requeueing after the timeout for creating a new accessor
  • a better solution, where the cluster tracker enqueues events for the reconcilers when a new accessor is created.

@sbueringer opinions?

@sbueringer (Member) replied:

> we have a spike of errors on the logs

I think the exponential backoff leads to the reverse, fewer log messages. If we reconcile more often we get more log messages (btw, we log them as info today at log level 5, not as an error).

> exponential backoff grows fast, quickly going over the resync period.

But if the exponential backoff is higher than the resync period we still get a reconcile after the resync period, right?
Usually exponential backoffs have a maximum; what is the maximum in controller-runtime?

> a quick win, which is requeueing after the timeout for creating a new accessor

Just to clarify: instead of ctrl.Result{Requeue: true} (which leads to the exponential backoff) we would return ctrl.Result{Requeue: true, RequeueAfter: 10 * time.Second} (I think 10s is our timeout) when we get the ErrClusterLocked error?

Sounds fine to me. I'm personally not concerned about getting that info message every 10 seconds on log level 5 during the time when the accessor is (usually) initially created.

> a better solution, where the cluster tracker enqueues events for the reconcilers when a new accessor is created.

This means the cluster tracker has to keep track of which reconciler asked for an accessor for a specific cluster while reconciling a certain object, and then trigger reconciles for all those objects once the accessor has been created?

To be honest, I would go with reconciling more often. This seems too complex to me if the main benefit is fewer log messages at log level 5. (It might be that I'm missing something.)
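
To make the two Result shapes being compared concrete, here is a small hedged sketch (same ErrClusterLocked assumption as the sketch after the PR description; the helper name is hypothetical and this is not code from this PR):

package example // illustration only

import (
	"errors"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"

	"sigs.k8s.io/cluster-api/controllers/remote"
)

// resultForTrackerError shows the "quick win" option from the discussion:
// translate ErrClusterLocked into a fixed RequeueAfter delay instead of
// returning the error (or a bare Requeue: true), both of which hand the
// object back to the workqueue's exponential-backoff rate limiter.
func resultForTrackerError(err error) (ctrl.Result, error) {
	if errors.Is(err, remote.ErrClusterLocked) {
		// Roughly the timeout used when creating a new accessor.
		return ctrl.Result{RequeueAfter: 10 * time.Second}, nil
	}
	return ctrl.Result{}, err
}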

A Member replied:

The change in this PR looks good to me

@killianmuldoon (Contributor Author) replied:

The default max exponential backoff in controller-runtime is ~16 minutes 40 seconds (the 1000s cap in the default workqueue rate limiter), and we currently use that default across our controllers. I'm going to open an issue that points to this thread so we can discuss this in a more visible place.

I do think we should be concerned with how often we return errors/requeues and fall back to exponential backoff, weighing total time for operations to complete against stability. There may be cases where we want to err on the side of less or more frequent reconciles depending on the operation.
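
For reference, controller-runtime's default per-controller rate limiter is client-go's DefaultControllerRateLimiter, which combines per-item exponential backoff (5ms base, 1000s cap, the ~16m40s mentioned above) with an overall token bucket. The sketch below rebuilds it by hand only to show where those numbers come from; a controller can swap in something else via controller.Options{RateLimiter: ...} when it is constructed.

package example // illustration only

import (
	"time"

	"golang.org/x/time/rate"
	"k8s.io/client-go/util/workqueue"
)

// defaultLikeRateLimiter rebuilds the stock controller rate limiter so the
// backoff cap is visible: per-item exponential backoff from 5ms up to 1000s,
// combined (max-of) with a 10 qps / burst 100 bucket limiter.
func defaultLikeRateLimiter() workqueue.RateLimiter {
	return workqueue.NewMaxOfRateLimiter(
		workqueue.NewItemExponentialFailureRateLimiter(5*time.Millisecond, 1000*time.Second),
		&workqueue.BucketRateLimiter{Limiter: rate.NewLimiter(rate.Limit(10), 100)},
	)
}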

@fabriziopandini (Member) commented:

/hold

@aljoshare (Contributor) commented:

Hi @killianmuldoon, this PR could be related to the problem I discussed with you in Slack. I investigated the problem yesterday, and what we saw is that new Nodes don't get recognized by the capi-controller-manager after the external config becomes invalid (approx. after 15 minutes). The log shows many failed attempts to create a new cluster accessor:

E1213 13:18:19.054658       1 machineset_controller.go:751] "Unable to retrieve Node status" err="failed to create cluster accessor: failed to get lock for cluster: cluster is locked already" controller="machineset" controllerGroup="cluster.x-k8s.io" controllerKind="MachineSet" MachineSet="cluster-*****/workers-persistence-0-66c867b8df" namespace="cluster-*****" name="workers-persistence-0-66c867b8df" reconcileID=fbaacbbc-6657-48f8-8aa2-5199049696f8 MachineDeployment="cluster-******/workers-persistence-0" Cluster="*******" Machine="cluster-******/workers-persistence-0-66c867b8df-krb22" node=""

If we understand it correctly, a failed attempt gets ignored and the corresponding Node will not be reconciled, so maybe this PR will fix our issue as well. One thing we don't understand is the use of TryLock:

// clusterAccessor doesn't exist yet, we might have to initialize one.
// Lock on the cluster to ensure only one clusterAccessor is initialized
// for the cluster at the same time.
// Return an error if another go routine already tries to create a clusterAccessor.
if ok := t.clusterLock.TryLock(cluster); !ok {
	return nil, errors.Wrapf(ErrClusterLocked, "failed to create cluster accessor: failed to get lock for cluster")
}

Could you explain why you decided to try the lock and give up on failure instead of waiting for the lock? We ask ourselves what would happen if the first of multiple concurrent attempts takes longer: then all following attempts would fail, right? Is it possible that this happens on every reconciliation, so that Node reconciliation loops never get a cluster accessor? That would explain our observations, where Machines get stuck in the Provisioned state.

If you need more information (e.g. logs/stack trace), just contact me :-) Thank you very much!
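
For context on that question, the keyed lock in the tracker is a try-lock: it returns immediately instead of blocking a reconcile worker behind a potentially slow accessor initialization, and callers are expected to requeue and come back. A simplified, hypothetical stand-in (not the actual cluster-api implementation) looks like this:

package example // illustration only

import "sync"

// keyedTryLock is a simplified stand-in for the per-cluster lock used when
// creating a clusterAccessor. TryLock never waits: it reports false when another
// goroutine already holds the lock for the same key, which is why concurrent
// attempts fail fast with ErrClusterLocked instead of queueing behind the first.
type keyedTryLock struct {
	mu    sync.Mutex
	locks map[string]struct{}
}

func newKeyedTryLock() *keyedTryLock {
	return &keyedTryLock{locks: map[string]struct{}{}}
}

// TryLock acquires the lock for key if it is free and reports whether it did.
func (k *keyedTryLock) TryLock(key string) bool {
	k.mu.Lock()
	defer k.mu.Unlock()
	if _, held := k.locks[key]; held {
		return false
	}
	k.locks[key] = struct{}{}
	return true
}

// Unlock releases the lock for key.
func (k *keyedTryLock) Unlock(key string) {
	k.mu.Lock()
	defer k.mu.Unlock()
	delete(k.locks, key)
}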

Signed-off-by: killianmuldoon <kmuldoon@vmware.com>
@killianmuldoon killianmuldoon force-pushed the pr-reconcile-uncontactable-machines branch from 99dad5f to 68763b7 Compare January 25, 2023 12:56
@killianmuldoon killianmuldoon changed the title [WIP] 🌱 Allow Machines in unreachable Clusters to do initial reconciliation 🌱 Allow Machines in unreachable Clusters to do initial reconciliation Jan 25, 2023
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 25, 2023
@killianmuldoon (Contributor Author) commented:

/retest

@sbueringer (Member) commented:

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 25, 2023
@k8s-ci-robot (Contributor) commented:

LGTM label has been added.

Git tree hash: 0a22e2a987d8976a934a85fe491d102b7d98af94

@fabriziopandini (Member) commented:

/lgtm
/approve

+1 also to opening an issue about possible improvements in how we handle tracker lock errors

@k8s-ci-robot (Contributor) commented:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: fabriziopandini

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 25, 2023
@k8s-ci-robot k8s-ci-robot merged commit 610b27c into kubernetes-sigs:main Jan 25, 2023
@k8s-ci-robot k8s-ci-robot added this to the v1.4 milestone Jan 25, 2023