ClusterResourceSet could take up to 16m40s to reconcile again due to locked ClusterCache Tracker #9775

chrischdi · 2023-11-27T14:45:37Z

What steps did you take and what happened?

I was developing clusterctl upgrade tests for CAPV.
During installation of a workload cluster, the test times out (10m timeout) while waiting for the machines to get a .status.nodeRef:
- https://github.com/kubernetes-sigs/cluster-api/blob/main/test/e2e/clusterctl_upgrade.go#L410-L429
This happens because the ClusterResourceSet that contains the cloud-controller-manager was not rolled out before the timeout was hit.

The ClusterResourceSet controller repeatedly hits this line and returns with ctrl.Result{Requeue: true}:

https://github.com/kubernetes-sigs/cluster-api/blob/main/exp/addons/internal/controllers/clusterresourceset_controller.go#L169

Because the kube-apiserver takes some time to become reachable, this return gets hit multiple times. Because the request ends up in a rate-limited queue, the requeue delay builds up to the maximum of 1000s (16m40s).

This could also happen in other controllers that return ctrl.Result{Requeue: true} when hitting the remote.ErrClusterLocked error.

What did you expect to happen?

The CRS should roll out as soon as the kube-apiserver is available, so that the test can proceed and succeed.

Cluster API version

The test uses v1.6.0-rc.0.
The workload cluster is created with CAPI v1.5.3.

Kubernetes version

v1.28.0

Anything else you would like to add?

Workarounds are:

- reducing the sync-period for cluster-api-controller-manager (which includes the clusterresourceset controller)
- increasing the timeout to 20m for the test

Example of a failed test: https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/kubernetes-sigs_cluster-api-provider-vsphere/2517/pull-cluster-api-provider-vsphere-e2e-main/1727951432931348480

Label(s) to be applied

/kind bug
One or more /area label. See https://github.com/kubernetes-sigs/cluster-api/labels?q=area for the list of labels.


sbueringer · 2023-11-27T14:54:41Z

+1

I think in general it's fine to requeue with a constant interval (e.g. 1m) when we hit this specific lock, instead of going into an exponential backoff.

/triage accepted

/cc @fabriziopandini

sbueringer · 2023-11-27T14:55:34Z

(would be nice to get this fixed for v1.6.0)

chrischdi · 2023-11-27T14:57:41Z

The exponential requeue delay gets added because returning ctrl.Result{Requeue: true} makes controller-runtime call c.Queue.AddRateLimited:

https://github.com/kubernetes-sigs/controller-runtime/blob/main/pkg/internal/controller/controller.go#L341

Using ctrl.Result{RequeueAfter: <duration>} would not lead to the exponential rate-limit backoff.

sbueringer · 2023-12-05T07:54:24Z

@chrischdi Can we please follow up and do the same for all other controllers and CAPV?

chrischdi · 2023-12-05T08:43:42Z

Other places where we return the same:

- cluster-api/bootstrap/kubeadm/internal/controllers/kubeadmconfig_controller.go (line 242 in 47bfc14): return ctrl.Result{Requeue: true}, nil
- cluster-api/controlplane/kubeadm/internal/controllers/controller.go (line 246 in 47bfc14): return ctrl.Result{Requeue: true}, nil
- cluster-api/controlplane/kubeadm/internal/controllers/controller.go (line 257 in 47bfc14): return ctrl.Result{Requeue: true}, nil
- cluster-api/exp/internal/controllers/machinepool_controller.go (line 200 in 47bfc14): return ctrl.Result{Requeue: true}, nil
- cluster-api/exp/internal/controllers/machinepool_controller.go (line 218 in 47bfc14): return ctrl.Result{Requeue: true}, nil
- cluster-api/internal/controllers/machine/machine_controller.go (line 209 in 47bfc14): return ctrl.Result{Requeue: true}, nil
- cluster-api/internal/controllers/machine/machine_controller.go (line 227 in 47bfc14): return ctrl.Result{Requeue: true}, nil
- cluster-api/internal/controllers/machine/machine_controller.go (line 609 in 47bfc14): return ctrl.Result{Requeue: true}, nil
- cluster-api/internal/controllers/machinehealthcheck/machinehealthcheck_controller.go (line 179 in 47bfc14): return ctrl.Result{Requeue: true}, nil
- cluster-api/internal/controllers/machineset/machineset_controller.go (line 187 in 47bfc14): return ctrl.Result{Requeue: true}, nil
- cluster-api/internal/controllers/topology/cluster/cluster_controller.go (line 212 in 47bfc14): return ctrl.Result{Requeue: true}, nil
- cluster-api/test/infrastructure/docker/exp/internal/controllers/dockermachinepool_controller.go (line 161 in 47bfc14): return ctrl.Result{Requeue: true}, nil
- cluster-api/test/infrastructure/docker/internal/controllers/dockermachine_controller.go (line 198 in 47bfc14): return ctrl.Result{Requeue: true}, nil

k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Nov 27, 2023

k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Nov 27, 2023

chrischdi mentioned this issue Nov 27, 2023

🐛 clusterresourceset: requeue after 1 minute if ErrClusterLocked got hit #9777

Merged

k8s-ci-robot closed this as completed in #9777 Nov 28, 2023

chrischdi mentioned this issue Dec 5, 2023

🐛 ClusterCacheTracker: Use RequeueAfter instead of immediate requeue on ErrClusterLocked to not have exponentially increasing requeue time #9810

Merged

chrischdi added a commit to chrischdi/cluster-api-provider-vsphere that referenced this issue Dec 12, 2023

bump capi/test dependency to fix kubernetes-sigs/cluster-api#9775

5faf160

chrischdi mentioned this issue Dec 12, 2023

🌱 Introduce a new go.mod for test/ kubernetes-sigs/cluster-api-provider-vsphere#2532

Merged

chrischdi mentioned this issue Feb 16, 2024

[WIP][DoNotReview] 🌱 test old capi+capv and k8s version kubernetes-sigs/cluster-api-provider-vsphere#2747

Closed
