ClusterResourceSet could take up to 16m40s to reconcile again due to locked ClusterCache Tracker #9775

Closed
chrischdi opened this issue Nov 27, 2023 · 5 comments · Fixed by #9777
Labels
area/clustercachetracker Issues or PRs related to the clustercachetracker area/clusterresourceset Issues or PRs related to clusterresourcesets kind/bug Categorizes issue or PR as related to a bug. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one.

Comments

@chrischdi
Member

What steps did you take and what happened?

The ClusterResourceSet controller repeatedly hits this line and returns with ctrl.Result{Requeue: true}:

https://github.com/kubernetes-sigs/cluster-api/blob/main/exp/addons/internal/controllers/clusterresourceset_controller.go#L169

Because the kube-apiserver takes some time to become reachable, this return gets hit multiple times. Because the request ends up in a rate-limited queue, the delay builds up to the maximum requeue time of 1000s (16m40s).

This could also happen in other controllers that return ctrl.Result{Requeue: true} when hitting the remote.ErrClusterLocked error.

What did you expect to happen?

CRS to roll out as soon as the kube-apiserver is available, so that the test can proceed and succeed.

Cluster API version

  • Using v1.6.0-rc.0 for the test.
  • During the workload cluster creation using CAPI v1.5.3

Kubernetes version

v1.28.0

Anything else you would like to add?

Workarounds are:

  • reducing the sync-period for cluster-api-controller-manager (which includes the clusterresourceset controller)
  • increasing the timeout to 20m for the test.

Example of a failed test: https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/kubernetes-sigs_cluster-api-provider-vsphere/2517/pull-cluster-api-provider-vsphere-e2e-main/1727951432931348480

Label(s) to be applied

/kind bug
One or more /area label. See https://github.com/kubernetes-sigs/cluster-api/labels?q=area for the list of labels.

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Nov 27, 2023
@sbueringer
Member

+1

I think in general it's fine to requeue with a constant interval when we hit this specific lock (e.g. 1m) instead of going into an exponential backoff.

/triage accepted

/cc @fabriziopandini

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Nov 27, 2023
@sbueringer sbueringer added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. area/clustercachetracker Issues or PRs related to the clustercachetracker area/clusterresourceset Issues or PRs related to clusterresourcesets and removed triage/accepted Indicates an issue or PR is ready to be actively worked on. labels Nov 27, 2023
@sbueringer
Member

(would be nice to get this fixed for v1.6.0)

@chrischdi
Member Author

The exponential requeue delay gets added because the controller returns ctrl.Result{Requeue: true}, which makes controller-runtime call c.Queue.AddRateLimited:

https://github.com/kubernetes-sigs/controller-runtime/blob/main/pkg/internal/controller/controller.go#L341

Using ctrl.Result{RequeueAfter: <duration>} would not lead to the exponential rate-limit backoff.

@sbueringer
Copy link
Member

@chrischdi Can we please follow up and do the same for all other controllers and CAPV?

@chrischdi
Member Author

chrischdi commented Dec 5, 2023

Other places where we return the same:
