Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

🌱 Make ClusterResourceSet controller more predictable #10869

Conversation

fabriziopandini
Copy link
Member

@fabriziopandini fabriziopandini commented Jul 15, 2024

What this PR does / why we need it:
ClusterResource set controller doesn't have an ideal implementation for cases where many ClusterResourceSets are targeting the same cluster (might be also the API is not specifically designed for this use case).

This PR is making this case a little bit more predictable, by avoiding a potential impact of exponential backoff delay quickly growing, but it is not solving the underlying issue.

At least, this should help in getting rid of some flakiness in our tests, and there is also a comment providing some more context on what's going on.

Another positive effect of this PR is that it remove a lot of noise from the logs

/area clusterresourceset

Related to #10854

@k8s-ci-robot k8s-ci-robot added area/clusterresourceset Issues or PRs related to clusterresourcesets cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Jul 15, 2024
@k8s-ci-robot k8s-ci-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Jul 15, 2024
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Jul 16, 2024
@fabriziopandini fabriziopandini added the tide/merge-method-merge Denotes a PR that should use a standard merge by tide when it merges. label Jul 16, 2024
@fabriziopandini
Copy link
Member Author

fabriziopandini commented Jul 16, 2024

Together with @sbueringer we did investingate this a little bit more. There are a few things at play:

  1. How the test was written
  • The test "Should handle applying multiple ClusterResourceSets concurrently to the same cluster" was re-using the same config map and secrets for 10 CRS (which does not makes sense in a real configuration)
  • When one of the CRS object reconciles, it adds owner refs to configMap and secrets, and watches on those objects triggers reconciles for all 10 CRS. This generates storm of events making the test unstable.
  • I have added a commit to create 1 CM/1 Secret for each CRS
  1. code dependencies with Cluster Cache Tracker.
  • The reconcile func is getting from Cluster Cache Tracker a remote client to apply CRS
  • At the beginning of the tests, when all the CRS objects reconciles toghether on a newly created cluster, it might happen that all the CRS objects except the first will get a lock error when trying to acquire a client from Cluster Cache Tracker
  • In case of lock errors, there is a requeue after 1 minute and test will fails (flake)
  • The new commit makes this (and other tests) more stable by creating a remote client in the setup phase (before all the reconcile).
  1. Test dependency with a remote client performances
  • The key to avoid conficts when updating CRS bindings is to keep reconcile duration short.
  • By default, the remote client used to apply CRS has a verty small QPS and burst.
  • With 10 CRS reconciling in parallel and all using the same remote client, throttling was kicking in, making the test duration unpredictable.
  • @sbueringer is working to a separate PR to make this configurable

We should wait for @sbueringer PR to merge before checking again if this fix the issue

Copy link
Member

@sbueringer sbueringer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a few nits, otherwise lgtm

@sbueringer
Copy link
Member

The separate PR mentioned above is open now (#10880)

@fabriziopandini
Copy link
Member Author

/hold for #10880

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 16, 2024
@sbueringer
Copy link
Member

Thx!

/lgtm
/approve

(keeping the hold for the other PR and thus also until after the beta tag is created)

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 16, 2024
@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

Git tree hash: d5f1201126700ccde19cea8527f7cac1a67a8734

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: sbueringer

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 16, 2024
@sbueringer
Copy link
Member

@fabriziopandini my PR is merged, you have to rebase if you want to run the test locally though

@sbueringer
Copy link
Member

/test pull-cluster-api-test-main

1 similar comment
@fabriziopandini
Copy link
Member Author

/test pull-cluster-api-test-main

@sbueringer
Copy link
Member

sbueringer commented Jul 17, 2024

https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/kubernetes-sigs_cluster-api/10869/pull-cluster-api-test-main/1813476101129768960 is a new flake independent of your PR (flaky on main, fails 100% on test-mink8s)

@sbueringer
Copy link
Member

With the qps changes one run took 6s and another 7s. That is roughly what we expected (and far away from the 1m timeout)

@jimmidyson
Copy link
Member

Thank you so much @fabriziopandini and @sbueringer for looking at this! Sorry for introducing this test flakiness while fixing the previous bug 😞

@jimmidyson
Copy link
Member

/lgtm

Copy link
Member

@chrischdi chrischdi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@sbueringer
Copy link
Member

/hold cancel

Let's keep an eye on the test. I think it should not be flaky anymore (even with the current timeout)

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 17, 2024
@k8s-ci-robot k8s-ci-robot merged commit 94bfff5 into kubernetes-sigs:main Jul 17, 2024
21 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v1.8 milestone Jul 17, 2024
@jimmidyson
Copy link
Member

Can we backport this to 1.7 branch too please?

@jimmidyson
Copy link
Member

/cherrypick release-1.7

@k8s-infra-cherrypick-robot

@jimmidyson: #10869 failed to apply on top of branch "release-1.7":

Applying: Make ClusterResourceSet controller more predictable
Applying: Improve TestClusterResourceSetReconciler
Using index info to reconstruct a base tree...
M	exp/addons/internal/controllers/clusterresourceset_controller_test.go
M	exp/addons/internal/controllers/suite_test.go
M	internal/test/envtest/environment.go
Falling back to patching base and 3-way merge...
Auto-merging internal/test/envtest/environment.go
Auto-merging exp/addons/internal/controllers/suite_test.go
CONFLICT (content): Merge conflict in exp/addons/internal/controllers/suite_test.go
Auto-merging exp/addons/internal/controllers/clusterresourceset_controller_test.go
CONFLICT (content): Merge conflict in exp/addons/internal/controllers/clusterresourceset_controller_test.go
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0002 Improve TestClusterResourceSetReconciler
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

In response to this:

/cherrypick release-1.7

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@sbueringer
Copy link
Member

Can we backport this to 1.7 branch too please?

Yup, that's fine. Just needs a manual cherry-pick. Maybe you can check for other PRs we had in this area and that should probably also be cherry-picked (I think there were only 1-2 since your PR merged)

@jimmidyson
Copy link
Member

@sbueringer Looks like we'd need to cherry-pick ##10756, #10880, and this one. Does that sound right to you?

jimmidyson pushed a commit to jimmidyson/cluster-api that referenced this pull request Jul 17, 2024
…usterresourceset-more-predictable

🌱 Make ClusterResourceSet controller more predictable
@sbueringer
Copy link
Member

Sounds reasonable!

@fabriziopandini fabriziopandini deleted the make-clusterresourceset-more-predictable branch July 18, 2024 19:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/clusterresourceset Issues or PRs related to clusterresourcesets cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. tide/merge-method-merge Denotes a PR that should use a standard merge by tide when it merges.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants