Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

🐛 Increase timeout for TestClusterResourceSetReconciler test #10868

Conversation

Sunnatillo
Copy link
Contributor

What this PR does / why we need it:
This PR fixes the flaky test: TestClusterResourceSetReconciler/Should_handle_applying_multiple_ClusterResourceSets_concurrently_to_the_same_cluster. This test was failing due to timeout issue:

 clusterresourceset_controller_test.go:1068: 
        Timed out after 60.001s.
        The function passed to Eventually failed at /home/prow/go/src/sigs.k8s.io/cluster-api/exp/addons/internal/controllers/clusterresourceset_controller_test.go:1062 with:
        Expected
            <[]v1beta1.ResourceBinding | len:0, cap:0>: nil
        to have length 2

Increasing timeout helped to solve the issue. The objects will be eventually created if the timeout is bit longer.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes: #10854

Signed-off-by: Sunnatillo <sunnat.samadov@est.tech>
@k8s-ci-robot k8s-ci-robot added area/ci Issues or PRs related to ci cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Jul 15, 2024
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign killianmuldoon for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot
Copy link
Contributor

Hi @Sunnatillo. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Jul 15, 2024
@fabriziopandini
Copy link
Member

/ok-to-test

I'm not opposed to this change, but I have created also #10869 to reduce the impact of exponential backoff on the time the controller takes to reach a stable state + reduce noise in the logs

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jul 15, 2024
@sbueringer
Copy link
Member

sbueringer commented Jul 16, 2024

/hold

I would like to know/test if fabrizio's fix makes this one unnecessary

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 16, 2024
@Sunnatillo
Copy link
Contributor Author

Sunnatillo commented Jul 16, 2024

/hold

I would like to know/test if fabrizio's fix makes this one unnecessary

Most flikely it will help, I will test and confirm after PR gets merged.

@adilGhaffarDev
Copy link
Contributor

I would like to know/test if fabrizio's fix makes this one unnecessary

I have tested locally, the test is failing even after Fabirizio's fix, I ran 50 consecutive tests and it failed after 15-16 I think. it might have decreased the occurrence, But I cannot say for sure before we merge it.

I have also tested Sunnat's fix, it passes 50 consecutive runs. I would say we take in both PRs.

@Sunnatillo
Copy link
Contributor Author

I would like to know/test if fabrizio's fix makes this one unnecessary

I have tested locally, the test is failing even after Fabirizio's fix, I ran 50 consecutive tests and it failed after 15-16 I think. it might have decreased the occurrence, But I cannot say for sure before we merge it.

I have also tested Sunnat's fix, it passes 50 consecutive runs. I would say we take in both PRs.

Thank you for testing.

I also tested with 200 count. The result is same.

@sbueringer
Copy link
Member

sbueringer commented Jul 16, 2024

Maybe we should then decrease the number of clusters. Usually unit tests that take more than a few milliseconds are already bad (we need quick feedback loops when running all unit tests of a repository). Let's say a few seconds are fine because we're talking about controllers with backoffs here. Now we would be up to 2m. This is really really bad

Also really strange that we need more than a minute in some cases with Fabrizio's fix to be honest

@adilGhaffarDev
Copy link
Contributor

Maybe we should then decrease the number of clusters. Usually unit tests that take more than a few milliseconds are already bad (we need quick feedback loops when running all unit tests of a repository). Let's say a few seconds are fine because we're talking about controllers with backoffs here. Now we would be up to 2m. This is really really bad

Did you mean to decrease the number of ClusterResourceSets here:

I tried changing it from 10 to 5 and running tests again, but it was still failing.

I0716 15:48:05.170783   20650 clusterresourceset_controller.go:195] "Conflict in patching a ClusterResourceSetBinding that is updated by more than one ClusterResourceSet, requeing" controller="clusterresourceset" controllerGroup="addons.cluster.x-k8s.io" controllerKind="ClusterResourceSet" ClusterResourceSet="test-cluster-resource-set-6kj9w/clusterresourceset-5q5an2" namespace="test-cluster-resource-set-6kj9w" name="clusterresourceset-5q5an2" reconcileID="6eb8e144-44b7-4fd3-82ab-a665b0900888"
    clusterresourceset_controller_test.go:1068: 
        Timed out after 60.000s.
        The function passed to Eventually failed at /Users/adil/Desktop/fab/cluster-api/exp/addons/internal/controllers/clusterresourceset_controller_test.go:1062 with:
        Expected
            <[]v1beta1.ResourceBinding | len:0, cap:0>: nil
        to have length 2
    panic.go:626: Deleting the Kubeconfigsecret
--- FAIL: TestClusterResourceSetReconciler (60.04s)
    --- FAIL: TestClusterResourceSetReconciler/Should_handle_applying_multiple_ClusterResourceSets_concurrently_to_the_same_cluster (60.04s)
FAIL

@sbueringer
Copy link
Member

It's fine. I was looking into Fabrizio's PR with Fabrizio. We'll make some more improvements (@fabriziopandini will share details)

@sbueringer
Copy link
Member

Just a quick update (for more details please see the comments on #10869)

Overall we feel confident that we fixed the issue in a way that we don't have to increase the timeout anymore. Let's monitor the test for 1-2 days and then close this PR if the test is not flaky anymore (feel free to run local tests again with the latest version of Fabrizio's PR / CAPI main)

@Sunnatillo
Copy link
Contributor Author

/close
Fabrizio's PR solved the issue.

@k8s-ci-robot
Copy link
Contributor

@Sunnatillo: Closed this PR.

In response to this:

/close
Fabrizio's PR solved the issue.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/ci Issues or PRs related to ci cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

TestClusterResourceSetReconciler test is flaky
5 participants