TestClusterResourceSetReconciler test is flaky #10854
cc @fabriziopandini Should be the same as the one I mentioned last time |
@Sunnatillo I think the link is pointing to the wrong job |
Fixed now. |
@jimmidyson PTAL It is important to nail this down before release |
This page here should allow you to go further back: As alternative, you could try to filter at k8s-triage |
Thanks @chrischdi! I've gone back to the time of merge of #10656 and I can only see failures for this test after #10756 was merged so I'm pretty sure that introduced the flakiness. I'll take a closer look if I can find time to try to help figure out what's going on. |
Increasing the timeout helps to solve the issue. |
The fact that a ClusterResourceSet binding takes so long to reach a stable state isn't ideal. The issue is that we are re-queuing on API conflicts, so subsequent reconciliations are delayed by an exponential backoff that grows quickly (plus the other side of the coin: many reconciliations happen in a very short sequence at the beginning of the backoff sequence). TL;DR: exponential backoff should be used to handle errors, not to govern how controllers reach a stable state. I have submitted #10869 to get rid of the exponential backoff and all of the noise that API conflicts were adding to the logs, and to document the problem. But in this case as well (like the timeout increase in the test), this is a mitigation. |
Let's see if the test is stable now and then close the issue in 1-2 days if the test is stable (xref: #10868 (comment)) |
It did not occur today; it occurred often before the fix. We can close this issue and the PR. |
I ran the test 200 times and all runs passed. I would say we are safe to close this issue. |
Great! Thx for testing /close |
@sbueringer: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
Which jobs are flaking?
capi-test-main
Which tests are flaking?
TestClusterResourceSetReconciler
Since when has it been flaking?
Most likely after merging this PR: #10656
Testgrid link
edited: https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/periodic-cluster-api-test-mink8s-main/1810533764078505984
Reason for failure (if possible)
No response
Anything else we need to know?
No response
Label(s) to be applied
/kind flake
/area ci