delete LRO state when operations fail #4011
Conversation
Force-pushed from 2fd5cd3 to 1ceac0f
Codecov Report: All modified lines are covered by tests ✅

Additional details and impacted files:

@@            Coverage Diff             @@
##             main    #4011      +/-   ##
==========================================
+ Coverage   56.60%   56.63%   +0.02%
==========================================
  Files         187      187
  Lines       19124    19131       +7
==========================================
+ Hits        10825    10834       +9
+ Misses       7669     7668       -1
+ Partials      630      629       -1

☔ View full report in Codecov by Sentry.
/test pull-cluster-api-provider-azure-e2e-aks
1 similar comment
/test pull-cluster-api-provider-azure-e2e-aks
/cherry-pick release-1.11
@nojnhuh: once the present PR merges, I will cherry-pick it on top of release-1.11 in a new PR and assign it to you.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
This time it didn't fail, but a few of these popped up:
I'd like to know if the ones causing the test to fail look the same, though.
/test pull-cluster-api-provider-azure-e2e-aks
/test pull-cluster-api-provider-azure-e2e-aks
Finally got a repro and this is what the new log is writing:
/test pull-cluster-api-provider-azure-e2e-aks
4 similar comments
/test pull-cluster-api-provider-azure-e2e-aks
/test pull-cluster-api-provider-azure-e2e-aks
/test pull-cluster-api-provider-azure-e2e-aks
/test pull-cluster-api-provider-azure-e2e-aks
The root of the specific problem we're seeing in the e2e test seems to be that we weren't throttling transient errors in the AMMP controller. It looks like when we end up sending a bunch of deletes in a short timeframe, AKS sometimes starts two different delete operations on the same agent pool. AKS lets the second delete supersede the first, but CAPZ doesn't notice the second one and keeps polling the first, leading to the above error.

I think adding transient error handling to the AMMP controller is an obvious enough improvement that it's worth merging. That alone might fix the flakes, though it doesn't really target the scope of #3970.

@mboersma @CecileRobertMichon Do you have thoughts on whether or not to keep the logging that this PR currently adds, in case related issues pop up later?
Why are we sending a DELETE API request if there is already an ongoing long-running DELETE operation? That seems like a bug that could also lead to API throttling.
I didn't get that far, but adding the RequeueAfter for transient errors seems to get around this AFAICT.
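To illustrate the RequeueAfter approach mentioned above, here is a minimal sketch of the pattern, assuming a hypothetical transientError type and handleReconcileError helper (this is not the actual CAPZ code): a transient error becomes a delayed requeue instead of an immediate retry, which is what throttles the controller.

package example

import (
	"errors"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

// transientError is a hypothetical stand-in for an error type that knows the
// operation is still in progress and how long to wait before checking again.
type transientError struct {
	cause      error
	retryAfter time.Duration
}

func (e transientError) Error() string { return e.cause.Error() }

// handleReconcileError converts a transient error into a delayed requeue.
// Returning the error instead would make controller-runtime retry on its own
// backoff schedule, which is the aggressive requeueing described above.
func handleReconcileError(err error) (ctrl.Result, error) {
	var terr transientError
	if errors.As(err, &terr) {
		return ctrl.Result{RequeueAfter: terr.retryAfter}, nil
	}
	return ctrl.Result{}, err
}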
The only theory I can posit is that we add the
It seems to me that "controller requeues too often for transient errors" and "async poller sends a DELETE call when there is already an ongoing one" are potentially two separate bugs. Even if the first mitigates the symptoms of the second, we should dig deeper to understand why the second occurs.
I've opened a separate PR with the AMMP transient error handling and I'll leave this open to continue investigating.
So I think the root of the problem is that we set the

With the changes I just pushed, I hit the same error locally (only once for the pool, as expected) and the test seemed to recover. Here are the full logs for that delete: poolspot-delete.log
@@ -106,20 +106,23 @@ func (s *Service[C, D]) CreateOrUpdateResource(ctx context.Context, spec azure.R

 	result, poller, err := s.Creator.CreateOrUpdateAsync(ctx, spec, resumeToken, parameters)
 	errWrapped := errors.Wrapf(err, "failed to create or update resource %s/%s (service: %s)", rgName, resourceName, serviceName)
-	if poller != nil {
+	if poller != nil && azure.IsContextDeadlineExceededOrCanceledError(err) {
 		future, err := converters.PollerToFuture(poller, infrav1.PutFuture, serviceName, resourceName, rgName)
 		if err != nil {
 			return nil, errWrapped
 		}
 		s.Scope.SetLongRunningOperationState(future)
The functional change here is to only SetLongRunningOperationState when the error is something like "not done yet." In all other cases, successful or not, we DeleteLongRunningOperationState.
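For context, here is a rough sketch of what a helper like azure.IsContextDeadlineExceededOrCanceledError presumably reduces to; this is an assumption about its behavior (shown with only the standard library), not its actual implementation. The idea is that only errors meaning "we stopped waiting" count as "not done yet"; anything the service actually rejected is terminal.

package example

import (
	"context"
	"errors"
)

// isStillInFlight reports whether err only means the reconcile context ran out,
// i.e. the operation may still be running in Azure and its poller state is worth
// saving. Terminal service errors return false, so no LRO state is kept for them.
func isStillInFlight(err error) bool {
	return errors.Is(err, context.DeadlineExceeded) || errors.Is(err, context.Canceled)
}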
Asking for clarity (I may be way off on the workflow): what happens to LongRunningRequests if they fail, or if the higher-abstraction resource cancels the context due to some failure? Do we intend to re-queue this resource's request and not save its state in Status.Condition?
Right. If an operation fails, then it's done, and we shouldn't keep polling that specific operation because it will remain failed. In that case we return the error, which requeues the resource, and a new operation will start.
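A self-contained sketch of that lifecycle, using hypothetical scope and finish names rather than the real CAPZ types: state is kept only while the operation is still running, and both success and terminal failure clear it so the next reconcile starts a fresh operation.

package example

import "errors"

// scope is a toy stand-in for the cluster scope's long-running operation state.
type scope struct {
	savedFuture *string // nil means "no operation in flight"
}

func (s *scope) setLongRunningOperationState(future string) { s.savedFuture = &future }
func (s *scope) deleteLongRunningOperationState()           { s.savedFuture = nil }

var errStillRunning = errors.New("operation not done yet")

// finish is called after polling an operation. Only a "not done yet" error keeps
// the saved state; success or any terminal failure clears it, and returning the
// error makes the controller requeue so a brand-new operation starts next time.
func finish(s *scope, future string, err error) error {
	if errors.Is(err, errStillRunning) {
		s.setLongRunningOperationState(future)
		return err
	}
	s.deleteLongRunningOperationState()
	return err
}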
I'm fairly confident this fixes the issue now.
/retitle delete LRO state when operations fail
 	}

-	// Once the operation is done, delete the long-running operation state.
+	// Once the operation is done, delete the long-running operation state. Even if the operation ended with
+	// an error, clear out any lingering state to try the operation again.
This is the equivalent of what we were doing before a521b7b#diff-0e657fbf13cf152e97cf8871a3baf550199d64f99f94316e7e1b9eeb5d6cc8e4L90
Great find @nojnhuh! Do we still want the change to handle transient errors in the AMMP controller so we don't requeue too aggressively?
Thanks! And yes, I think that is still valuable as well.
I just saw you already had a separate PR open for that one: #4039 :)
/lgtm
LGTM label has been added. Git tree hash: 6bf421c6c1b2c5f333562c319ca9fd7eb5fc9a61
|
/lgtm
Needs squash. Nice work!
Force-pushed from e6a67c4 to 619dfd0
Squashed!
/approve
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: CecileRobertMichon

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing
@nojnhuh: new pull request created: #4044

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
What type of PR is this?
/kind bug
/kind flake
What this PR does / why we need it:
This PR fixes the CI flakes caused by #3970 by ensuring long-running operation states do not persist when the operation fails.
Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #3970
Special notes for your reviewer:
/cherry-pick release-1.11
TODOs:
Release note: