
delete LRO state when operations fail #4011

Merged
merged 1 commit into from
Sep 27, 2023

Conversation

nojnhuh
Contributor

@nojnhuh nojnhuh commented Sep 19, 2023

What type of PR is this?
/kind bug
/kind flake

What this PR does / why we need it:

This PR fixes the CI flakes caused by #3970 by ensuring long-running operation states do not persist when the operation fails.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #3970

Special notes for your reviewer:

  • cherry-pick candidate
    /cherry-pick release-1.11

TODOs:

  • squashed commits
  • includes documentation
  • adds unit tests

Release note:

NONE

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/bug Categorizes issue or PR as related to a bug. kind/flake Categorizes issue or PR as related to a flaky test. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Sep 19, 2023
@k8s-ci-robot k8s-ci-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Sep 19, 2023
@codecov

codecov bot commented Sep 19, 2023

Codecov Report

All modified lines are covered by tests ✅

Comparison is base (f86e4e2) 56.60% compared to head (619dfd0) 56.63%.
Report is 30 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4011      +/-   ##
==========================================
+ Coverage   56.60%   56.63%   +0.02%     
==========================================
  Files         187      187              
  Lines       19124    19131       +7     
==========================================
+ Hits        10825    10834       +9     
+ Misses       7669     7668       -1     
+ Partials      630      629       -1     
Files Coverage Δ
azure/services/async/async.go 81.19% <100.00%> (+8.54%) ⬆️

... and 4 files with indirect coverage changes


@nojnhuh
Contributor Author

nojnhuh commented Sep 19, 2023

/test pull-cluster-api-provider-azure-e2e-aks

1 similar comment
@nojnhuh
Contributor Author

nojnhuh commented Sep 19, 2023

/test pull-cluster-api-provider-azure-e2e-aks

@nojnhuh
Contributor Author

nojnhuh commented Sep 19, 2023

/cherry-pick release-1.11

@k8s-infra-cherrypick-robot

@nojnhuh: once the present PR merges, I will cherry-pick it on top of release-1.11 in a new PR and assign it to you.

In response to this:

/cherry-pick release-1.11

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@nojnhuh
Contributor Author

nojnhuh commented Sep 19, 2023

This time it didn't fail, but a few of these popped up:

I0919 21:38:16.442214       1 async.go:113] async.Service.CreateOrUpdateResource "msg"="CreateOrUpdateAsync returned poller in terminal state" "AzureManagedMachinePool"={"name":"capz-e2e-tycqsv-aks-pool1","namespace":"capz-e2e-tycqsv"} "controller"="azuremanagedmachinepool" "controllerGroup"="infrastructure.cluster.x-k8s.io" "controllerKind"="AzureManagedMachinePool" "error"="context deadline exceeded" "name"="capz-e2e-tycqsv-aks-pool1" "namespace"="capz-e2e-tycqsv" "reconcileID"="13f8d044-d520-41c4-9a00-763c3ae30812" "resultErr"=null "x-ms-correlation-request-id"="a200e06a-73f7-455f-9d8b-237ca0913740"

https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/kubernetes-sigs_cluster-api-provider-azure/4011/pull-cluster-api-provider-azure-e2e-aks/1704240032052678656

I'd like to know if the ones causing the test to fail look the same though.

/test pull-cluster-api-provider-azure-e2e-aks

@nojnhuh
Contributor Author

nojnhuh commented Sep 20, 2023

/test pull-cluster-api-provider-azure-e2e-aks

@nojnhuh
Contributor Author

nojnhuh commented Sep 20, 2023

Finally got a repro and this is what the new log is writing:
https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/kubernetes-sigs_cluster-api-provider-azure/4011/pull-cluster-api-provider-azure-e2e-aks/1704307759115145216
I0920 02:35:39.306971 1 async.go:160] async.Service.DeleteResource "msg"="DeleteAsync returned poller in terminal state" "AzureManagedMachinePool"={"name":"pool3","namespace":"capz-e2e-6c5zgv"} "controller"="azuremanagedmachinepool" "controllerGroup"="infrastructure.cluster.x-k8s.io" "controllerKind"="AzureManagedMachinePool" "error"="GET https://management.azure.com/subscriptions/===REDACTED===/providers/Microsoft.ContainerService/locations/westeurope/operations/1cbd5af2-4080-45d9-8d28-5fef6ea455bd\n--------------------------------------------------------------------------------\nRESPONSE 200: 200 OK\nERROR CODE UNAVAILABLE\n--------------------------------------------------------------------------------\n{\n \"name\": \"f25abd1c-8040-d945-8d28-5fef6ea455bd\",\n \"status\": \"Canceled\",\n \"startTime\": \"2023-09-20T02:35:29.2316449Z\",\n \"endTime\": \"2023-09-20T02:35:37.4607486Z\",\n \"error\": {\n \"code\": \"\",\n \"message\": \"operation discarded: Category: InternalError; Code: NotLatestOperation; SubCode: ; Message: Cannot proceed with the operation. Either the operation has been preempted by another one, or the information needed by the operation failed to be saved (or hasn't been saved yet).; InnerMessage: Operation ID does not match latest operation ID in goal state. Operation ID: 1cbd5af2-4080-45d9-8d28-5fef6ea455bd, goalstate latest operation ID: 966b6f7a-fa74-4968-ac3a-48aa70f7fc37. 
This could mean that after the operation is enqueued, the code that saves the goal state kept failing (HCP/database issue), or the operation has been preempted by a new DELETE operation.; Dependency: ; AKSTeam: ; OriginalError: current goalstate is not associated with the ongoing operation\"\n }\n}\n--------------------------------------------------------------------------------\n" "name"="pool3" "namespace"="capz-e2e-6c5zgv" "reconcileID"="478e6211-4faf-4b5b-921e-17e5960b4927" "resultErr"="GET https://management.azure.com/subscriptions/===REDACTED===/providers/Microsoft.ContainerService/locations/westeurope/operations/1cbd5af2-4080-45d9-8d28-5fef6ea455bd\n--------------------------------------------------------------------------------\nRESPONSE 200: 200 OK\nERROR CODE UNAVAILABLE\n--------------------------------------------------------------------------------\n{\n \"name\": \"f25abd1c-8040-d945-8d28-5fef6ea455bd\",\n \"status\": \"Canceled\",\n \"startTime\": \"2023-09-20T02:35:29.2316449Z\",\n \"endTime\": \"2023-09-20T02:35:37.4607486Z\",\n \"error\": {\n \"code\": \"\",\n \"message\": \"operation discarded: Category: InternalError; Code: NotLatestOperation; SubCode: ; Message: Cannot proceed with the operation. Either the operation has been preempted by another one, or the information needed by the operation failed to be saved (or hasn't been saved yet).; InnerMessage: Operation ID does not match latest operation ID in goal state. Operation ID: 1cbd5af2-4080-45d9-8d28-5fef6ea455bd, goalstate latest operation ID: 966b6f7a-fa74-4968-ac3a-48aa70f7fc37. 
This could mean that after the operation is enqueued, the code that saves the goal state kept failing (HCP/database issue), or the operation has been preempted by a new DELETE operation.; Dependency: ; AKSTeam: ; OriginalError: current goalstate is not associated with the ongoing operation\"\n }\n}\n--------------------------------------------------------------------------------\n" "x-ms-correlation-request-id"="2e902dca-a495-4e9a-95d5-838fe833e34e"

@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Sep 22, 2023
@nojnhuh
Contributor Author

nojnhuh commented Sep 22, 2023

/test pull-cluster-api-provider-azure-e2e-aks

4 similar comments
@nojnhuh
Contributor Author

nojnhuh commented Sep 22, 2023

/test pull-cluster-api-provider-azure-e2e-aks

@nojnhuh
Contributor Author

nojnhuh commented Sep 22, 2023

/test pull-cluster-api-provider-azure-e2e-aks

@nojnhuh
Contributor Author

nojnhuh commented Sep 22, 2023

/test pull-cluster-api-provider-azure-e2e-aks

@nojnhuh
Contributor Author

nojnhuh commented Sep 22, 2023

/test pull-cluster-api-provider-azure-e2e-aks

@nojnhuh
Contributor Author

nojnhuh commented Sep 25, 2023

The root of the specific problem we're seeing in the e2e test seems to be that we weren't throttling transient errors in the AMMP controller. There seems to be some hiccup when we send a bunch of deletes in a short timeframe: AKS starts two different delete operations on the same agent pool. AKS lets the second delete supersede the first, but CAPZ doesn't notice the second one and keeps polling the first, leading to the above error.

I think adding transient error handling to the AMMP controller is an obvious-enough improvement that it's worth merging on its own. That alone might fix the flakes, though it doesn't really target the scope of #3970. @mboersma @CecileRobertMichon Do you have thoughts on whether to keep the logging this PR currently adds, in case related issues pop up later?

@CecileRobertMichon
Contributor

It looks like there's some hiccup when we end up sending a bunch of deletes in a short timeframe where AKS starts two different delete operations on the same agent pool

Why are we sending a DELETE API request if there is already an ongoing long-running DELETE operation? That seems like a bug that could also lead to API throttling.

@nojnhuh
Contributor Author

nojnhuh commented Sep 25, 2023

It looks like there's some hiccup when we end up sending a bunch of deletes in a short timeframe where AKS starts two different delete operations on the same agent pool

Why are we sending a DELETE API request if there is already an ongoing long-running DELETE operation? That seems like a bug that could also lead to API throttling.

I didn't get that far, but adding the RequeueAfter for transient errors seems to get around this AFAICT.

@nojnhuh
Contributor Author

nojnhuh commented Sep 25, 2023

The only theory I can posit is that we add the longRunningOperationStates for the delete, immediately requeue, and the read cache is still stale and the object doesn't have that resume token yet, so we start a new operation. I'd hope that the client is smart enough to cache the result of the write though.

@CecileRobertMichon
Contributor

I didn't get that far, but adding the RequeueAfter for transient errors seems to get around this AFAICT.

It seems to me that "controller requeues too often for transient errors" and "async poller sends a DELETE call when there is already an ongoing one" are potentially two separate bugs. Even if the first mitigates the symptoms of the second, we should dig deeper to understand why the second occurs.

@nojnhuh
Contributor Author

nojnhuh commented Sep 26, 2023

I've opened a separate PR with the AMMP transient error handling and I'll leave this open to continue investigating.
#4039

@k8s-ci-robot k8s-ci-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Sep 26, 2023
@nojnhuh
Contributor Author

nojnhuh commented Sep 26, 2023

So I think the root of the problem is that we set the longRunningOperationStates whenever an operation fails, for any reason, when we should be checking whether the operation failed only because it's not done yet. If it fails for any other reason, we should delete the existing longRunningOperationStates so the next reconcile tries again from scratch.

With the changes I just pushed, I hit the same error locally (only once for the pool, as expected) and the test seemed to recover. Here are the full logs for that delete: poolspot-delete.log

@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Sep 26, 2023
@@ -106,20 +106,23 @@ func (s *Service[C, D]) CreateOrUpdateResource(ctx context.Context, spec azure.R

 	result, poller, err := s.Creator.CreateOrUpdateAsync(ctx, spec, resumeToken, parameters)
 	errWrapped := errors.Wrapf(err, "failed to create or update resource %s/%s (service: %s)", rgName, resourceName, serviceName)
-	if poller != nil {
+	if poller != nil && azure.IsContextDeadlineExceededOrCanceledError(err) {
 		future, err := converters.PollerToFuture(poller, infrav1.PutFuture, serviceName, resourceName, rgName)
 		if err != nil {
 			return nil, errWrapped
 		}
 		s.Scope.SetLongRunningOperationState(future)
Contributor Author

@nojnhuh nojnhuh Sep 26, 2023


The functional change here is to only SetLongRunningOperationState when the error is something like "not done yet." In all other cases, successful or not, we DeleteLongRunningOperationState.

Member


Asking for clarity (I may be way off on the workflow): what happens to long-running requests if they fail, or if the higher-level resource cancels the context due to some failure? Do we intend to requeue this resource's request and not save its state in Status.Condition?

Contributor Author


Right. If an operation fails, then it's done, and we shouldn't keep polling that specific operation because it will remain failed. In that case we return the error which will requeue the resource and a new operation will start.

@nojnhuh
Contributor Author

nojnhuh commented Sep 26, 2023

I'm fairly confident this fixes the issue now.

/retitle delete LRO state when operations fail
/hold for squash
/assign @mboersma @CecileRobertMichon

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 26, 2023
@k8s-ci-robot k8s-ci-robot changed the title [WIP] fix error when Create/Update/DeleteAsync return Done pollers delete LRO state when operations fail Sep 26, 2023
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 26, 2023
 	}

-	// Once the operation is done, delete the long-running operation state.
+	// Once the operation is done, delete the long-running operation state. Even if the operation ended with
+	// an error, clear out any lingering state to try the operation again.
Contributor


This is the equivalent of what we were doing before: a521b7b#diff-0e657fbf13cf152e97cf8871a3baf550199d64f99f94316e7e1b9eeb5d6cc8e4L90

@CecileRobertMichon
Contributor

Great find @nojnhuh

Do we still want the change to handle transient errors in the AMMP controller so we don't requeue too aggressively?

@nojnhuh
Contributor Author

nojnhuh commented Sep 26, 2023

Great find @nojnhuh

Do we still want the change to handle transient errors in the AMMP controller so we don't requeue too aggressively?

Thanks! And yes, I think that is also valuable still.

@CecileRobertMichon
Contributor

I just saw you had a separate PR open for that one already, #4039 :)

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 26, 2023
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: 6bf421c6c1b2c5f333562c319ca9fd7eb5fc9a61

Contributor

@mboersma mboersma left a comment


/lgtm

Needs squash. Nice work!

@nojnhuh
Contributor Author

nojnhuh commented Sep 26, 2023

squashed!
/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 26, 2023
Contributor

@CecileRobertMichon CecileRobertMichon left a comment


/approve

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: CecileRobertMichon

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 26, 2023
@k8s-ci-robot k8s-ci-robot merged commit 3a40dd4 into kubernetes-sigs:main Sep 27, 2023
@k8s-ci-robot k8s-ci-robot added this to the v1.12 milestone Sep 27, 2023
@k8s-infra-cherrypick-robot

@nojnhuh: new pull request created: #4044

In response to this:

/cherry-pick release-1.11

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@nojnhuh nojnhuh deleted the async-done-poller branch September 27, 2023 01:11
Successfully merging this pull request may close these issues.

asyncpoller fails when SDK PollUntilDone returns a Done poller