
delete LRO state when operations fail #4011

Merged
merged 1 commit into from
Sep 27, 2023

Conversation

nojnhuh
Contributor

@nojnhuh nojnhuh commented Sep 19, 2023

What type of PR is this?
/kind bug
/kind flake

What this PR does / why we need it:

This PR fixes the CI flakes caused by #3970 by ensuring long-running operation states do not persist when the operation fails.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #3970

Special notes for your reviewer:

  • cherry-pick candidate
    /cherry-pick release-1.11

TODOs:

  • squashed commits
  • includes documentation
  • adds unit tests

Release note:

NONE

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/bug Categorizes issue or PR as related to a bug. kind/flake Categorizes issue or PR as related to a flaky test. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Sep 19, 2023
@k8s-ci-robot k8s-ci-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Sep 19, 2023
@codecov

codecov bot commented Sep 19, 2023

Codecov Report

All modified lines are covered by tests ✅

Comparison is base (f86e4e2) 56.60% compared to head (619dfd0) 56.63%.
Report is 30 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4011      +/-   ##
==========================================
+ Coverage   56.60%   56.63%   +0.02%     
==========================================
  Files         187      187              
  Lines       19124    19131       +7     
==========================================
+ Hits        10825    10834       +9     
+ Misses       7669     7668       -1     
+ Partials      630      629       -1     
Files Coverage Δ
azure/services/async/async.go 81.19% <100.00%> (+8.54%) ⬆️

... and 4 files with indirect coverage changes


@nojnhuh
Contributor Author

nojnhuh commented Sep 19, 2023

/test pull-cluster-api-provider-azure-e2e-aks

1 similar comment
@nojnhuh
Contributor Author

nojnhuh commented Sep 19, 2023

/test pull-cluster-api-provider-azure-e2e-aks

@nojnhuh
Contributor Author

nojnhuh commented Sep 19, 2023

/cherry-pick release-1.11

@k8s-infra-cherrypick-robot

@nojnhuh: once the present PR merges, I will cherry-pick it on top of release-1.11 in a new PR and assign it to you.

In response to this:

/cherry-pick release-1.11

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@nojnhuh
Contributor Author

nojnhuh commented Sep 19, 2023

This time it didn't fail, but a few of these popped up:

I0919 21:38:16.442214       1 async.go:113] async.Service.CreateOrUpdateResource "msg"="CreateOrUpdateAsync returned poller in terminal state" "AzureManagedMachinePool"={"name":"capz-e2e-tycqsv-aks-pool1","namespace":"capz-e2e-tycqsv"} "controller"="azuremanagedmachinepool" "controllerGroup"="infrastructure.cluster.x-k8s.io" "controllerKind"="AzureManagedMachinePool" "error"="context deadline exceeded" "name"="capz-e2e-tycqsv-aks-pool1" "namespace"="capz-e2e-tycqsv" "reconcileID"="13f8d044-d520-41c4-9a00-763c3ae30812" "resultErr"=null "x-ms-correlation-request-id"="a200e06a-73f7-455f-9d8b-237ca0913740"

https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/kubernetes-sigs_cluster-api-provider-azure/4011/pull-cluster-api-provider-azure-e2e-aks/1704240032052678656

I'd like to know if the ones causing the test to fail look the same though.

/test pull-cluster-api-provider-azure-e2e-aks

@nojnhuh
Contributor Author

nojnhuh commented Sep 20, 2023

/test pull-cluster-api-provider-azure-e2e-aks

@nojnhuh
Contributor Author

nojnhuh commented Sep 20, 2023

Finally got a repro and this is what the new log is writing:
https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/kubernetes-sigs_cluster-api-provider-azure/4011/pull-cluster-api-provider-azure-e2e-aks/1704307759115145216
I0920 02:35:39.306971 1 async.go:160] async.Service.DeleteResource "msg"="DeleteAsync returned poller in terminal state" "AzureManagedMachinePool"={"name":"pool3","namespace":"capz-e2e-6c5zgv"} "controller"="azuremanagedmachinepool" "controllerGroup"="infrastructure.cluster.x-k8s.io" "controllerKind"="AzureManagedMachinePool" "error"="GET https://management.azure.com/subscriptions/===REDACTED===/providers/Microsoft.ContainerService/locations/westeurope/operations/1cbd5af2-4080-45d9-8d28-5fef6ea455bd\n--------------------------------------------------------------------------------\nRESPONSE 200: 200 OK\nERROR CODE UNAVAILABLE\n--------------------------------------------------------------------------------\n{\n \"name\": \"f25abd1c-8040-d945-8d28-5fef6ea455bd\",\n \"status\": \"Canceled\",\n \"startTime\": \"2023-09-20T02:35:29.2316449Z\",\n \"endTime\": \"2023-09-20T02:35:37.4607486Z\",\n \"error\": {\n \"code\": \"\",\n \"message\": \"operation discarded: Category: InternalError; Code: NotLatestOperation; SubCode: ; Message: Cannot proceed with the operation. Either the operation has been preempted by another one, or the information needed by the operation failed to be saved (or hasn't been saved yet).; InnerMessage: Operation ID does not match latest operation ID in goal state. Operation ID: 1cbd5af2-4080-45d9-8d28-5fef6ea455bd, goalstate latest operation ID: 966b6f7a-fa74-4968-ac3a-48aa70f7fc37. 
This could mean that after the operation is enqueued, the code that saves the goal state kept failing (HCP/database issue), or the operation has been preempted by a new DELETE operation.; Dependency: ; AKSTeam: ; OriginalError: current goalstate is not associated with the ongoing operation\"\n }\n}\n--------------------------------------------------------------------------------\n" "name"="pool3" "namespace"="capz-e2e-6c5zgv" "reconcileID"="478e6211-4faf-4b5b-921e-17e5960b4927" "resultErr"="GET https://management.azure.com/subscriptions/===REDACTED===/providers/Microsoft.ContainerService/locations/westeurope/operations/1cbd5af2-4080-45d9-8d28-5fef6ea455bd\n--------------------------------------------------------------------------------\nRESPONSE 200: 200 OK\nERROR CODE UNAVAILABLE\n--------------------------------------------------------------------------------\n{\n \"name\": \"f25abd1c-8040-d945-8d28-5fef6ea455bd\",\n \"status\": \"Canceled\",\n \"startTime\": \"2023-09-20T02:35:29.2316449Z\",\n \"endTime\": \"2023-09-20T02:35:37.4607486Z\",\n \"error\": {\n \"code\": \"\",\n \"message\": \"operation discarded: Category: InternalError; Code: NotLatestOperation; SubCode: ; Message: Cannot proceed with the operation. Either the operation has been preempted by another one, or the information needed by the operation failed to be saved (or hasn't been saved yet).; InnerMessage: Operation ID does not match latest operation ID in goal state. Operation ID: 1cbd5af2-4080-45d9-8d28-5fef6ea455bd, goalstate latest operation ID: 966b6f7a-fa74-4968-ac3a-48aa70f7fc37. 
This could mean that after the operation is enqueued, the code that saves the goal state kept failing (HCP/database issue), or the operation has been preempted by a new DELETE operation.; Dependency: ; AKSTeam: ; OriginalError: current goalstate is not associated with the ongoing operation\"\n }\n}\n--------------------------------------------------------------------------------\n" "x-ms-correlation-request-id"="2e902dca-a495-4e9a-95d5-838fe833e34e"

@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Sep 22, 2023
@nojnhuh
Contributor Author

nojnhuh commented Sep 22, 2023

/test pull-cluster-api-provider-azure-e2e-aks

4 similar comments
@nojnhuh
Contributor Author

nojnhuh commented Sep 22, 2023

/test pull-cluster-api-provider-azure-e2e-aks

@nojnhuh
Contributor Author

nojnhuh commented Sep 22, 2023

/test pull-cluster-api-provider-azure-e2e-aks

@nojnhuh
Contributor Author

nojnhuh commented Sep 22, 2023

/test pull-cluster-api-provider-azure-e2e-aks

@nojnhuh
Contributor Author

nojnhuh commented Sep 22, 2023

/test pull-cluster-api-provider-azure-e2e-aks

@nojnhuh
Contributor Author

nojnhuh commented Sep 25, 2023

The root of the specific problem we're seeing in the e2e test seems to be that we weren't throttling transient errors in the AMMP controller. There seems to be some hiccup when we send a bunch of deletes in a short timeframe: AKS starts two different delete operations on the same agent pool. AKS lets the second delete supersede the first, but CAPZ doesn't notice the second one and keeps polling the first, leading to the above error.

I think adding transient error handling to the AMMP controller is an obvious-enough improvement that it's worth merging on its own. That alone might fix the flakes, though it doesn't really target the scope of #3970. @mboersma @CecileRobertMichon Do you have thoughts on whether to keep the logging this PR currently adds, in case related issues pop up later?

@CecileRobertMichon
Contributor

It looks like there's some hiccup when we end up sending a bunch of deletes in a short timeframe where AKS starts two different delete operations on the same agent pool

Why are we sending a DELETE API request if there is already an ongoing long-running DELETE operation? That seems like a bug that could also lead to API throttling.

@nojnhuh
Contributor Author

nojnhuh commented Sep 25, 2023

It looks like there's some hiccup when we end up sending a bunch of deletes in a short timeframe where AKS starts two different delete operations on the same agent pool

Why are we sending a DELETE API request if there is already an ongoing long-running DELETE operation? That seems like a bug that could also lead to API throttling.

I didn't get that far, but adding the RequeueAfter for transient errors seems to get around this AFAICT.

@nojnhuh
Contributor Author

nojnhuh commented Sep 25, 2023

The only theory I can posit is that we add the longRunningOperationStates for the delete, immediately requeue, and the read cache is still stale and the object doesn't have that resume token yet, so we start a new operation. I'd hope that the client is smart enough to cache the result of the write though.

@CecileRobertMichon
Contributor

I didn't get that far, but adding the RequeueAfter for transient errors seems to get around this AFAICT.

It seems to me that "controller requeues too often for transient errors" and "async poller sends a DELETE call when there is already an ongoing one" are potentially two separate bugs. Even if the first mitigates the symptoms of the second, we should dig deeper to understand why the second occurs.

@nojnhuh
Contributor Author

nojnhuh commented Sep 26, 2023

I've opened a separate PR with the AMMP transient error handling and I'll leave this open to continue investigating.
#4039

@k8s-ci-robot k8s-ci-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Sep 26, 2023
@nojnhuh
Contributor Author

nojnhuh commented Sep 26, 2023

So I think the root of the problem is that we set the longRunningOperationStates whenever an operation fails, for any reason, when we should be checking whether the operation failed only because it's not done yet. If it fails for any other reason, we should delete the existing longRunningOperationStates so the next reconcile tries again from scratch.

With the changes I just pushed, I hit the same error locally (only once for the pool, as expected) and the test seemed to recover. Here are the full logs for that delete: poolspot-delete.log

@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Sep 26, 2023
@@ -106,20 +106,23 @@ func (s *Service[C, D]) CreateOrUpdateResource(ctx context.Context, spec azure.R

 	result, poller, err := s.Creator.CreateOrUpdateAsync(ctx, spec, resumeToken, parameters)
 	errWrapped := errors.Wrapf(err, "failed to create or update resource %s/%s (service: %s)", rgName, resourceName, serviceName)
-	if poller != nil {
+	if poller != nil && azure.IsContextDeadlineExceededOrCanceledError(err) {
 		future, err := converters.PollerToFuture(poller, infrav1.PutFuture, serviceName, resourceName, rgName)
 		if err != nil {
 			return nil, errWrapped
 		}
 		s.Scope.SetLongRunningOperationState(future)
Contributor Author

@nojnhuh nojnhuh Sep 26, 2023


The functional change here is to only SetLongRunningOperationState when the error is something like "not done yet." In all other cases, successful or not, we DeleteLongRunningOperationState.

Member


Asking for clarity (I may be way off on the workflow): what happens to long-running requests if they fail, or if the higher-level resource cancels the context due to some failure? Do we intend to requeue this resource's request and not save its state in Status.Condition?

Contributor Author


Right. If an operation fails, then it's done, and we shouldn't keep polling that specific operation because it will remain failed. In that case we return the error which will requeue the resource and a new operation will start.

@nojnhuh
Contributor Author

nojnhuh commented Sep 26, 2023

I'm fairly confident this fixes the issue now.

/retitle delete LRO state when operations fail
/hold for squash
/assign @mboersma @CecileRobertMichon

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 26, 2023
@k8s-ci-robot k8s-ci-robot changed the title [WIP] fix error when Create/Update/DeleteAsync return Done pollers delete LRO state when operations fail Sep 26, 2023
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 26, 2023
 	}

-	// Once the operation is done, delete the long-running operation state.
+	// Once the operation is done, delete the long-running operation state. Even if the operation ended with
+	// an error, clear out any lingering state to try the operation again.
Contributor


This is the equivalent of what we were doing before: a521b7b#diff-0e657fbf13cf152e97cf8871a3baf550199d64f99f94316e7e1b9eeb5d6cc8e4L90

@CecileRobertMichon
Contributor

Great find @nojnhuh

Do we still want the change to handle transient errors in the AMMP controller so we don't requeue too aggressively?

@nojnhuh
Contributor Author

nojnhuh commented Sep 26, 2023

Great find @nojnhuh

Do we still want the change to handle transient errors in the AMMP controller so we don't requeue too aggressively?

Thanks! And yes, I think that is also valuable still.

@CecileRobertMichon
Contributor

I just saw you had a separate PR open for that one already, #4039 :)

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 26, 2023
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: 6bf421c6c1b2c5f333562c319ca9fd7eb5fc9a61

Contributor

@mboersma mboersma left a comment


/lgtm

Needs squash. Nice work!

@nojnhuh
Contributor Author

nojnhuh commented Sep 26, 2023

squashed!
/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 26, 2023
Contributor

@CecileRobertMichon CecileRobertMichon left a comment


/approve

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: CecileRobertMichon

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 26, 2023
@k8s-ci-robot k8s-ci-robot merged commit 3a40dd4 into kubernetes-sigs:main Sep 27, 2023
@k8s-ci-robot k8s-ci-robot added this to the v1.12 milestone Sep 27, 2023
@k8s-infra-cherrypick-robot

@nojnhuh: new pull request created: #4044

In response to this:

/cherry-pick release-1.11

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@nojnhuh nojnhuh deleted the async-done-poller branch September 27, 2023 01:11
Successfully merging this pull request may close these issues.

asyncpoller fails when SDK PollUntilDone returns a Done poller