Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Do not retry updates on conflict #217

Merged
merged 4 commits into from
Jul 27, 2023

Conversation

ludydoo
Copy link
Contributor

@ludydoo ludydoo commented Jun 23, 2023

When the reconciliation loops tries to update a CR, and the result is a Conflict, the reconciliation loop should not retry because this error will never succeed. The reconciliation loop needs to be restarted.

Additionally, there was a situation where the resource would be deleted, but the Updater would still try to update the deleted resource, resulting in errors.

@ludydoo ludydoo requested a review from joelanford June 26, 2023 12:42
@ludydoo ludydoo force-pushed the no-retry-on-conflict branch 2 times, most recently from 548739d to 238b674 Compare June 26, 2023 12:47
@joelanford
Copy link
Member

Does this change result in conflict errors consistently showing up in the logs during uninstall? In general, I think that's reasonable because it is what is happening, but I've seen that conflict error make many, many users nervous in lots of different operators.

It seems like if it is showing up, we would want to quickly see if its possible to eliminate the conflict issue altogether.

@ludydoo
Copy link
Contributor Author

ludydoo commented Jun 27, 2023

Does this change result in conflict errors consistently showing up in the logs during uninstall

@joelanford

Yes, we've seen it happen pretty much consistently. These changes should fix that, basically by not performing an update on a Gone resource after the controllerutils.WaitForDeletion. I think it's also a misleading apiServer error (precondition failed) rather than a 404 not found.

@ludydoo
Copy link
Contributor Author

ludydoo commented Jun 28, 2023

@joelanford let me know if there's anything I can do in terms of tweaking/changing stuff !

@joelanford
Copy link
Member

I'm being pulled in a bunch of different directions, so I'll volunteer @everettraven (if he's willing) to take a look as well.

@everettraven
Copy link
Contributor

I'll add it to my queue 😄

Copy link
Contributor

@everettraven everettraven left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes look good to me and seem reasonable. Thanks @ludydoo for taking the time to make this contribution!

@ludydoo
Copy link
Contributor Author

ludydoo commented Jul 4, 2023

@joelanford would this be good to go ?

Comment on lines 68 to 69
// In case of a Conflict error, the update cannot be retried,
// and the reconciliation loop needs to be restarted anew.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a bit awkward, because these two new comment lines applies to both retry.OnError calls. However the above 3 lines only apply to the first retry.OnError call.

I can suggest two solutions:

  1. move these two lines to a separate comment above this one. However this takes it further away from the second OnError call, which is not great.
  2. create a helper function such as:
func retryOnRetryableUpdateError(backoff wait.Backoff, fn func() error) {
  return retry.OnError(backoff, isRetryableUpdateError, fn)
}

and then we could put this comment on this function, which would be the most appropriate place. We could optionally also inline isRetryableUpdateError in that case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good :)

@@ -521,8 +521,12 @@ func (r *Reconciler) Reconcile(ctx context.Context, req ctrl.Request) (res ctrl.
return ctrl.Result{}, err
}

shouldUpdate := true
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Introducing a new variable here that is ready by the defer closure (and not easy to grok straight away), I think it might be cleaner to introduce a method on the updater such as SkipAllUpdates() or DoNotApply() that would:

  • cause Apply(...) to just return nil without doing anything, and
  • (optional) cause the updater to panic() on further Ensure... calls to make sure that SkipUpdates is closely followed by a return from Reconcile()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this a lot 💯

return err
}
} else {
log.Info("Resource is already terminated, skipping deletion")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
log.Info("Resource is already terminated, skipping deletion")
log.Info("Resource is already terminated, skipping deletion.")

if err := r.handleDeletion(ctx, actionClient, obj, log); err != nil {
return ctrl.Result{}, err
}
shouldUpdate = false
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is slightly different from the case described in the PR description, right?
I'm guessing the problem here was that on deletions both the uninstallUpdater and the regular u updater would fire, and the latter was both useless and would cause errors since the resource would typically be already gone after deleting the finalizer?
If I got this right, please mention this in the PR description.

@ludydoo ludydoo requested a review from porridge July 13, 2023 10:34
Copy link
Member

@porridge porridge left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks nice, but please add tests for CancelUpdates to the updater package for completeness.

Copy link
Member

@varshaprasad96 varshaprasad96 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
/approve

The changes look good to me. Agree on adding a test for CancelUpdates() to keep the code coverage same. We can merge this for now to move fwd, and will create an issue for followup as its straightforward.

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jul 27, 2023
@varshaprasad96 varshaprasad96 merged commit 31544ee into operator-framework:main Jul 27, 2023
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/sdk lgtm Indicates that a PR is ready to be merged.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants