Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

reconciler/managed: avoid requeuing if an update event is pending #527

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

sttts
Copy link
Contributor

@sttts sttts commented Aug 24, 2023

Description of your changes

We use this:

return reconcile.Result{Requeue: true}, errors.Wrap(r.client.Status().Update(ctx, managed), errUpdateManagedStatus)

This pattern means that the object is immediately (modulo queue delay) reconciled again. We use cached clients (controller-runtime's default). Chance is high that the immediate client.Get call will return a stale object if that Update did a change. We will run all the complex reconcile logic (all the work twice), create noise and all that, events for example, to eventually find out that the world has moved on (through our own update) and our update of the second reconcile iteration will fail with a conflict error (another event).

The right pattern would be to only requeue if the update was a noop (= resource version didn't change), or even just set Requeue: false. If the update did something, we will see a watch event anyway (for watched resources at least).

I have:

  • Read and followed Crossplane's contribution process.
  • Run make reviewable test to ensure this PR is ready for review.

How has this code been tested

Signed-off-by: Dr. Stefan Schimanski <stefan.schimanski@upbound.io>
@sttts sttts requested review from a team as code owners August 24, 2023 07:42
@sttts sttts requested review from negz and lsviben August 24, 2023 07:42
@sttts sttts changed the title reconciler/managed: avoid temporary data loss to managed on annotation update reconciler/managed: avoid requeuing if an update event is pending Aug 24, 2023
Copy link
Member

@negz negz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the intention here, but I'd prefer not to have to call a function to assess whether we need to requeue at every return point.

I'm wondering how bad it would be to just return reconcile.Result{} (i.e. implicit Requeue: false). If we did that, we'd still get a reconcile queued immediately if any of the following were true:

  • We updated the MR during this reconcile.
  • We hit a different error than we did last time (i.e. we updated the Synced condition).
  • We failed to update the MR status (i.e. we returned a non-nil error).

So the edge case we'd be susceptible to would be hitting the same error twice in a row. In that case nothing would trigger a reconcile to be queued until the poll interval expired (or a watch event happened).

Comment on lines +1136 to +1138
// the object has changed. We get a watch event and do not need to
// requeue. This helps to avoid a reconcile on a stale read when the
// informer has not caught up.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was under the impression that controller-runtime would deduplicate multiple events for the same object in the queue. i.e. If we return Requeue: true and a watch event is triggered we'd only reconcile once, not twice - https://github.com/kubernetes-sigs/controller-runtime/blob/c20ea143/pkg/doc.go#L184

Why isn't this working? Is the issue that the watch-triggered reconcile has potentially already been popped from the queue and processed by another goroutine before we return Requeue: true?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and a watch event is triggered we'd only reconcile once, not twice

The requeue is instant. The informer has delay.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the issue that the watch-triggered reconcile has potentially already been popped from the queue and processed by another goroutine before we return Requeue: true

No. There is always just one reconcile per key. Only when we return it to the queue, another go routine could take over immediately. But this PR is about the case when there is no event yet, but the key is immediately popped from the queue. That work is both unnecessary and will very likely run into another conflict error.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The requeue is instant. The informer has delay.

I think I understand now. You're saying the issue is that we'll requeue so fast (i.e. instantly) that we'll just read the same stale resource from the informer on the next reconcile, and essentially keep doing that until the informer cache is updated.

@negz
Copy link
Member

negz commented Aug 24, 2023

So the edge case we'd be susceptible to would be hitting the same error twice in a row. In that case nothing would trigger a reconcile to be queued until the poll interval expired (or a watch event happened).

Actually, we have to hit a case where we return reconcile.Result{RequeueAfter: r.pollInterval}. So I think in theory if we returned reconcile.Result{} in error scenarios there'd be a risk that we'd sit waiting for a watch event (or sync interval) before being requeued.

@turkenh
Copy link
Member

turkenh commented Nov 2, 2023

How different is this from #372 ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants