Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve thread-safety of sync status updates #301

Merged

Conversation

karlkfi
Copy link
Contributor

@karlkfi karlkfi commented Nov 16, 2022

  • Wrap applier errors in their own mutex
  • Move applier.Syncing() to Parser.Syncing() with updater.Updating() doing the heavy lifting. The Syncing condition status should reflect the whole updater, not just the applier.
  • Rename applier.Interface -> KptApplier to make room for adding KptDestroyer in the future.
  • Rename KptApplier.sync -> applyInner to make room for adding destroyInner in the future.
  • Refactor state.syncStatus and Parser.SetSyncStatus to use a new syncStatus struct, which is like the sourceStatus struct but with an added sync bool. This should also be less confusing, because state.syncStatus no longer uses a sourceStatus struct as its value.
  • Add a new setSyncStatusErrors func to use in prependRootSyncRemediatorStatus, to avoid needing to to re-construct the initial syncStatus when prepending errors. (Note: This cross-rsync update changes the .status.sync.errs and .status.sync.lastUpdate without updating the Syncing condition or reconciler_errors metric, which means they will continue to be out of sync when there's a conflict. But this is not a new problem.)

- Wrap applier errors in their own mutex
- Move applier.Syncing() to Parser.Syncing() with updater.Updating()
  doing the heavy lifting. The Syncing condition status should
  reflect the whole updater, not just the applier.
- Rename applier.Interface -> KptApplier to make room for adding
  KptDestroyer in the future.
- Rename KptApplier.sync -> applyInner to make room for adding
  destroyInner in the future.
- Refactor state.syncStatus and Parser.SetSyncStatus to use a new
  syncStatus struct, which is like the sourceStatus struct but with
  an added sync bool. This should also be less confusing, because
  state.syncStatus no longer uses a sourceStatus struct as its value.
- Add a new setSyncStatusErrors func to use in
  prependRootSyncRemediatorStatus, to avoid needing to to re-construct
  the initial syncStatus when prepending errors.

Change-Id: Ib4316b8094c371d9371e7bb48bd3b309e668a78b
@@ -77,38 +77,37 @@ type Applier struct {
// errs tracks all the errors the applier encounters.
// This field is cleared at the start of the `Applier.Apply` method
errs status.MultiError
// syncing indicates whether the applier is syncing.
syncing bool
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh man was this another bool used for synchronization?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yup. and set outside of the lock.

// Ideally we want to avoid invalidating errors that will continue to happen,
// but for now, invalidate all errors until they recur.
// TODO: improve error cache invalidation to make rsync status more stable
a.invalidateErrors()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the plan for how we want to handle this in the long run?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm writing a doc. But the TLDR is to do more careful invalidation by caching what event type and object added/owns the error, and not invalidating that error until we see that event again, or the apply ends.

u.watchErrs = errs
}

// Updating returns true if the Update method is running.
func (u *updater) Updating() bool {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This smells a bit like the other bool. Is the idea that by setting this at a higher level (rather than reaching into the applier to get the value) it is less prone to race conditions?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moving it up a level allows including the other actions performed by updater.Update().

Setting the bool inside the newly added mutex ensures that the bool value is not affected by race conditions.

But yes, it's a lazy fix, not an ideal fix. Ideally the control flow would be modified to use channels, but I decided to start with the quick fix, because I've had trouble getting large rewrites tested and approved. And I'd like to get back to finishing the finalizer.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The errors and syncing bool could be bundled together in a Status() result or something, to be atomic, and the current status could be the result of a for loop that reads from a channel accepting errors and start/stop signals.

But once you go down that road, you realize that the applier, updater, parser, and reconciler layers ALL have the same problematic pattern and ALL need to be rewritten to use a top level reconcile loop. Then you wouldn't need to synchronize any of these lower layers.

Copy link
Contributor

@nan-yu nan-yu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@google-oss-prow
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: nan-yu

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@haiyanmeng
Copy link
Contributor

/hold

@haiyanmeng
Copy link
Contributor

/unhold

@karlkfi
Copy link
Contributor Author

karlkfi commented Nov 17, 2022

/retest

Looks like TestInvalidRepoSyncBranchStatus might be flakey with a missing reconciler_errors

@google-oss-prow google-oss-prow bot merged commit 97dce05 into GoogleContainerTools:main Nov 17, 2022
@karlkfi karlkfi deleted the karl-applier-errors branch November 17, 2022 19:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants