Improve thread-safety of sync status updates #301

karlkfi · 2022-11-16T02:44:09Z

- Wrap applier errors in their own mutex.
- Move applier.Syncing() to Parser.Syncing(), with updater.Updating() doing the heavy lifting. The Syncing condition status should reflect the whole updater, not just the applier.
- Rename applier.Interface -> KptApplier to make room for adding KptDestroyer in the future.
- Rename KptApplier.sync -> applyInner to make room for adding destroyInner in the future.
- Refactor state.syncStatus and Parser.SetSyncStatus to use a new syncStatus struct, which is like the sourceStatus struct but with an added sync bool. This should also be less confusing, because state.syncStatus no longer uses a sourceStatus struct as its value.
- Add a new setSyncStatusErrors func to use in prependRootSyncRemediatorStatus, to avoid needing to reconstruct the initial syncStatus when prepending errors. (Note: This cross-rsync update changes .status.sync.errs and .status.sync.lastUpdate without updating the Syncing condition or the reconciler_errors metric, which means they will continue to be out of sync when there's a conflict. But this is not a new problem.)
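As a rough illustration of the first item, here is a minimal sketch of error tracking guarded by its own mutex. The `applier` type, the `errsMux` field, and the `addError` and `Errors` helpers are assumptions made for this example and are not taken from the PR; only the name `invalidateErrors` appears in the diff.

```go
package main

import (
	"fmt"
	"sync"
)

// applier is a hypothetical stand-in for the real Applier; only the
// error-tracking part is sketched here.
type applier struct {
	// errsMux guards errs and nothing else, so reading errors never
	// has to wait on whatever lock protects the apply itself.
	errsMux sync.RWMutex
	errs    []error
}

// addError appends an error under the write lock.
func (a *applier) addError(err error) {
	a.errsMux.Lock()
	defer a.errsMux.Unlock()
	a.errs = append(a.errs, err)
}

// invalidateErrors clears the cached errors under the write lock.
func (a *applier) invalidateErrors() {
	a.errsMux.Lock()
	defer a.errsMux.Unlock()
	a.errs = nil
}

// Errors returns a copy so callers never share the guarded slice.
func (a *applier) Errors() []error {
	a.errsMux.RLock()
	defer a.errsMux.RUnlock()
	return append([]error(nil), a.errs...)
}

func main() {
	a := &applier{}
	var wg sync.WaitGroup
	for i := 0; i < 10; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			a.addError(fmt.Errorf("apply failed for object %d", i))
		}(i)
	}
	wg.Wait()
	fmt.Println(len(a.Errors())) // 10
	a.invalidateErrors()
	fmt.Println(len(a.Errors())) // 0
}
```

Running this with `go run -race` should report no data races, since every access to `errs` goes through `errsMux`.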

Change-Id: Ib4316b8094c371d9371e7bb48bd3b309e668a78b

sdowell · 2022-11-16T19:06:28Z

pkg/applier/kpt_applier.go

@@ -77,38 +77,37 @@ type Applier struct {
 	// errs tracks all the errors the applier encounters.
 	// This field is cleared at the start of the `Applier.Apply` method
 	errs status.MultiError
-	// syncing indicates whether the applier is syncing.
-	syncing bool


Oh man, was this another bool used for synchronization?

Yup. And it was set outside of the lock.

sdowell · 2022-11-16T19:09:26Z

pkg/applier/kpt_applier.go

+	// Ideally we want to avoid invalidating errors that will continue to happen,
+	// but for now, invalidate all errors until they recur.
+	// TODO: improve error cache invalidation to make rsync status more stable
+	a.invalidateErrors()


What's the plan for how we want to handle this in the long run?

I'm writing a doc. But the TL;DR is to invalidate more carefully: cache which event type and object added (and so owns) each error, and don't invalidate that error until we see that event again or the apply ends.
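One possible reading of that plan, sketched in Go. Everything here is hypothetical (the `errorKey` fields, the `errorCache` type, and the `beginApply`/`observe`/`endApply` names); the actual design doc may differ. The sketch is not goroutine-safe on its own and would need a mutex in real use.

```go
package main

import (
	"errors"
	"fmt"
)

// errExample is a placeholder error used only by this sketch.
var errExample = errors.New("example apply failure")

// errorKey identifies which event type and object own an error.
// Both fields are assumptions made for illustration.
type errorKey struct {
	eventType string // e.g. "apply", "prune", "wait"
	objectID  string // e.g. "namespace/name"
}

// cachedError remembers the apply attempt in which the error was last seen.
type cachedError struct {
	err        error
	generation int
}

// errorCache keeps the latest error per (event type, object) pair.
type errorCache struct {
	generation int
	errs       map[errorKey]cachedError
}

func newErrorCache() *errorCache {
	return &errorCache{errs: make(map[errorKey]cachedError)}
}

// beginApply starts a new attempt without discarding earlier errors.
func (c *errorCache) beginApply() { c.generation++ }

// observe records the outcome of one event. Seeing the same event again
// replaces its cached error, or removes it when the event succeeded.
func (c *errorCache) observe(eventType, objectID string, err error) {
	key := errorKey{eventType: eventType, objectID: objectID}
	if err == nil {
		delete(c.errs, key)
		return
	}
	c.errs[key] = cachedError{err: err, generation: c.generation}
}

// endApply drops errors whose event did not recur during this attempt.
func (c *errorCache) endApply() {
	for key, ce := range c.errs {
		if ce.generation != c.generation {
			delete(c.errs, key)
		}
	}
}

func main() {
	c := newErrorCache()

	c.beginApply()
	c.observe("apply", "ns/a", errExample)
	c.observe("apply", "ns/b", errExample)
	c.endApply()
	fmt.Println(len(c.errs)) // 2

	c.beginApply()
	fmt.Println(len(c.errs)) // 2: still reported while the new apply runs
	c.observe("apply", "ns/a", nil)
	fmt.Println(len(c.errs)) // 1: only the error for ns/a was invalidated
	c.endApply()
	fmt.Println(len(c.errs)) // 0: ns/b was not seen again in this attempt
}
```

Compared with clearing everything at the start of each apply, this keeps the reported status stable until there is evidence that an error has actually gone away.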

sdowell · 2022-11-16T19:24:11Z

pkg/parse/updater.go

 	u.watchErrs = errs
 }

+// Updating returns true if the Update method is running.
+func (u *updater) Updating() bool {


This smells a bit like the other bool. Is the idea that by setting this at a higher level (rather than reaching into the applier to get the value) it is less prone to race conditions?

Moving it up a level allows including the other actions performed by updater.Update().

Setting the bool inside the newly added mutex ensures that the bool value is not affected by race conditions.
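A minimal sketch of that quick fix, assuming a read-write mutex on the updater. The `mux` and `updating` field names, the `setUpdating` helper, and the `work` callback are assumptions for illustration; only `Updating()` and `Update()` are named in the PR.

```go
package main

import (
	"fmt"
	"sync"
)

// updater is a hypothetical, stripped-down version of the real type.
type updater struct {
	mux      sync.RWMutex
	updating bool
}

// Updating returns true if the Update method is running.
func (u *updater) Updating() bool {
	u.mux.RLock()
	defer u.mux.RUnlock()
	return u.updating
}

// setUpdating writes the flag under the lock, so readers never observe
// a write that races with their read.
func (u *updater) setUpdating(v bool) {
	u.mux.Lock()
	defer u.mux.Unlock()
	u.updating = v
}

// Update flags the whole update, not just the apply step. The work
// callback stands in for the steps the real Update performs.
func (u *updater) Update(work func()) {
	u.setUpdating(true)
	defer u.setUpdating(false)
	work() // the lock is not held here, so Updating() stays callable
}

func main() {
	u := &updater{}
	fmt.Println(u.Updating()) // false
	u.Update(func() {
		fmt.Println(u.Updating()) // true
	})
	fmt.Println(u.Updating()) // false
}
```

The lock is held only while reading or writing the flag, not for the duration of the update, so status readers are never blocked by a long-running apply.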

But yes, it's a lazy fix, not an ideal fix. Ideally the control flow would be modified to use channels, but I decided to start with the quick fix, because I've had trouble getting large rewrites tested and approved. And I'd like to get back to finishing the finalizer.

The errors and syncing bool could be bundled together in a Status() result or something, to be atomic, and the current status could be the result of a for loop that reads from a channel accepting errors and start/stop signals.
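The channel-based alternative described above might look roughly like the following. This is a sketch only: the `Status`, `event`, `next`, and `run` names are invented for the example and do not exist in the PR.

```go
package main

import (
	"errors"
	"fmt"
)

// errApply is a placeholder error used only by this sketch.
var errApply = errors.New("apply failed")

type eventKind int

const (
	started eventKind = iota
	failed
	stopped
)

// event is a start/stop signal or an error report.
type event struct {
	kind eventKind
	err  error
}

// Status bundles the syncing flag and the errors so they change together.
type Status struct {
	Syncing bool
	Errs    []error
}

// next folds one event into a new Status without mutating the old one.
func next(cur Status, ev event) Status {
	out := Status{Syncing: cur.Syncing, Errs: append([]error(nil), cur.Errs...)}
	switch ev.kind {
	case started:
		out = Status{Syncing: true} // a new sync starts with no errors
	case failed:
		out.Errs = append(out.Errs, ev.err)
	case stopped:
		out.Syncing = false
	}
	return out
}

// run is the only goroutine that owns the current status, so no mutex
// is needed; it publishes a snapshot after every event.
func run(events <-chan event, snapshots chan<- Status) {
	var cur Status
	for ev := range events {
		cur = next(cur, ev)
		snapshots <- cur
	}
	close(snapshots)
}

func main() {
	events := make(chan event)
	snapshots := make(chan Status)
	go run(events, snapshots)
	go func() {
		events <- event{kind: started}
		events <- event{kind: failed, err: errApply}
		events <- event{kind: stopped}
		close(events)
	}()
	for s := range snapshots {
		fmt.Printf("syncing=%t errors=%d\n", s.Syncing, len(s.Errs))
	}
}
```

Because each snapshot carries both fields, a reader can never see a new syncing value paired with a stale error list, which is the atomicity the comment is after.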

But once you go down that road, you realize that the applier, updater, parser, and reconciler layers ALL have the same problematic pattern and ALL need to be rewritten to use a top-level reconcile loop. Then you wouldn't need to synchronize any of these lower layers.

nan-yu

/lgtm

google-oss-prow · 2022-11-16T20:22:15Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: nan-yu

Needs approval from an approver in each of these files:

~~OWNERS~~ [nan-yu]

haiyanmeng · 2022-11-16T20:34:38Z

/hold

haiyanmeng · 2022-11-17T18:04:26Z

/unhold

karlkfi · 2022-11-17T19:13:54Z

/retest

Looks like TestInvalidRepoSyncBranchStatus might be flaky, with a missing reconciler_errors metric.

karlkfi requested review from haiyanmeng and nan-yu November 16, 2022 02:44

google-oss-prow bot requested a review from mikebz November 16, 2022 02:44

google-oss-prow bot added the size/L label Nov 16, 2022

karlkfi removed the request for review from mikebz November 16, 2022 02:48

karlkfi mentioned this pull request Nov 16, 2022

[WIP] Update Syncing condition to include all errors #260

Open

haiyanmeng reviewed Nov 16, 2022

sdowell reviewed Nov 16, 2022

nan-yu approved these changes Nov 16, 2022

google-oss-prow bot assigned nan-yu Nov 16, 2022

google-oss-prow bot added the lgtm label Nov 16, 2022

google-oss-prow bot added the approved label Nov 16, 2022

google-oss-prow bot added the do-not-merge/hold label Nov 16, 2022

google-oss-prow bot removed the do-not-merge/hold label Nov 17, 2022

google-oss-prow bot merged commit 97dce05 into GoogleContainerTools:main Nov 17, 2022

karlkfi deleted the karl-applier-errors branch November 17, 2022 19:28
