-
Notifications
You must be signed in to change notification settings - Fork 40
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Simplify status reporting of sync errors #281
Simplify status reporting of sync errors #281
Conversation
7ec2b52
to
6a77fa0
Compare
840a649
to
14895f2
Compare
e2e/testcases/multi_sync_test.go
Outdated
nt.T.Logf("Root %s should also get surfaced with the conflict error", configsync.RootSyncName) | ||
nt.WaitForRootSyncSyncError(configsync.RootSyncName, status.ManagementConflictErrorCode, "declared in another repository") | ||
// When the webhook is enabled, it will stop adoption of managed objects. | ||
// So only the second RootSync will show the conflict error. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The second RootSync will show the conflict error because the admission webhook denied the update. Then, the remediator of the second RootSync will detect the drift and surface the conflict to both RootSyncs. The reason for reporting to both is that the object can be removed from either of the repos.
Here is the logic to report the conflict error for the second RootSync:
https://github.com/GoogleContainerTools/kpt-config-sync/blob/main/pkg/parse/run.go#L422.
Here is how to surface a similar error to the conflicting RootSync: https://github.com/GoogleContainerTools/kpt-config-sync/blob/main/pkg/parse/run.go#L427.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. So by removing UpdateConflictManagerStatus
and just merging p.RemediatorConflictErrors()
I broke the "logic to report the conflict error for the second RootSync".
It wasn't super clear what that logic was doing. But I see now it's calling conflictError.ConflictingManagerError()
on every element of the []ManagementConflictError
, which I guess inverts the error.
Why isn't that logic all just in the Remediator? It seems pretty convoluted to need to process the ManagementConflictErrors into a MultiError in the parser.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've reverted this change and instead migrated both WaitForRootSyncSyncError
calls to use WaitForObject
in parallel instead. This should evaluate faster and more consistently, without actually changing the existing behavior, except to make it less flakey.
14895f2
to
db6705b
Compare
/hold Turns out I unintentionally deleted the cross-RootSync conflict reporting in this PR. While this functionality is of debatable value, it does help compensate for the flakiness of errors from the applier, which are deleted before each apply run. So after talking to @nan-yu and @janetkuo, we decided to remove cross-RootSync conflict reporting, but only after adding error merging to the applier. Error merging will hold onto apply errors until we see an event like the one that caused it. This will avoid loss of errors (including conflict errors) while the applier is running. Afterwards, we can continue this PR to remove the cross-RootSync conflict reporting. |
db6705b
to
25d2d69
Compare
Extracted WatchForObject to #288 |
25d2d69
to
fc49838
Compare
/retest |
33d83d3
to
1939586
Compare
/unhold I reverted the UpdateConflictManagerStatus removal and went with a refactor instead, to make sure it was being called consistently. We can remove it later, but it doesn't need to block the rest of these status consistency improvements. |
resources: resources, | ||
applier: app, | ||
remediator: rem, | ||
return &root{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This refactor just moves the opts
construction into the root
to avoid a copy of the mutex pointer, which fixes a linter error.
func UpdateConflictManagerStatus(ctx context.Context, conflictErrs []status.ManagementConflictError, k8sClient client.Client) { | ||
// reportRootSyncConflicts reports conflicts to the RootSync that manages the | ||
// conflicting resources. | ||
func reportRootSyncConflicts(ctx context.Context, k8sClient client.Client, conflictErrs []status.ManagementConflictError) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know if this name change is critical. I can revert it if you like the old name better. But the old name made me think THIS RootSync was being updated and not some other RootSync. I used "report" instead of "update" because it seemed more like something you might do to a remote object. I also made it lowercase because now that the async status update is here, it's only used in this file, so it doesn't need to be public any more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sounds good
} | ||
if err := prependRootSyncRemediatorStatus(ctx, k8sClient, name, conflictErrors, defaultDenominator); err != nil { | ||
klog.Warningf("failed to add the management conflict error to RootSync %s: %v", name, err) | ||
if scope == declared.RootReconciler { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This little refactor hopefully only changes readability, comments, and logging. Otherwise it's functionally the same.
@@ -525,7 +525,9 @@ func (a *Applier) sync(ctx context.Context, objs []client.Object) (map[schema.Gr | |||
// Errors implements Interface. | |||
// Errors returns the errors encountered during apply. | |||
func (a *Applier) Errors() status.MultiError { | |||
return a.errs | |||
// TODO: Make read/write of a.errs thread-safe |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately, we can't just add a mutex around the errors here, because they're modified in multiple places. But fixing that is a larger change I want to do in a separate PR.
klog.Infof("No need to notify namespace reconciler for the conflict") | ||
continue | ||
} | ||
if err := prependRootSyncRemediatorStatus(ctx, k8sClient, name, conflictErrors, defaultDenominator); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I follow this - why do we need to surface conflict errors on the other RootSync? It seems like it should be sufficient to just surface it on this reconciler's RootSync
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The original intent was to make the conflict easy to be discovered by both RootSyncs. It is possible that the syncing process takes 10-20 minutes and the RootSync remains syncing without any issue. The cross-posted conflict error reports the conflicts earlier.
After talking with Karl, he will remove the cross-posted error in a separate PR after adding error merging to the applier.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code causes reconciler-B to update the rsync status of reconciler-A. This makes it possible for the owner of rsync-A to know there's a conflict.
Since reconciler-A applied the object first, the object has the manager set to reconciler-A. So when reconciler-B tries to update that object the apply fails, rejected by the webhook. Without this code, when the webhook is enabled, only rsync-B shows the conflict.
When the webhook is disabled, any two RootSyncs trying to manage the same object with actively fight, each taking ownership of the object with every apply. In this case, both will show the error from both the applier and the remediator.
RepoSyncs only adopt if unmanaged. So they won't fight, but can conflict. As with the webhook being enabled, this code reports the conflict on both sides.
Personally I think this behavior is confusing, harmful because it causes status flakiness, and technically incorrect because it makes the metrics not match the status API. But @nan-yu thinks we should keep it for now to make it easier to detect conflicts. Our compromise is for us to remove it later, after we improve the applier errors to not invalidate all errors when starting a new apply until we receive the event that created the error, thus making it easier to detect conflict errors from the applier.
e2e/testcases/multi_sync_test.go
Outdated
func() { | ||
nomostest.WatchForObject(nt, kinds.RootSyncV1Beta1(), rootSync2, configsync.ControllerNamespace, | ||
[]nomostest.Predicate{ | ||
nomostest.RootSyncHasSyncError(nt, status.ManagementConflictErrorCode, "declared in another repository"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to be clear, is it intentional that the error code and message changed from denied the request
to declared in another repository
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the first RootSync, it should show the denied the request
error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AFAICT, "denied the request" doesn't exist in the code.
"declared in another repository" is from the KptManagementConflictError
which is returned by both the applier and the remediator.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
denied the request
is from the admission webhook: https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/apiserver/pkg/admission/plugin/webhook/errors/statuserror.go#L29.
The error will first show up because the admission webhook denied the request. The declared in another repository
error is surfaced by the conflicting remediator. The RootSync status will flip.
Technically, we should expect the denied the request
error for the first RootSync, especially the cross-RootSync update is going to be removed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for finding the denied the request
code. Not sure why it didn't come up in my search.
Should we expect both the denied the request
and the declared in another repository
then? The tests are passing as-is, which seems to indicate that the conflict error from the remediator is definitely being propagated from rootSync2 to rootSync1. Adding denied the request
would confirm that it also was detected via the apply being rejected by the webhook. But we probably wouldn't expect to see them at the same time...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, we can expect both errors. Both RootSync1's applier and RootSync2's remediator updates the RootSync1's status. The RootSync1's status was expected to bounce back and forth to report the conflict in different ways. For this test, we care more about the denied the request
error.
As we're planning to remove the cross-RootSync update, we can validate the denied the request
error for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've updated the test to check for all the possible errors.
When we disable cross-RootSync update, some of these will fail and we can remove them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good with one minor error message. When the webhook is disabled, the first RootSync should expect a denied the request
error.
e2e/testcases/multi_sync_test.go
Outdated
func() { | ||
nomostest.WatchForObject(nt, kinds.RootSyncV1Beta1(), rootSync2, configsync.ControllerNamespace, | ||
[]nomostest.Predicate{ | ||
nomostest.RootSyncHasSyncError(nt, status.ManagementConflictErrorCode, "declared in another repository"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the first RootSync, it should show the denied the request
error.
func UpdateConflictManagerStatus(ctx context.Context, conflictErrs []status.ManagementConflictError, k8sClient client.Client) { | ||
// reportRootSyncConflicts reports conflicts to the RootSync that manages the | ||
// conflicting resources. | ||
func reportRootSyncConflicts(ctx context.Context, k8sClient client.Client, conflictErrs []status.ManagementConflictError) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sounds good
klog.Infof("No need to notify namespace reconciler for the conflict") | ||
continue | ||
} | ||
if err := prependRootSyncRemediatorStatus(ctx, k8sClient, name, conflictErrors, defaultDenominator); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The original intent was to make the conflict easy to be discovered by both RootSyncs. It is possible that the syncing process takes 10-20 minutes and the RootSync remains syncing without any issue. The cross-posted conflict error reports the conflicts earlier.
After talking with Karl, he will remove the cross-posted error in a separate PR after adding error merging to the applier.
1939586
to
82dbb21
Compare
- Add a new setSyncStatus that handles: - Updating the cached sync status in `state` - Checking the cached sync status to avoid unnecessary updates - Calling SetSyncStatus with the new sync errors - Calling reportRootSyncConflicts to update remote RootSyncs, if they are causing conflicts with the current RSync. - Rename UpdateConflictManagerStatus -> reportRootSyncConflicts - Rename parse.setXStatus -> setXStatusFields to allow the name setSyncStatus to be used by the wrapper that handles state - Call the new parse.setSyncStatus to make all the updates consistent, instead of calling parser.SetSyncStatus directly. - Rename updateSyncStatus -> updateSyncStatusPeriodically to indicate it handles retrying until cancelled. - Rename Parser.ApplierErrors() -> SyncErrors() and have it delegate to the new updater.Errors(), which centralizes aggregation of the following errors that are all reported as sync errors: - Remediator conflict errors - Validation errors from updating the object set - Applier errors - Watch update errors - Remove Parser.RemediatorConflictErrors() because it's obsolete with the new SyncErrors. The conflict errors are a subset of the sync errors and are extracted as-needed in setSyncErrors. This extraction probably won't be necessary in the future, as we plan to stop reporting errors across RSyncs. - Enhance TestConflictingDefinitions_RootToRoot to handle catching conflict errors in parallel, for speed and consistently. - Enahance TestDontDeleteAllNamespaces to use WatchForObject - Fix linter copylock error Change-Id: I79aa32d22080bc91a4262279de0b490974564112
82dbb21
to
0924850
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: nan-yu The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
state
they are causing conflicts with the current RSync.
setSyncStatus to be used by the wrapper that handles state
consistent, instead of calling parser.SetSyncStatus directly.
indicate it handles retrying until cancelled.
delegate to the new updater.Errors(), which centralizes
aggregation of the following errors that are all reported as
sync errors:
with the new SyncErrors. The conflict errors are a subset of the
sync errors and are extracted as-needed in setSyncErrors.
This extraction probably won't be necessary in the future, as we
plan to stop reporting errors across RSyncs.
conflict errors in parallel, for speed and consistently.