Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Simplify status reporting of sync errors #281

Merged

Conversation

karlkfi
Copy link
Contributor

@karlkfi karlkfi commented Nov 11, 2022

  • Add a new setSyncStatus that handles:
    • Updating the cached sync status in state
    • Checking the cached sync status to avoid unnecessary updates
    • Calling SetSyncStatus with the new sync errors
    • Calling reportRootSyncConflicts to update remote RootSyncs, if
      they are causing conflicts with the current RSync.
  • Rename UpdateConflictManagerStatus -> reportRootSyncConflicts
  • Rename parse.setXStatus -> setXStatusFields to allow the name
    setSyncStatus to be used by the wrapper that handles state
  • Call the new parse.setSyncStatus to make all the updates
    consistent, instead of calling parser.SetSyncStatus directly.
  • Rename updateSyncStatus -> updateSyncStatusPeriodically to
    indicate it handles retrying until cancelled.
  • Rename Parser.ApplierErrors() -> SyncErrors() and have it
    delegate to the new updater.Errors(), which centralizes
    aggregation of the following errors that are all reported as
    sync errors:
    • Remediator conflict errors
    • Validation errors from updating the object set
    • Applier errors
    • Watch update errors
  • Remove Parser.RemediatorConflictErrors() because it's obsolete
    with the new SyncErrors. The conflict errors are a subset of the
    sync errors and are extracted as-needed in setSyncErrors.
    This extraction probably won't be necessary in the future, as we
    plan to stop reporting errors across RSyncs.
  • Enhance TestConflictingDefinitions_RootToRoot to handle catching
    conflict errors in parallel, for speed and consistently.
  • Fix linter copylock error

nt.T.Logf("Root %s should also get surfaced with the conflict error", configsync.RootSyncName)
nt.WaitForRootSyncSyncError(configsync.RootSyncName, status.ManagementConflictErrorCode, "declared in another repository")
// When the webhook is enabled, it will stop adoption of managed objects.
// So only the second RootSync will show the conflict error.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The second RootSync will show the conflict error because the admission webhook denied the update. Then, the remediator of the second RootSync will detect the drift and surface the conflict to both RootSyncs. The reason for reporting to both is that the object can be removed from either of the repos.

Here is the logic to report the conflict error for the second RootSync:
https://github.com/GoogleContainerTools/kpt-config-sync/blob/main/pkg/parse/run.go#L422.

Here is how to surface a similar error to the conflicting RootSync: https://github.com/GoogleContainerTools/kpt-config-sync/blob/main/pkg/parse/run.go#L427.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. So by removing UpdateConflictManagerStatus and just merging p.RemediatorConflictErrors() I broke the "logic to report the conflict error for the second RootSync".

It wasn't super clear what that logic was doing. But I see now it's calling conflictError.ConflictingManagerError() on every element of the []ManagementConflictError, which I guess inverts the error.

Why isn't that logic all just in the Remediator? It seems pretty convoluted to need to process the ManagementConflictErrors into a MultiError in the parser.

Copy link
Contributor Author

@karlkfi karlkfi Nov 15, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've reverted this change and instead migrated both WaitForRootSyncSyncError calls to use WaitForObject in parallel instead. This should evaluate faster and more consistently, without actually changing the existing behavior, except to make it less flakey.

pkg/parse/root.go Show resolved Hide resolved
pkg/parse/run.go Show resolved Hide resolved
pkg/parse/root.go Show resolved Hide resolved
pkg/parse/run.go Show resolved Hide resolved
@karlkfi karlkfi changed the title Simplify status reporting of sync errors [WIP] Simplify status reporting of sync errors Nov 14, 2022
@karlkfi
Copy link
Contributor Author

karlkfi commented Nov 14, 2022

/hold

Turns out I unintentionally deleted the cross-RootSync conflict reporting in this PR. While this functionality is of debatable value, it does help compensate for the flakiness of errors from the applier, which are deleted before each apply run.

So after talking to @nan-yu and @janetkuo, we decided to remove cross-RootSync conflict reporting, but only after adding error merging to the applier. Error merging will hold onto apply errors until we see an event like the one that caused it. This will avoid loss of errors (including conflict errors) while the applier is running. Afterwards, we can continue this PR to remove the cross-RootSync conflict reporting.

@karlkfi
Copy link
Contributor Author

karlkfi commented Nov 14, 2022

Extracted WatchForObject to #288

@karlkfi
Copy link
Contributor Author

karlkfi commented Nov 15, 2022

/retest

@karlkfi karlkfi force-pushed the karl-combo-sync-errors branch 2 times, most recently from 33d83d3 to 1939586 Compare November 15, 2022 06:51
@karlkfi karlkfi changed the title [WIP] Simplify status reporting of sync errors Simplify status reporting of sync errors Nov 15, 2022
@karlkfi
Copy link
Contributor Author

karlkfi commented Nov 15, 2022

/unhold

I reverted the UpdateConflictManagerStatus removal and went with a refactor instead, to make sure it was being called consistently.

We can remove it later, but it doesn't need to block the rest of these status consistency improvements.

resources: resources,
applier: app,
remediator: rem,
return &root{
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This refactor just moves the opts construction into the root to avoid a copy of the mutex pointer, which fixes a linter error.

func UpdateConflictManagerStatus(ctx context.Context, conflictErrs []status.ManagementConflictError, k8sClient client.Client) {
// reportRootSyncConflicts reports conflicts to the RootSync that manages the
// conflicting resources.
func reportRootSyncConflicts(ctx context.Context, k8sClient client.Client, conflictErrs []status.ManagementConflictError) error {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know if this name change is critical. I can revert it if you like the old name better. But the old name made me think THIS RootSync was being updated and not some other RootSync. I used "report" instead of "update" because it seemed more like something you might do to a remote object. I also made it lowercase because now that the async status update is here, it's only used in this file, so it doesn't need to be public any more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sounds good

}
if err := prependRootSyncRemediatorStatus(ctx, k8sClient, name, conflictErrors, defaultDenominator); err != nil {
klog.Warningf("failed to add the management conflict error to RootSync %s: %v", name, err)
if scope == declared.RootReconciler {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This little refactor hopefully only changes readability, comments, and logging. Otherwise it's functionally the same.

@@ -525,7 +525,9 @@ func (a *Applier) sync(ctx context.Context, objs []client.Object) (map[schema.Gr
// Errors implements Interface.
// Errors returns the errors encountered during apply.
func (a *Applier) Errors() status.MultiError {
return a.errs
// TODO: Make read/write of a.errs thread-safe
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately, we can't just add a mutex around the errors here, because they're modified in multiple places. But fixing that is a larger change I want to do in a separate PR.

@karlkfi karlkfi requested review from nan-yu and haiyanmeng and removed request for mortent November 15, 2022 17:58
klog.Infof("No need to notify namespace reconciler for the conflict")
continue
}
if err := prependRootSyncRemediatorStatus(ctx, k8sClient, name, conflictErrors, defaultDenominator); err != nil {
Copy link
Contributor

@sdowell sdowell Nov 15, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I follow this - why do we need to surface conflict errors on the other RootSync? It seems like it should be sufficient to just surface it on this reconciler's RootSync

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The original intent was to make the conflict easy to be discovered by both RootSyncs. It is possible that the syncing process takes 10-20 minutes and the RootSync remains syncing without any issue. The cross-posted conflict error reports the conflicts earlier.

After talking with Karl, he will remove the cross-posted error in a separate PR after adding error merging to the applier.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code causes reconciler-B to update the rsync status of reconciler-A. This makes it possible for the owner of rsync-A to know there's a conflict.

Since reconciler-A applied the object first, the object has the manager set to reconciler-A. So when reconciler-B tries to update that object the apply fails, rejected by the webhook. Without this code, when the webhook is enabled, only rsync-B shows the conflict.

When the webhook is disabled, any two RootSyncs trying to manage the same object with actively fight, each taking ownership of the object with every apply. In this case, both will show the error from both the applier and the remediator.

RepoSyncs only adopt if unmanaged. So they won't fight, but can conflict. As with the webhook being enabled, this code reports the conflict on both sides.

Personally I think this behavior is confusing, harmful because it causes status flakiness, and technically incorrect because it makes the metrics not match the status API. But @nan-yu thinks we should keep it for now to make it easier to detect conflicts. Our compromise is for us to remove it later, after we improve the applier errors to not invalidate all errors when starting a new apply until we receive the event that created the error, thus making it easier to detect conflict errors from the applier.

func() {
nomostest.WatchForObject(nt, kinds.RootSyncV1Beta1(), rootSync2, configsync.ControllerNamespace,
[]nomostest.Predicate{
nomostest.RootSyncHasSyncError(nt, status.ManagementConflictErrorCode, "declared in another repository"),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to be clear, is it intentional that the error code and message changed from denied the request to declared in another repository?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the first RootSync, it should show the denied the request error.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAICT, "denied the request" doesn't exist in the code.

"declared in another repository" is from the KptManagementConflictError which is returned by both the applier and the remediator.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

denied the request is from the admission webhook: https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/apiserver/pkg/admission/plugin/webhook/errors/statuserror.go#L29.

The error will first show up because the admission webhook denied the request. The declared in another repository error is surfaced by the conflicting remediator. The RootSync status will flip.

Technically, we should expect the denied the request error for the first RootSync, especially the cross-RootSync update is going to be removed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for finding the denied the request code. Not sure why it didn't come up in my search.

Should we expect both the denied the request and the declared in another repository then? The tests are passing as-is, which seems to indicate that the conflict error from the remediator is definitely being propagated from rootSync2 to rootSync1. Adding denied the request would confirm that it also was detected via the apply being rejected by the webhook. But we probably wouldn't expect to see them at the same time...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we can expect both errors. Both RootSync1's applier and RootSync2's remediator updates the RootSync1's status. The RootSync1's status was expected to bounce back and forth to report the conflict in different ways. For this test, we care more about the denied the request error.

As we're planning to remove the cross-RootSync update, we can validate the denied the request error for now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've updated the test to check for all the possible errors.

When we disable cross-RootSync update, some of these will fail and we can remove them.

Copy link
Contributor

@nan-yu nan-yu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good with one minor error message. When the webhook is disabled, the first RootSync should expect a denied the request error.

func() {
nomostest.WatchForObject(nt, kinds.RootSyncV1Beta1(), rootSync2, configsync.ControllerNamespace,
[]nomostest.Predicate{
nomostest.RootSyncHasSyncError(nt, status.ManagementConflictErrorCode, "declared in another repository"),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the first RootSync, it should show the denied the request error.

pkg/parse/root.go Show resolved Hide resolved
pkg/parse/root.go Show resolved Hide resolved
func UpdateConflictManagerStatus(ctx context.Context, conflictErrs []status.ManagementConflictError, k8sClient client.Client) {
// reportRootSyncConflicts reports conflicts to the RootSync that manages the
// conflicting resources.
func reportRootSyncConflicts(ctx context.Context, k8sClient client.Client, conflictErrs []status.ManagementConflictError) error {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sounds good

klog.Infof("No need to notify namespace reconciler for the conflict")
continue
}
if err := prependRootSyncRemediatorStatus(ctx, k8sClient, name, conflictErrors, defaultDenominator); err != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The original intent was to make the conflict easy to be discovered by both RootSyncs. It is possible that the syncing process takes 10-20 minutes and the RootSync remains syncing without any issue. The cross-posted conflict error reports the conflicts earlier.

After talking with Karl, he will remove the cross-posted error in a separate PR after adding error merging to the applier.

- Add a new setSyncStatus that handles:
  - Updating the cached sync status in `state`
  - Checking the cached sync status to avoid unnecessary updates
  - Calling SetSyncStatus with the new sync errors
  - Calling reportRootSyncConflicts to update remote RootSyncs, if
    they are causing conflicts with the current RSync.
- Rename UpdateConflictManagerStatus -> reportRootSyncConflicts
- Rename parse.setXStatus -> setXStatusFields to allow the name
  setSyncStatus to be used by the wrapper that handles state
- Call the new parse.setSyncStatus to make all the updates
  consistent, instead of calling parser.SetSyncStatus directly.
- Rename updateSyncStatus -> updateSyncStatusPeriodically to
  indicate it handles retrying until cancelled.
- Rename Parser.ApplierErrors() -> SyncErrors() and have it
  delegate to the new updater.Errors(), which centralizes
  aggregation of the following errors that are all reported as
  sync errors:
  - Remediator conflict errors
  - Validation errors from updating the object set
  - Applier errors
  - Watch update errors
- Remove Parser.RemediatorConflictErrors() because it's obsolete
  with the new SyncErrors. The conflict errors are a subset of the
  sync errors and are extracted as-needed in setSyncErrors.
  This extraction probably won't be necessary in the future, as we
  plan to stop reporting errors across RSyncs.
- Enhance TestConflictingDefinitions_RootToRoot to handle catching
  conflict errors in parallel, for speed and consistently.
- Enahance TestDontDeleteAllNamespaces to use WatchForObject
- Fix linter copylock error

Change-Id: I79aa32d22080bc91a4262279de0b490974564112
Copy link
Contributor

@nan-yu nan-yu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@google-oss-prow
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: nan-yu

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@google-oss-prow google-oss-prow bot merged commit 33c1322 into GoogleContainerTools:main Nov 15, 2022
@karlkfi karlkfi deleted the karl-combo-sync-errors branch November 15, 2022 22:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants