Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Compare for duplicates before logging object exists errors #405

Merged
merged 2 commits into from
Apr 11, 2018

Conversation

nrb
Copy link
Contributor

@nrb nrb commented Mar 29, 2018

When restoring resources that raise an already exists error, check their
equality before logging a message on the restore. If they're the same
except for some metadata, don't generate a message.

The restore process was modified so that if an object had an empty
namespace string, no namespace key is created on the object. This was to
avoid manipulating the copy of the current cluster's object by adding
the target namespace.

There are some cases right now that are known to not be equal via this
method:

  • The default ServiceAccount in a namespace will not match, primarily
    because of differing default tokens. These will be handled in their own
    patch
  • IP addresses for Services are recorded in the backup object, but are
    either not present on the cluster object, or different. An issue for
    this already exists at Should service action (restore) remove all node ports from a service? #354
  • Endpoints have differing values for renewTime. This may be
    insubstantial, but isn't currently handled by the resetMetadataAndStatus
    function.
  • PersistentVolume objects in the backup have a claimRef section,
    while those in cluster do not.

Signed-off-by: Nolan Brubaker nolan@heptio.com

@nrb nrb added this to the v0.8.0 milestone Mar 29, 2018
@nrb nrb requested a review from skriss March 29, 2018 19:00
if equality.Semantic.DeepEqual(obj, fromCluster) {
continue
} else {
addToResult(&warnings, namespace, err)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@skriss Should we perhaps wrap or append to the error message here to say that the object exists and differs from the backup, so it's more clear what's happened?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah probably makes sense to wrap


// add an ark-restore label to each resource for easy ID
addLabel(obj, api.RestoreLabelKey, ctx.restore.Name)

ctx.infof("Restoring %s: %v", obj.GroupVersionKind().Kind, obj.GetName())
_, err = resourceClient.Create(obj)
if apierrors.IsAlreadyExists(err) {
addToResult(&warnings, namespace, err)
continue
fromCluster, _ := resourceClient.Get(obj.GetName(), metav1.GetOptions{})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need to handle error

continue
fromCluster, _ := resourceClient.Get(obj.GetName(), metav1.GetOptions{})
// Remove insubstantial metadata
fromCluster, _ = resetMetadataAndStatus(fromCluster)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need to handle error

addLabel(fromCluster, api.RestoreLabelKey, ctx.restore.Name)

if equality.Semantic.DeepEqual(obj, fromCluster) {
continue
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

probably makes sense to do a ctx.infof(...) here stating that the object already exists with matching state in-cluster

if equality.Semantic.DeepEqual(obj, fromCluster) {
continue
} else {
addToResult(&warnings, namespace, err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah probably makes sense to wrap

@nrb nrb force-pushed the ignore-duplicates branch 3 times, most recently from 3254d25 to 990be21 Compare April 4, 2018 20:59
@nrb nrb removed this from the v0.8.0 milestone Apr 4, 2018
@@ -427,6 +428,41 @@ func (ctx *context) restoreFromDir(dir string) (api.RestoreResult, api.RestoreRe
return warnings, errs
}

// resolveConflict tries to resolve conflicts between a backed up object
// and it's representation in the cluster, deciding on what action to take
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

its

@@ -427,6 +428,41 @@ func (ctx *context) restoreFromDir(dir string) (api.RestoreResult, api.RestoreRe
return warnings, errs
}

// resolveConflict tries to resolve conflicts between a backed up object
// and it's representation in the cluster, deciding on what action to take
// based on type. Errors are wrapped with some additional data and returned to
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure we need the comment about errors?

// and it's representation in the cluster, deciding on what action to take
// based on type. Errors are wrapped with some additional data and returned to
// the caller to handle. If the items are equivalent, nil is returned
func (ctx *context) resolveConflict(fromBackup *unstructured.Unstructured, resourceClient client.Dynamic, restoreErr error) error {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need a small function that takes 2 objects and tells us if they're equal. Ideally it should just return a bool, although it may need to return (bool, error) because resetMetadataAndStatus can return an error.

We could have another function that retrieves fromCluster and then calls the helper above.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense. Perhaps resolveConflict would be the top level that gets fromCluster, cleans it up, and passes it down?

// Do some initial work to get the cluster version for comparison
fromCluster, err := resourceClient.Get(fromBackup.GetName(), metav1.GetOptions{})
if err != nil {
es := fmt.Sprintf("Conflict found for %s: %v, but could not retrieve cluster verison: %s", fromBackup.GroupVersionKind(), fromBackup.GetName(), err.Error())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: if you need to format an error string, do errors.Errorf

if err != nil {
es := fmt.Sprintf("Conflict found for %s: %v, but could not retrieve cluster verison: %s", fromBackup.GroupVersionKind(), fromBackup.GetName(), err.Error())
err = errors.New(es)
return err
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would probably omit the format string and just return errors.WithStack(err)

fromCluster, err = resetMetadataAndStatus(fromCluster)
if err != nil {
es := fmt.Sprintf("Conflict found for %s: %v, but could not reset metadata for cluster verison: %s", fromBackup.GroupVersionKind(), fromBackup.GetName(), err.Error())
err = errors.New(es)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would probably omit the format string and just return err

if !equality.Semantic.DeepEqual(fromBackup, fromCluster) {
es := fmt.Sprintf("%s and backup does not match.", restoreErr.Error())
err := errors.New(es)
return err
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is where I'd return false

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you return false, nil with a (bool, err) return signature?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes


// add an ark-restore label to each resource for easy ID
addLabel(obj, api.RestoreLabelKey, ctx.restore.Name)

ctx.infof("Restoring %s: %v", obj.GroupVersionKind().Kind, obj.GetName())
_, err = resourceClient.Create(obj)
if apierrors.IsAlreadyExists(err) {
addToResult(&warnings, namespace, err)
err = ctx.resolveConflict(obj, resourceClient, err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assuming this returns (bool, err), I'd store it in a different error to differentiate from the already-exists error. I would log the error that comes back from this call, if we get one. Then, if we got an error here, or the bool return value is false (objects not equal), call addToResult.

Here's my thinking: if we run into any problems trying to determine equality, it's better to record those problems in the restore log file, and record the conflict in the restore result warnings. Would you agree?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think I would - I actually had resolveConflict returning (bool, err) initially, but changed it to error only as I was only switching on error states. I was also doing _, restoreErr := resourceClient.Create(obj) in order to keep it separate.

I'll return to that structure.

equal, err := ctx.objectsAreEqual(obj, resourceClient, restoreErr)
// Log any errors trying to resolve the conflict; nil means no conflic
if err != nil {
addToResult(&warnings, namespace, err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs to be ctx.infof so it goes into the restore log file.

addToResult(&warnings, namespace, err)
}
if equal {
ctx.infof("Backup for %s: %v matches cluster version, skipping restore.", obj.GroupVersionKind(), obj.GetName())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think we need this log line?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not particularly attached to it. Fine with deleting it.

if equal {
ctx.infof("Backup for %s: %v matches cluster version, skipping restore.", obj.GroupVersionKind(), obj.GetName())
} else {
e := fmt.Errorf("%s and backup does not match.", restoreErr)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does this error text end up looking like - do you have an example?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's a snippet from a restore:

Warnings:
  Ark:        <none>
  Cluster:  persistentvolumes "pvc-c5c8bb8c-3758-11e8-a959-42010a960060" already exists and backup does not match.
            persistentvolumes "pvc-f07ce647-336d-11e8-9d79-42010a96006f" already exists and backup does not match.
  Namespaces:
    default:        services "kubernetes" already exists and backup does not match.
    kube-system:    endpoints "kube-controller-manager" already exists and backup does not match.
                    endpoints "kube-scheduler" already exists and backup does not match.
                    services "default-http-backend" already exists and backup does not match.
                    services "heapster" already exists and backup does not match.
                    services "kube-dns" already exists and backup does not match.
                    services "kubernetes-dashboard" already exists and backup does not match.
    nginx-example:  services "my-nginx" already exists and backup does not match.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks

continue
}
if err != nil {
if restoreErr != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a comment indicating that if we get here, we got some error other than already-exists

addLabel(fromCluster, api.RestoreLabelKey, restoreName)

// If there are no specific actions needed based on the type, simply check for equality.
if !equality.Semantic.DeepEqual(fromBackup, fromCluster) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

return equality.Semantic.DeepEqual(fromBackup, fromCluster), nil

@@ -427,6 +428,18 @@ func (ctx *context) restoreFromDir(dir string) (api.RestoreResult, api.RestoreRe
return warnings, errs
}

// objectsAreEqual indentifies whether a backed up object
// and its representation in the cluster are equal
func (ctx *context) objectsAreEqual(fromBackup *unstructured.Unstructured, resourceClient client.Dynamic, restoreErr error) (bool, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need to pass in restoreErr any more?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for going back and forth on this, but I'm wondering if it would just be clearer to move the Get() into the IsAlreadyExists block and get rid of this helper. WDYT?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My very first pass did that. The reason I named this function resolveConflict was because I envisioned adding more complex cases like service accounts there later. That way, the resolution was mostly out of the IsAlreadyExists block.

Doing the Get() in the block seems reasonable to me, and some function that contains different actions or policy checks added when it's needed.

_, restoreErr := resourceClient.Create(obj)
if apierrors.IsAlreadyExists(restoreErr) {
fromCluster, err := resourceClient.Get(obj.GetName(), metav1.GetOptions{})
if err != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm thinking of something like this:

equal := false

if fromCluster, err := resourceClient.Get(obj.GetName(), metav1.GetOptions{}); err == nil {
  equal, err = objectsAreEqual(fromCluster, obj)
  if err != nil {
    // Log any errors trying to check equality
    ctx.infof(err.Error())
  }
} else {
  ctx.infof("error getting from-cluster...")
}

if !equal {
  e := fmt.Errorf("%s and backup does not match.", restoreErr)
  addToResult(&warnings, namespace, e)
}

This way, we only call objectsAreEqual if we successfully retrieve fromCluster. No matter what, if we get an error anywhere in here, it gets logged, and we add the warning in all cases unless the objects are equal. WDYT?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that makes sense.

}

for _, test := range tests {
res, err := objectsAreEqual(test.clusterObj, test.backupObj)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please run in a subtest: t.Run(test.name, func(t *testing.T) { ... })

@@ -897,6 +936,15 @@ func (obj *testUnstructured) WithName(name string) *testUnstructured {
return obj.WithMetadataField("name", name)
}

func (obj *testUnstructured) WithArkLabel(restoreName string) *testUnstructured {
ls := make(map[string]string)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would check if the obj already has labels and merge if it does.

clusterObj: NewTestUnstructured().Unstructured,
expectedErr: true,
expectedRes: false,
},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be worth getting the json output from a real cluster of something like a secret or service or pod and putting that in here as a test case. We have a test helper to convert json to Unstructured.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do.

@@ -722,6 +740,23 @@ func (ctx *context) executePVAction(obj *unstructured.Unstructured) (*unstructur
return updated2, nil
}

// objectsAreEqual takes two unstructured objects and checks for equality.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd add a comment that fromCluster is mutated

ctx.infof("Error retrieving cluster version of %s: %v", obj.GetName(), err)
}
if !equal {
e := fmt.Errorf("%s and backup does not match.", restoreErr)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're trying to use errors.Errorf (from github.com/pkg/errors) wherever possible.

@nrb nrb force-pushed the ignore-duplicates branch 2 times, most recently from e0b31b5 to 1e84cae Compare April 10, 2018 15:28
@nrb
Copy link
Contributor Author

nrb commented Apr 10, 2018

@ncdc @skriss Latest comments addressed, PTAL when you can.

Signed-off-by: Nolan Brubaker <nolan@heptio.com>
Copy link
Contributor

@skriss skriss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally looking pretty good to me

equal, err = objectsAreEqual(fromCluster, obj)
// Log any errors trying to check equality
if err != nil {
ctx.infof(err.Error())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd probably make this something like ctx.infof("error checking %s for equality: %v", obj.getName(), err)

ctx.infof("Error retrieving cluster version of %s: %v", obj.GetName(), err)
}
if !equal {
e := errors.Errorf("%s and backup does not match.", restoreErr)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wordsmithing here, but since this is directly part of the UX it's important to make this super clear - maybe something like " not restored: , and is different than backed-up version." I think it's important to make clear that it wasn't restored, because it already exists, and there's a diff (so the user might want to manually remediate).

@@ -722,6 +740,25 @@ func (ctx *context) executePVAction(obj *unstructured.Unstructured) (*unstructur
return updated2, nil
}

// objectsAreEqual takes two unstructured objects and checks for equality.
// The fromCluster object is copy that is mutated to remove any insubstantial runtime
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would remove is copy that

When restoring resources that raise an already exists error, check their
equality before logging a message on the restore. If they're the same
except for some metadata, don't generate a message.

The restore process was modified so that if an object had an empty
namespace string, no namespace key is created on the object. This was to
avoid manipulating the copy of the current cluster's object by adding
the target namespace.

There are some cases right now that are known to not be equal via this
method:

- The `default` ServiceAccount in a namespace will not match, primarily
because of differing default tokens. These will be handled in their own
patch
- IP addresses for Services are recorded in the backup object, but are
either not present on the cluster object, or different. An issue for
this already exists at vmware-tanzu#354
- Endpoints have differing values for `renewTime`. This may be
insubstantial, but isn't currently handled by the resetMetadataAndStatus
function.
- PersistentVolume objects do not match on spec fields, such as
claimRef and cloud provider persistent disk info

Signed-off-by: Nolan Brubaker <nolan@heptio.com>
@skriss
Copy link
Contributor

skriss commented Apr 10, 2018

LGTM!

@ncdc ncdc merged commit 0998f6d into vmware-tanzu:master Apr 11, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants