Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BackupController: do as much as possible #250

Merged

Conversation

ncdc
Copy link
Contributor

@ncdc ncdc commented Dec 14, 2017

When running a backup, try to do as much as possible, collecting errors
along the way, and return an aggregate at the end. This way, if a backup
fails for most reasons, we'll be able to upload the backup log file to
object storage, which wasn't happening before.

Signed-off-by: Andy Goldstein andy.goldstein@gmail.com

@ncdc ncdc added this to the v0.7.0 milestone Dec 14, 2017
@ncdc ncdc force-pushed the backup-controller-do-as-much-as-possible branch from 6740070 to db22f6a Compare December 19, 2017 17:16
@ncdc
Copy link
Contributor Author

ncdc commented Dec 19, 2017

@skriss PTAL at these changes and let me know if you think this is minimally sufficient. If so, I'll fix the tests.

@skriss
Copy link
Contributor

skriss commented Dec 19, 2017

Changes LGTM. 2 peripheral things to think about - I think the backup service's GetAllBackups will now log an error any time it's called if there are failed backup logs in obj store (although the func should still return valid backups without error); and also, in the current state, I don't think failed backup logs will ever get GC'ed. I think with my changes to GC in #252, though, they will.

@ncdc
Copy link
Contributor Author

ncdc commented Dec 19, 2017

I think the backup service's GetAllBackups will now log an error any time it's called if there are failed backup logs in obj store (although the func should still return valid backups without error)

Is it worth uploading the metadata for Failed backups? That would mean they would sync from object storage to kube.

in the current state, I don't think failed backup logs will ever get GC'ed. I think with my changes to GC in #252, though, they will.

I'm ok putting this PR in knowing that yours will take care of that.

@skriss
Copy link
Contributor

skriss commented Dec 19, 2017

I don't see any downside to uploading the metadata - do you? I think it makes sense. I guess the alternative could be to put failed backup logs in a special top-level dir that we ignore when listing backups.

@ncdc
Copy link
Contributor Author

ncdc commented Dec 19, 2017

I'll go ahead an upload it

@ncdc ncdc force-pushed the backup-controller-do-as-much-as-possible branch from db22f6a to 8ef213a Compare December 20, 2017 16:13
@ncdc
Copy link
Contributor Author

ncdc commented Dec 20, 2017

@skriss ready for review. @nrb would be nice if you could run this through e2e :)

@nrb
Copy link
Contributor

nrb commented Dec 20, 2017

I'll give it a shot this afternoon

When running a backup, try to do as much as possible, collecting errors
along the way, and return an aggregate at the end. This way, if a backup
fails for most reasons, we'll be able to upload the backup log file to
object storage, which wasn't happening before.

Signed-off-by: Andy Goldstein <andy.goldstein@gmail.com>
@ncdc ncdc force-pushed the backup-controller-do-as-much-as-possible branch from 8ef213a to 1e581f1 Compare December 20, 2017 19:49
@nrb
Copy link
Contributor

nrb commented Dec 20, 2017

e2e test on my local machine passed.

Relevant console output snipped:

customresourcedefinition "backups.ark.heptio.com" created
customresourcedefinition "schedules.ark.heptio.com" created
customresourcedefinition "restores.ark.heptio.com" created
customresourcedefinition "configs.ark.heptio.com" created
customresourcedefinition "downloadrequests.ark.heptio.com" created
namespace "heptio-ark" created
serviceaccount "ark" created
clusterrolebinding "ark" created
role "ark" created
rolebinding "ark" created
creating user...
created user.
secret "cloud-credentials" created
config "default" created
deployment "ark" created
Running pre-test
Reading nginx example file...
Read nginx example file.
Ensuring _artifacts dir exists...
Ensured _artifacts dir exists.
Writing evaluated example file...
Wrote evaluated example file.
Applying example file...
namespace "nginx-example" created
deployment "nginx-deployment" created
service "my-nginx" created
Applied example file.
Running test
Creating backup...
Backup "nginx-backup" created successfully.
Created backup.
Waiting for backup to complete...
Backup completed.
Deleting nginx namespace & PV to simulate disaster...
namespace "nginx-example" deleted
Waiting for nginx namespace to terminate...
Namespace terminated.
Creating restore...
Restore "nginx-backup-20171220164846" created successfully.
Created restore.
Verifying nginx install...
deployments.apps "nginx-deployment" not found
Verified nginx install.
PASS: true
Running post-test
Fetching logs for namespace  heptio-ark
Ensuring log directory exists
-- Getting logs for ark-5b775789d4-hcvjq
Creating  _artifacts/logs/ark-5b775789d4-hcvjq.log
Wrote log  _artifacts/logs/ark-5b775789d4-hcvjq.log
Fetching logs for namespace  nginx-example
Ensuring log directory exists
-- Getting logs for nginx-deployment-569477d6d8-knhsq
Creating  _artifacts/logs/nginx-deployment-569477d6d8-knhsq.log
Wrote log  _artifacts/logs/nginx-deployment-569477d6d8-knhsq.log
-- Getting logs for nginx-deployment-569477d6d8-shzkh
Creating  _artifacts/logs/nginx-deployment-569477d6d8-shzkh.log
Wrote log  _artifacts/logs/nginx-deployment-569477d6d8-shzkh.log
namespace "nginx-example" deleted
backup "nginx-backup" deleted
restore "nginx-backup-20171220164846" deleted
namespace "heptio-ark" deleted
Waiting for Ark namespace to delete...
Ark namespace deleted.
Deleting Ark IAM user...

The deployments.apps "nginx-deployment" not found output comes from how diluvian's looping to verify the deployment; it will keep looking til a timeout happens and fail if it wasn't found. So I'm not too concerned about that line unless others are.

The branch of the e2e tests I ran didn't grab the backup logs, but I can look at adding that tomorrow.

// upload tar file
if err := br.seekAndPutObject(bucket, getBackupContentsKey(backupName), backup); err != nil {
// try to delete the metadata file since the data upload failed
deleteErr := br.objectStore.DeleteObject(bucket, metadataKey)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably makes sense to leave the metadata file here, right? For the same reason that we're uploading it for failed backups.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I was planning to ask you about that. I'm happy to remove this code!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do we handle the situation where the backup completed successfully, we were able to upload the metadata, but uploading the tarball failed for some reason? What you'd see is a completed backup, with logs, but no ability to restore it... TODO for later, or fix now?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't you see Failed on the API object? this func would return an error to the controller, and then runBackup would return an error to processBackup which would log it and mark it as failed. If this is true, I think it's still not ideal but reasonably obvious enough that no further changes would be needed for now,

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my scenario, the json file in object storage has Completed

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right, but the backup obj in etcd has Failed, right? so ark backup get would show Failed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd have to go back through the various places where status is changed to failed to confirm. Also, if you were to sync from object storage into a new kube cluster, it would come in as completed...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

true. idk, what do you think makes sense? we could remove everything from obj storage on error, or re-upload metadata with a Failed status, or...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thoughts on where to leave this for now?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Forgot about this thread. Let me page it back in and think about it.

metadata: newStringReadSeeker("foo"),
metadataError: errors.New("md"),
log: newStringReadSeeker("baz"),
expectedErr: "md",
},
{
name: "error on data upload deletes metadata",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see prev comment re: whether to delete metadata file in this case

@ncdc
Copy link
Contributor Author

ncdc commented Dec 21, 2017

Don't merge yet... with 1 small change locally, this is what you see in a backup's log if you have a plugin that returns an error on every single item in the backup:

[snip]

time="2017-12-21T16:47:04Z" level=info msg="Backing up resource" backup=heptio-ark/fail5 group=v1 groupResource=secrets logSource="pkg/backup/item_backupper.go:163" name=default-token-s5wfb namespace=heptio-ark
time="2017-12-21T16:47:04Z" level=info msg="Executing custom action" backup=heptio-ark/fail5 group=v1 groupResource=secrets logSource="pkg/backup/item_backupper.go:188" name=default-token-s5wfb namespace=heptio-ark
time="2017-12-21T16:47:04Z" level=info msg="Hello from MyPlugin!" backup=heptio-ark/fail5 group=v1 groupResource=secrets logSource="/go/src/github.com/ncdc/ark-backupitemaction-always-fail/ark-backupitemaction-always-fail/myplugin.go:42" name=default-token-s5wfb namespace=heptio-ark pluginName=ark-backupitemaction-always-fail

[snip]

time="2017-12-21T16:47:05Z" level=info msg="Backup completed with errors: [error executing custom action: rpc error: code = Unknown desc = oh no!, error executing custom action: rpc error: code = Unknown desc = oh no!, error executing custom action: rpc error: code = Unknown desc = oh no!, error executing custom action: rpc error: code = Unknown desc = oh no!, error executing custom action: ......] backup=heptio-ark/fail5 logSource="pkg/backup/backup.go:253"

Some things to point out:

  1. You can see that my plugin executed (Hello from MyPlugin!) but you don't know that it failed (there's no immediate logging about the returned error)
  2. All the errors are aggregated into a giant array, with each error having the (rather uninformative) text error executing custom action: rpc error: code = Unknown desc = oh no!

Do you think it makes sense to log the succeeded/failed status after each custom action runs?

How do you think we should handle the error that the item backupper returns when a custom action fails? Should we include more details (group, resource, namespace, name, plugin name)? Just print out the description so we don't show the rpc details? What is the best way to show the errors to users to they're easy to find and discern?

Signed-off-by: Andy Goldstein <andy.goldstein@gmail.com>
@skriss
Copy link
Contributor

skriss commented Dec 21, 2017

Do you think it makes sense to log the succeeded/failed status after each custom action runs?

I'd say it makes sense to immediately log on an error (at the error level). I think logging success might make the log too noisy.

How do you think we should handle the error that the item backupper returns when a custom action fails? Should we include more details (group, resource, namespace, name, plugin name)? Just print out the description so we don't show the rpc details? What is the best way to show the errors to users to they're easy to find and discern?

I think it makes sense to include more context in the errors (can Wrapf the error from the custom action with the additional fields). I'm not overly concerned about the RPC details as long as we include the additional context.

@nrb
Copy link
Contributor

nrb commented Dec 21, 2017

I'd say it makes sense to immediately log on an error (at the error level). I think logging success might make the log too noisy.

Agreed.

@ncdc ncdc changed the title BackupController: do as much as possible [WIP] BackupController: do as much as possible Dec 22, 2017
If a backup item action errors, log the error as soon as it occurs, so
it's clear when the error happened. Also include information about the
groupResource, namespace, and name of the item in the error.

Signed-off-by: Andy Goldstein <andy.goldstein@gmail.com>
@ncdc ncdc changed the title [WIP] BackupController: do as much as possible BackupController: do as much as possible Jan 3, 2018
@skriss
Copy link
Contributor

skriss commented Jan 3, 2018

LGTM.

@skriss skriss merged commit 656428d into vmware-tanzu:master Jan 3, 2018
@ncdc ncdc deleted the backup-controller-do-as-much-as-possible branch March 5, 2018 15:58
dymurray pushed a commit to dymurray/velero that referenced this pull request Jul 3, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants