Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Switch from finalizer to DeleteBackupRequest for deleting backups #383

Merged
merged 13 commits into from
Apr 6, 2018

Conversation

ncdc
Copy link
Contributor

@ncdc ncdc commented Mar 15, 2018

We ran into a lot of problems using a finalizer on the backup to allow
the Ark server to clean up all associated backup data when deleting a
backup.

Users also found it less than desirable that deleting the heptio-ark
namespace resulted in all the backup data being deleted.

This removes the finalizer and replaces it with an explicit
DeleteBackupRequest that is created as a means of requesting the
deletion of a backup and all its associated data. This is what ark backup delete does.

If you use kubectl to delete a backup or to delete the heptio-ark
namespace, this no longer deletes associated backups. Additionally, as
long as the heptio-ark namespace still exists, the Ark server's
BackupSyncController will continually sync backups into the heptio-ark
namespace from object storage.

TODO:

  • Modify ark backup describe to show DeleteBackupRequest status if relevant
  • Write one-shot code to run at server startup to remove our old backup finalizer if it's there
  • Update unit tests

Fixes #375
Fixes #376
Fixes #358

@ncdc
Copy link
Contributor Author

ncdc commented Mar 15, 2018

@skriss @nrb I haven't updated tests yet - wanted to publish this now so you can start reviewing the logic in the delete and gc controllers.

@ncdc
Copy link
Contributor Author

ncdc commented Mar 15, 2018

I also think it would be worth it to add some code that runs once each time the server starts, removing the ark GC finalizer from any backups that have it.

metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func NewDeleteBackupRequest(name string) *v1.DeleteBackupRequest {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO: godoc


deleteBackupRequestInformer.Informer().AddEventHandler(
cache.ResourceEventHandlerFuncs{
AddFunc: c.enqueue,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO: need an UpdateFunc (e.g. request is created and modified before the controller sees it, status is still ""/New)

},
syncPeriod,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO: switch to resyncFunc in genericController

name string
queue workqueue.RateLimitingInterface
logger logrus.FieldLogger
syncHandler func(key string) error
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO: add resyncFunc

}

// Try to delete restores
log.Info("Removing restores")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO we should remove restores when explicitly deleting a backup, but I have seen others mention that retaining restores as a matter of record would be beneficial. The restores would then be cleaned up via the GC controller. I think that's also a reasonable approach, wdyt?

I don't think this should be an option, though, we should only do one or the other.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm seeing now that the GC controller uses the BackupDeleteRequests to do a delete, so I suppose the 2nd approach would require shuffling some logic around. I'm fine with it as is.

Copy link
Member

@skriss skriss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

initial pass

Gopkg.toml Outdated
@@ -31,6 +31,11 @@ required = [
"k8s.io/code-generator/cmd/informer-gen",
]

[prune]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cool!

// +k8s:deepcopy-gen:interfaces=k8s.io/apimachinery/pkg/runtime.Object

// DeleteBackupRequestList is a list of DeleteBackupRequests.
type DeleteBackupRequestList struct {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question - why isn't there a code-generator for the *List types?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably not really worth the effort

return &v1.DeleteBackupRequest{
ObjectMeta: metav1.ObjectMeta{
GenerateName: name + "-",
Labels: map[string]string{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the backup name stored in a label in addition to the spec just for ease of filtering?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, so we can delete all requests for backup x

}

func (o *DeleteOptions) Validate(c *cobra.Command, args []string, f client.Factory) error {
if len(args) != 1 {
return errors.New("you must specify only one argument, the backup's name")
}

kubeClient, err := f.KubeClient()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we want to validate that the specified backup exists here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do

func (c *backupDeletionController) processRequest(req *v1.DeleteBackupRequest) error {
log := c.logger.WithFields(logrus.Fields{
"namespace": req.Namespace,
"name": req.Spec.BackupName,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should the name field be req.Spec.Name? Maybe add that and the backup name as fields?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do


}()

c.logger.Info("Starting %s", c.name)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would probably be useful to add the controller name as a field to the c.logger

README.md Outdated
@@ -135,28 +135,6 @@ ark restore describe <RESTORE_NAME>

For more information, see [the debugging information][18].

### Clean up
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe leave this section, and specify that you can just delete the NS, and note that backups/snapshots will not be deleted

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. I'll make it clearer that this is optional and only if you want to remove stuff.

@@ -115,7 +115,7 @@ Now you need to create a Secret that contains all the seven environment variable

```bash
kubectl create secret generic cloud-credentials \
--namespace <ARK_SERVER_NAMESPACE> \
--namespace <ARK_SERVER_ARK_NAMESPACE> \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ARK_NAMESPACE

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks

@@ -47,7 +47,7 @@ Create a Secret. In the directory of the credentials file you just created, run:

```bash
kubectl create secret generic cloud-credentials \
--namespace <ARK_SERVER_NAMESPACE> \
--namespace <ARK_SERVER_ARK_NAMESPACE> \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ARK_NAMESPACE

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks


To run the server in another namespace, you edit the relevant files, changing `heptio-ark-server` to
To run the server in another namespace, you edit the relevant files, changing `heptio-ark` to
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing not explicitly called out here is that the cloud-credentials secret also needs to be created in the alternate namespace. It's templatized in each cloud-provider's walkthrough, but maybe worth just referencing it here too.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do

r.Status.Phase = v1.DeleteBackupRequestPhaseProcessed
r.Status.Errors = []string{"unable to delete backup because it includes PV snapshots and Ark is not configured with a PersistentVolumeProvider"}
})
if err != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

return err

@ncdc ncdc force-pushed the delete-backup-request branch 2 times, most recently from f06cb35 to 1fe9dea Compare March 15, 2018 20:18
@skriss
Copy link
Member

skriss commented Mar 16, 2018

Some observations from testing:

  • all of the normal-case functionality looks good (removal of finalizer from old backups, ark backup delete, garbage-collection)
  • the warning message when running ark backup delete foo without the --confirm flag is misleading (references force-deletion, orphaning cloud resources)
  • in Azure, if I manually delete a snapshot before deleting a backup that references it, Ark doesn't error and deletes the backup (I think the outcome is OK, but I would've expected an error. It may be that the Azure API doesn't return an error if you try to delete a non-existent snapshot?)
  • you were going to add a check for non-existent backups in the command code, looks like that's not there yet
  • when running ark backup delete for a non-existent backup, the Ark server log has this error:
    time="2018-03-16T16:57:58Z" level=error msg="Error in syncHandler, re-adding item to queue" controller=backup-deletion error="error patching DeleteBackupRequest: Object 'Kind' is missing in '{\"object\":{\"apiVersion\":\"ark.heptio.com/v1\",\"kind\":\"DeleteBackupRequest\",\"metadata\":{\"clusterName\":\"\",\"creationTimestamp\":\"2018-03-16T16:57:58Z\",\"deletionGracePeriodSeconds\":null,\"deletionTimestamp\":null,\"generateName\":\"nginx-\",\"generation\":0,\"labels\":{\"ark.heptio.com/backup-name\":\"nginx\"},\"name\":\"nginx-3hp7g\",\"namespace\":\"heptio-ark\",\"resourceVersion\":\"1049234\",\"selfLink\":\"\",\"uid\":\"2ddbb8b4-293b-11e8-bf7d-0a58ac1f100c\"},\"spec\":{\"backupName\":\"nginx\"},\"status\":{\"errors\":[\"backup not found\"],\"phase\":\"Processed\"}}}'" error.file="backup_deletion_controller.go:259" error.function="(*backupDeletionController).patchDeleteBackupRequest" key=heptio-ark/nginx-3hp7g logSource="pkg/controller/generic_controller.go:136"
    
  • if I take a backup with a snapshot, then remove the persistentVolumeProvider from config, then try to delete the backup, the failed delete request doesn't show up in ark backup describe -- looks like this is because we only show the associated requests if the backup is in phase Deleting, which the backup is not in this case
  • if I try to delete a non-existent backup foo, then later I actually create a backup foo, when I go to delete backup foo I now see the failed backup requests from prior to the backup existing in ark backup describe. Maybe they should be linked by UID rather than name?

@ncdc
Copy link
Contributor Author

ncdc commented Mar 16, 2018

the warning message when running ark backup delete foo without the --confirm flag is misleading (references force-deletion, orphaning cloud resources)

Fixed

in Azure, if I manually delete a snapshot before deleting a backup that references it, Ark doesn't error and deletes the backup (I think the outcome is OK, but I would've expected an error. It may be that the Azure API doesn't return an error if you try to delete a non-existent snapshot?)

I'm not sure an error is appropriate here. Your intent is to delete a backup and all its associated data. If we can't find some of the associated data, that's not really an issue, is it? I would expect it to be a problem if you wanted to restore, of course.

Maybe we need something as part of the backup sync controller that periodically verifies the integrity of a backup's associated data?

you were going to add a check for non-existent backups in the command code, looks like that's not there yet

That's pushed up now.

when running ark backup delete for a non-existent backup, the Ark server log has this error:

I manually created a DeleteBackupRequest with what should be the same as what ark backup delete does (since I now have the check for nonexistent backups), and I can't reproduce the patch error. Would you have some time to try to reproduce & debug?

if I take a backup with a snapshot, then remove the persistentVolumeProvider from config, then try to delete the backup, the failed delete request doesn't show up in ark backup describe -- looks like this is because we only show the associated requests if the backup is in phase Deleting, which the backup is not in this case

So maybe always show if at least 1 has errors?

if I try to delete a non-existent backup foo, then later I actually create a backup foo, when I go to delete backup foo I now see the failed backup requests from prior to the backup existing in ark backup describe. Maybe they should be linked by UID rather than name?

I'll link on both name and uid.

This brings up something I haven't addressed yet - there's no expiration on DeleteBackupRequests nor any good way to clean them up (other than successfully deleting a backup).

@skriss
Copy link
Member

skriss commented Mar 16, 2018

in Azure, if I manually delete a snapshot before deleting a backup that references it, Ark doesn't error and deletes the backup (I think the outcome is OK, but I would've expected an error. It may be that the Azure API doesn't return an error if you try to delete a non-existent snapshot?)

I'm not sure an error is appropriate here. Your intent is to delete a backup and all its associated data. If we can't find some of the associated data, that's not really an issue, is it? I would expect it to be a problem if you wanted to restore, of course.

Yeah, I think it's fine as-is.

Maybe we need something as part of the backup sync controller that periodically verifies the integrity of a backup's associated data?

Would be a nice feature to add, but not needed as part of this PR.

when running ark backup delete for a non-existent backup, the Ark server log has this error:

I manually created a DeleteBackupRequest with what should be the same as what ark backup delete does (since I now have the check for nonexistent backups), and I can't reproduce the patch error. Would you have some time to try to reproduce & debug?

Yep, I'll look into it

if I take a backup with a snapshot, then remove the persistentVolumeProvider from config, then try to delete the backup, the failed delete request doesn't show up in ark backup describe -- looks like this is because we only show the associated requests if the backup is in phase Deleting, which the backup is not in this case

So maybe always show if at least 1 has errors?

I think that makes sense. Makes more sense to me than putting the backup in Deleting.

This brings up something I haven't addressed yet - there's no expiration on DeleteBackupRequests nor any good way to clean them up (other than successfully deleting a backup).

Yeah - probably not urgent but would be a good idea to GC old ones, esp. those that aren't associated with any existing backup.

@skriss
Copy link
Member

skriss commented Mar 27, 2018

I can no longer produce the patch error - attempted deletion of a nonexistent backup is blocked by the CLI, and applying the resource directly works fine.

@ncdc ncdc force-pushed the delete-backup-request branch 3 times, most recently from 3960c93 to 54dabd3 Compare April 4, 2018 20:20
@ncdc
Copy link
Contributor Author

ncdc commented Apr 4, 2018

@skriss added deletion of processed delete backup requests > 24h

ncdc added 9 commits April 4, 2018 16:44
- Add pruning settings to Gopkg.toml
- Update vendoring deps doc to point to dep installation instructions
  and to use dep instead of hack/dep-save.sh
- Remove hack/dep-save.sh

Signed-off-by: Andy Goldstein <andy.goldstein@gmail.com>
Now that we've configured pruning for dep, this removes all unused
packages, all non-go files, and all tests from the vendor directory.

NOTE: due to a change in dep, it preserves anything that looks like a
license file. We'll be pulling in a few files we weren't previously
using - mostly license files. It's easier to just go with what dep does
than to try to exclude them after the fact.

Signed-off-by: Andy Goldstein <andy.goldstein@gmail.com>
Use a custom builder image to do all of Ark's builds. This image now
contains k8s.io/code-generator for code generation.

Enable docker in travis to use the builder image.

Signed-off-by: Andy Goldstein <andy.goldstein@gmail.com>
Signed-off-by: Andy Goldstein <andy.goldstein@gmail.com>
We ran into a lot of problems using a finalizer on the backup to allow
the Ark server to clean up all associated backup data when deleting a
backup.

Users also found it less than desirable that deleting the heptio-ark
namespace resulted in all the backup data being deleted.

This removes the finalizer and replaces it with an explicit
DeleteBackupRequest that is created as a means of requesting the
deletion of a backup and all its associated data. This is what `ark
backup delete` does.

If you use kubectl to delete a backup or to delete the heptio-ark
namespace, this no longer deletes associated backups. Additionally, as
long as the heptio-ark namespace still exists, the Ark server's
BackupSyncController will continually sync backups into the heptio-ark
namespace from object storage.

Signed-off-by: Andy Goldstein <andy.goldstein@gmail.com>
Signed-off-by: Andy Goldstein <andy.goldstein@gmail.com>
Signed-off-by: Andy Goldstein <andy.goldstein@gmail.com>
Signed-off-by: Andy Goldstein <andy.goldstein@gmail.com>
Signed-off-by: Andy Goldstein <andy.goldstein@gmail.com>
@skriss
Copy link
Member

skriss commented Apr 5, 2018

Latest commit LGTM

ncdc added 2 commits April 5, 2018 14:21
Always request DeleteBackupRequests for a given backup so we can show
failed deletion attempts if you try to delete a backup that has PV
snapshots when Ark doesn't have a persistentVolumeProvider configured.

When creating a DeleteBackupRequest, include a label for the UID so we
can match based on name and UID when associated DeleteBackupRequests
with a given backup.

Signed-off-by: Andy Goldstein <andy.goldstein@gmail.com>
Make sure a DeleteBackupRequest has its Spec.BackupName filled in. If
not, record an error in the status and mark the request as processed.

Signed-off-by: Andy Goldstein <andy.goldstein@gmail.com>
@ncdc ncdc changed the title [WIP] Switch from finalizer to DeleteBackupRequest for deleting backups Switch from finalizer to DeleteBackupRequest for deleting backups Apr 5, 2018
@ncdc ncdc added this to the v0.8.0 milestone Apr 5, 2018
@skriss
Copy link
Member

skriss commented Apr 5, 2018

last 2 also LGTM

When the BackupDeletionController processes a request, set the request's
backup-name and backup-uid labels if they aren't currently set.

Signed-off-by: Andy Goldstein <andy.goldstein@gmail.com>
@skriss
Copy link
Member

skriss commented Apr 5, 2018

2 things found testing:

  • deleting a backup while it's in progress is allowed. this could lead to snapshots getting orphaned if it's deleted after the snapshot is triggered but before it's recorded in the backup obj (I was able to make this happen). It's certainly an edge case, and blocking deletion of In Progress backups may not be desirable (e.g. if they get stuck there for some reason), so not sure if it makes sense to attempt to prevent or not.
  • I tested on GKE this time, and if I manually deleted a snapshot prior to ark backup deleteing the backup linked to the snapshot, I now get an error with the deletion because the GCP API returns a not found error from the DeleteSnapshot call. Per prior comments, this goes through without error on Azure (I think the Azure API doesn't return an error in this scenario). Haven't tried on AWS yet. It would be nice to have consistent behavior across clouds if possible.

Otherwise, all LGTM!

@skriss
Copy link
Member

skriss commented Apr 5, 2018

As a followup to the previous comment, the scenario in the 2nd bullet also errors on AWS.

@ncdc
Copy link
Contributor Author

ncdc commented Apr 6, 2018

deleting a backup while it's in progress is allowed. this could lead to snapshots getting orphaned if it's deleted after the snapshot is triggered but before it's recorded in the backup obj (I was able to make this happen). It's certainly an edge case, and blocking deletion of In Progress backups may not be desirable (e.g. if they get stuck there for some reason), so not sure if it makes sense to attempt to prevent or not.

The problem is knowing if an InProgress backup is truly in progress, or stuck. Maybe it's running, or maybe the Ark server was restarted in the middle of the backup.

We could add an in-memory list/map/set of in progress backups and disallow deleting those.

I tested on GKE this time, and if I manually deleted a snapshot prior to ark backup deleteing the backup linked to the snapshot, I now get an error with the deletion because the GCP API returns a not found error from the DeleteSnapshot call. Per prior comments, this goes through without error on Azure (I think the Azure API doesn't return an error in this scenario). Haven't tried on AWS yet. It would be nice to have consistent behavior across clouds if possible.

This requires that we change the protobuf messages to return better errors, so we can identify "not found" properly. I'd like to defer fixing this until we make the plugin/protobuf changes. That ok with you?

Signed-off-by: Andy Goldstein <andy.goldstein@gmail.com>
@ncdc
Copy link
Contributor Author

ncdc commented Apr 6, 2018

Latest commit keeps track of in-progress backups in memory and doesn't allow deletion of them.

@skriss
Copy link
Member

skriss commented Apr 6, 2018

OK, latest LGTM

@skriss skriss merged commit 25d46a7 into vmware-tanzu:master Apr 6, 2018
@ncdc ncdc deleted the delete-backup-request branch April 6, 2018 19:24
This pull request was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants