Automatically manage restic repos #606

Merged: 1 commit merged into vmware-tanzu:master from restic-auto-manage-repos on Jun 27, 2018

Conversation

@skriss (Contributor) commented Jun 25, 2018

Fixes #577

Builds on #570, ignore first commit.

Replaces ark restic repo init with automatic repository initialization, using a single static non-secret restic key.

@skriss added this to the v0.9.0 milestone on Jun 25, 2018
@skriss requested review from ncdc and nrb on June 25, 2018 16:37
return nil, errors.New("restic repository is not ready")
}
return repo, nil
}
Contributor:

Might add a comment here that the rest of the function is for the apierrors.IsNotFound(err) == true case.
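Something along these lines, presumably (the exact comment that was added isn't shown in this hunk):

```go
// everything below handles the apierrors.IsNotFound(err) == true case,
// i.e. the ResticRepository doesn't exist yet
```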

Contributor Author (@skriss):

added.

@@ -85,7 +84,7 @@ func (r *restorer) RestorePodVolumes(restore *arkv1api.Restore, pod *corev1api.P
		return nil
	}

-	repo, err := getReadyRepo(r.repoLister, restore.Namespace, pod.Namespace)
+	repo, err := r.repoEnsurer.EnsureRepo(r.ctx, restore.Namespace, pod.Namespace)
Contributor:

Would we ever end up creating a repo here? I'm thinking that even if we do, the data and buckets are in an invalid state already.

Contributor:

This has the potential to create the restic repo. We shouldn't be doing that as part of a restore.

Contributor Author (@skriss), Jun 26, 2018:

How else would you handle the case where a user installs Ark in a fresh cluster and wants to restore an existing backup from another cluster with restic snapshots?

Contributor:

My bad, I wasn't thinking about the restore half of the equation (that's what happens when I try to wrap up a code review at the end of the day after a long meeting!).

func (b *backupper) BackupPodVolumes(backup *arkv1api.Backup, pod *corev1api.Pod, log logrus.FieldLogger) (map[string]string, []error) {
	// get volumes to backup from pod's annotations
	volumesToBackup := GetVolumesToBackup(pod)
	if len(volumesToBackup) == 0 {
		return nil, nil
	}

-	repo, err := getReadyRepo(b.repoLister, backup.Namespace, pod.Namespace)
+	repo, err := b.repoEnsurer.EnsureRepo(b.ctx, backup.Namespace, pod.Namespace)
	if err != nil {
		return nil, []error{err}
	}
Contributor:

This function is executed in the daemonset, correct?

I ran into an issue yesterday that was partially my fault, but probably warrants a log message. If the daemonset exists, but no pods in it are ready, the restic backup process will still proceed and then hang waiting for the daemonset pod to finish working. In my testing, it waited for over 5 minutes and couldn't be stopped with ctrl-c, only a kill -9.

I can probably replicate if needed to walk through this.
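Not something this PR adds, but one way to surface that situation would be a readiness check along these lines (a sketch, assuming the same kubeClient/logger fields used in the server's daemonset check later in this review):

```go
// Warn when the restic daemonset exists but has no ready pods, since pod volume
// backups/restores will otherwise block until one becomes ready.
if ds, err := s.kubeClient.AppsV1().DaemonSets(s.namespace).Get(restic.DaemonSet, metav1.GetOptions{}); err == nil && ds.Status.NumberReady == 0 {
	s.logger.Warn("Ark restic daemonset has no ready pods; restic backups/restores will hang until one is ready")
}
```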

@ncdc (Contributor) left a comment:

I made it through most of the PR, but there are some files at the end I didn't have time for yet before EOD. I wanted to submit now so you can see my comments on the repositoryEnsurer.

	// warn if restic daemonset does not exist
	if _, err := s.kubeClient.AppsV1().DaemonSets(s.namespace).Get(restic.DaemonSet, metav1.GetOptions{}); apierrors.IsNotFound(err) {
		s.logger.Warn("Ark restic daemonset not found; restic backups/restores will not work until it's created")
	}
Contributor:

Do we want to log a warning if we got some other error?
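A minimal sketch of that, extending the snippet above (the else branch is an assumption, not necessarily what was merged):

```go
if _, err := s.kubeClient.AppsV1().DaemonSets(s.namespace).Get(restic.DaemonSet, metav1.GetOptions{}); apierrors.IsNotFound(err) {
	s.logger.Warn("Ark restic daemonset not found; restic backups/restores will not work until it's created")
} else if err != nil {
	// Surface unexpected errors too, rather than silently ignoring them.
	s.logger.WithError(err).Warn("Error checking for Ark restic daemonset")
}
```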

return []error{err}
}

ctx, cancelFunc := context.WithTimeout(context.Background(), time.Minute)
Contributor:

Let's extract the time.Minute as a const
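For example (the constant name is an assumption):

```go
// resticOperationTimeout bounds how long a single restic repository operation may run.
const resticOperationTimeout = time.Minute

ctx, cancelFunc := context.WithTimeout(context.Background(), resticOperationTimeout)
defer cancelFunc()
```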

}
}

func (r *repositoryEnsurer) onReady(name string) chan *arkv1api.ResticRepository {
Contributor:

Would probably call this something else like newReadyChan

@@ -11,26 +29,27 @@ import (
const (
CredentialsSecretName = "ark-restic-credentials"
CredentialsKey = "ark-restic-credentials"
Contributor:

We should change the name of this key inside the secret, don't you think?

@@ -11,26 +29,27 @@ import (
const (
CredentialsSecretName = "ark-restic-credentials"
CredentialsKey = "ark-restic-credentials"
EncryptionKey = "static-passw0rd"
Contributor:

Does this need to be exported?

	secret := &corev1api.Secret{
		ObjectMeta: metav1.ObjectMeta{
			Namespace: namespace,
			Name:      CredentialsSecretName,
		},
		Type: corev1api.SecretTypeOpaque,
		Data: map[string][]byte{
-			CredentialsKey: data,
+			CredentialsKey: []byte(EncryptionKey),
		},
	}

	_, err := secretClient.Secrets(namespace).Create(secret)
Contributor:

Let's do a Get() first to avoid more of #281
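A sketch of the Get-before-Create pattern being suggested, reusing the names from the snippet above (the surrounding error handling is an assumption):

```go
// Treat an already-existing credentials secret as a no-op instead of failing
// on an AlreadyExists error from Create (see #281).
if _, err := secretClient.Secrets(namespace).Get(CredentialsSecretName, metav1.GetOptions{}); err == nil {
	return nil
} else if !apierrors.IsNotFound(err) {
	return errors.WithStack(err)
}

if _, err := secretClient.Secrets(namespace).Create(secret); err != nil {
	return errors.WithStack(err)
}
return nil
```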

return errors.New("timed out waiting for cache to sync")
}

repo, err := rm.repoEnsurer.EnsureRepo(ctx, rm.namespace, snapshot.Repo)
Contributor:

This has the potential to create the restic repo. We shouldn't be doing that as part of deletions/GC.

Contributor Author (@skriss):

same question re: spinning up a new cluster, then having GC process backups that were synced in from an existing bucket.

Contributor:

Yup, my bad

readyChans map[string]chan *arkv1api.ResticRepository
}

func newRepositoryEnsurer(repoInformer arkv1informers.ResticRepositoryInformer, repoClient arkv1client.ResticRepositoriesGetter) *repositoryEnsurer {
Contributor:

I would recommend changing the logic so we create a new one of these per *restic.backupper. Then we don't need a lock or multiple channels and can instead just have a single channel.

I'm even tempted to say we can do away with the informer+channel. Move the initializeRepo functionality from *resticRepositoryController directly into the *restic.Backupper or this repositoryEnsurer. Then it's a lot simpler to implement EnsureRepo:

  1. Try to get the repo.
  2. If it exists & is ready, done.
  3. If it exists & isn't ready, wait (?) or call Check(). Return result of wait/check
  4. If it doesn't exist, try to create it
  5. If create succeeds, run the logic in initializeRepo
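
A rough sketch of the flow in that list, with assumed field names (repoLister, repoClient) and hypothetical helpers (waitUntilReady, newRepo, initializeRepo); the wait-vs-Check() choice in step 3 is left open:

```go
// EnsureRepo sketch of the proposed flow above; field and helper names are assumptions.
func (r *repositoryEnsurer) EnsureRepo(ctx context.Context, namespace, volumeNamespace string) (*arkv1api.ResticRepository, error) {
	// 1-3. Try to get the repo; if it exists, return it when ready, otherwise wait (or Check()).
	repo, err := r.repoLister.ResticRepositories(namespace).Get(volumeNamespace)
	if err == nil {
		if repo.Status.Phase == arkv1api.ResticRepositoryPhaseReady {
			return repo, nil
		}
		return r.waitUntilReady(ctx, repo) // hypothetical helper
	}
	if !apierrors.IsNotFound(err) {
		return nil, errors.WithStack(err)
	}

	// 4. It doesn't exist: try to create the ResticRepository custom resource.
	created, err := r.repoClient.ResticRepositories(namespace).Create(newRepo(namespace, volumeNamespace)) // newRepo is hypothetical
	if err != nil {
		return nil, errors.WithStack(err)
	}

	// 5. Run the initializeRepo logic (currently in resticRepositoryController) directly.
	if err := r.initializeRepo(ctx, created); err != nil {
		return nil, err
	}
	return created, nil
}
```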

Contributor Author (@skriss):

If you want to avoid having multiple channels, we'd need to have one of these per call to BackupPodVolumes in the backupper, since a backup can have >1 namespace/repo. Not sure exactly what that would look like, since I don't think we'd want to be creating a new informer each time, but we can't be adding/removing event handlers while an existing one is running.

I'm not sure I like the idea of moving the repo init logic into the backupper or ensurer, but I think we need to talk it through in person.

Why don't you like what I did?
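For reference, the per-repo ready-channel pattern being defended here looks roughly like this map-plus-mutex sketch (the lock field name is an assumption, and this is not a verbatim excerpt from the PR):

```go
// readyChan returns (creating on first use) the channel on which the informer's
// update handler delivers a repository once it transitions to Ready.
func (r *repositoryEnsurer) readyChan(name string) chan *arkv1api.ResticRepository {
	r.readyChansLock.Lock() // assumed sync.Mutex guarding the map
	defer r.readyChansLock.Unlock()

	if _, ok := r.readyChans[name]; !ok {
		r.readyChans[name] = make(chan *arkv1api.ResticRepository, 1)
	}
	return r.readyChans[name]
}
```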

Contributor:

Let me take a look at it again with fresh eyes. I had to draw out the flow and the relationship between the components (ensurer vs repo controller). I was trying to find a way to have a single component responsible for the creation and initialization of the repo, instead of having that split across 2 (the ensurer that creates the ResticRepository, and the controller that initializes it in object storage and sets the phase to Ready).

@ncdc (Contributor) commented Jun 27, 2018

@skriss does this look accurate for the EnsureRepo flow?

[diagram: EnsureRepo flow]

@skriss (Contributor Author) commented Jun 27, 2018

Looks right to me

@ncdc (Contributor) commented Jun 27, 2018

Backup flow: [diagram]

@ncdc (Contributor) commented Jun 27, 2018

Delete snapshot flow (from backup deletion controller): [diagram]

@ncdc (Contributor) commented Jun 27, 2018

Daemonset pod backup flow: [diagram]

@ncdc (Contributor) commented Jun 27, 2018

Restore overview: [diagram]

@ncdc (Contributor) commented Jun 27, 2018

DaemonSet pod restore flow: [diagram]

@skriss force-pushed the restic-auto-manage-repos branch from c8cb556 to c44a16a on June 27, 2018 20:30
@skriss (Contributor Author) commented Jun 27, 2018

Addressed items discussed in new commits; will do squashing post-review.

@ncdc (Contributor) commented Jun 27, 2018

LGTM

@skriss force-pushed the restic-auto-manage-repos branch from c44a16a to 22e8f23 on June 27, 2018 20:51
@skriss (Contributor Author) commented Jun 27, 2018

squashed.

@ncdc merged commit 51298f8 into vmware-tanzu:master on Jun 27, 2018
@skriss deleted the restic-auto-manage-repos branch on June 27, 2018 21:21