
Re-test restore of CAPI clusters with upstream fix to allow machine adoption #2554

Closed
wjun opened this issue May 21, 2020 · 9 comments


wjun commented May 21, 2020

What steps did you take and what happened:
I have a CAPV management cluster created by TKG 1.0 and created a workload cluster from it. I use Velero to back up only the namespace that the workload cluster's objects belong to. When the TKG management cluster crashes, I create a new management cluster and restore the workload cluster objects into it from the backup, so I can continue to manage the existing workload clusters (scale nodes, etc.).

The major functions work, but I noticed that when the restore starts, creation of a new control plane node for the workload cluster is triggered. This is unnecessary, since my workload cluster is already running.

What did you expect to happen:
I expect only the workload cluster objects to be restored, with no changes made to the target workload cluster during the restore.

Anything else you would like to add:
I suspect this is caused by the workload Cluster object continuing to reconcile after it is restored, while the other objects have not been restored yet. CAPI v1alpha3 added the Cluster.Spec.Paused field (https://cluster-api.sigs.k8s.io/developer/providers/v1alpha2-to-v1alpha3.html#support-the-clusterx-k8siopaused-annotation-and-clusterspecpaused-field); setting it to true pauses reconciliation of the cluster and all associated objects, so the cluster will not reconcile newly restored objects during the restore. Once the restore completes, we can set the field back to false, as sketched below.
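For illustration, a minimal sketch of pausing and resuming reconciliation with kubectl, assuming a workload cluster named my-cluster in namespace tkg-workloads (both placeholder names, not from this setup):

```sh
# Pause reconciliation of the workload Cluster (and its associated CAPI objects)
# before starting the restore. "my-cluster" and "tkg-workloads" are placeholders.
kubectl patch cluster my-cluster -n tkg-workloads \
  --type merge -p '{"spec":{"paused":true}}'

# ... run the Velero restore here ...

# Resume reconciliation once the restore has completed.
kubectl patch cluster my-cluster -n tkg-workloads \
  --type merge -p '{"spec":{"paused":false}}'
```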

Environment:

  • Velero version (use velero version): 1.3.2
  • Kubernetes version (use kubectl version): 1.17.3
  • Kubernetes installer & version: tkg 1.0 cli
  • Cloud provider or hardware configuration: capi v1alpha3 + capv
  • OS (e.g. from /etc/os-release): photon-3

ashish-amarnath commented May 21, 2020

@wjun Thanks for reporting this.
IMO this is an issue with the CAPV controllers and not Velero.
Velero is merely restoring the objects that were present in the backup.
CAPV controllers should, while reconciling the restored objects, detect that the infrastructure for those objects already exists and simply adopt the existing infrastructure resources rather than provisioning new ones.

More specifically, the issue seems to be in the kubeadm control plane controller, which fails to identify machines it owns because the UID in the owner reference is different.
This is the line of code I am referring to:
https://github.com/kubernetes-sigs/cluster-api/blob/master/controlplane/kubeadm/controllers/controller.go#L228
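For illustration, a hypothetical way to see the mismatch on a restored control plane Machine (object and namespace names are placeholders): the UID recorded in its owner reference still refers to the object from the old management cluster, while the restored owner gets a freshly generated UID.

```sh
# UID stored in the Machine's owner reference (carried over from the backup)
kubectl get machine my-cluster-control-plane-abcde -n tkg-workloads \
  -o jsonpath='{.metadata.ownerReferences[0].uid}'

# UID of the restored KubeadmControlPlane (regenerated on restore, so it differs)
kubectl get kubeadmcontrolplane my-cluster-control-plane -n tkg-workloads \
  -o jsonpath='{.metadata.uid}'
```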


wjun commented May 22, 2020

@ashish-amarnath I checked the owner references after the restore and noticed that the existing control plane node object's owner reference is set to the current workload Cluster object's UID, which is incorrect because its owner should be the KubeadmControlPlane, while all other CAPI objects' owner references have been updated to the new UIDs correctly. Do you know whether Velero updated the owner references during the restore, or CAPI did?
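For reference, one way to inspect this (the namespace name is a placeholder) is to list each Machine's owner kind and UID and compare them with the restored Cluster and KubeadmControlPlane objects:

```sh
# Show each restored Machine's first owner reference to spot owners that
# point at the wrong kind or a stale UID.
kubectl get machines -n tkg-workloads \
  -o custom-columns='NAME:.metadata.name,OWNER-KIND:.metadata.ownerReferences[0].kind,OWNER-UID:.metadata.ownerReferences[0].uid'
```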

@ashish-amarnath

@wjun
The OwnerRef on the machine objects is set by the kubeadm control plane controller. The UID is generated by the Kubernetes system to uniquely identify an object over its lifetime.
The UIDs for objects are regenerated when they are restored in the new management cluster.

ashish-amarnath self-assigned this May 22, 2020

wjun commented May 25, 2020

@ashish-amarnath My question is which component should take care of updating the owner references during restore, since the owners' UIDs have changed. This looks like a generic problem that Velero may want to handle.


nrb commented May 26, 2020

kubernetes-sigs/cluster-api#2489 should allow for control plane adoption. My understanding is that once it's in a CAPI release, Velero will be in a better position to restore CAPI objects so that they will be adopted.


skriss commented Jun 10, 2020

Sounds like the action item for now is simply a re-test to determine if the CAPI change fixes this or not.

Are we able to re-test now that the fix has merged, or do we have to wait for a release here?

skriss changed the title from "During the restore of capi workload cluster objects, a new control plane node is created" to "Re-test restore of CAPI clusters with upstream fix to allow machine adoption" Jun 10, 2020
skriss added this to the v1.5 milestone Jun 10, 2020

skriss commented Jun 10, 2020

Moved this into the 1.5 milestone for retest.

@ashish-amarnath

I can re-test with the fix. It's on my list. If there is no release with the fix by the time I get to it, I can build my own images to test it out.


wjun commented Jul 16, 2020

I have re-tested with Cluster API v0.3.7, and the issue has been fixed.
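For anyone repeating this re-test, a minimal check after the restore (resource and namespace names are placeholders) is to confirm that the existing control plane Machines were adopted rather than re-created:

```sh
# With Cluster API >= v0.3.7, the restored control plane Machines should be
# adopted by the KubeadmControlPlane instead of new ones being provisioned.
kubectl get machines -n tkg-workloads \
  -o custom-columns='NAME:.metadata.name,OWNER:.metadata.ownerReferences[0].kind,PHASE:.status.phase'
```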
