
Re-test restore of CAPI clusters with upstream fix to allow machine adoption #2554

Closed
wjun opened this issue May 21, 2020 · 9 comments


wjun commented May 21, 2020

What steps did you take and what happened:
I have a CAPV management cluster created by TKG 1.0 and created a workload cluster from it. I use Velero to back up only the namespace that the workload cluster's objects belong to. When the TKG management cluster crashes, I create a new management cluster and restore the workload cluster objects into it from the backup, so I can continue to manage the existing workload clusters (scale nodes, etc.).

The major functions work, but I noticed that when the restore starts, creation of a new control plane node for the workload cluster is triggered. This is unnecessary, since my workload cluster is already running.

What did you expect to happen:
I expect only the workload cluster objects to be restored, with no changes made to the target workload cluster during the restore.

Anything else you would like to add:
I suspect this is caused by the workload Cluster object continuing to reconcile after it is restored, while the other objects have not been restored yet. CAPI v1alpha3 added the Cluster.Spec.Paused field (https://cluster-api.sigs.k8s.io/developer/providers/v1alpha2-to-v1alpha3.html#support-the-clusterx-k8siopaused-annotation-and-clusterspecpaused-field); setting it to true pauses reconciliation of the cluster and all associated objects, so the cluster will not reconcile newly restored objects during the restore. Once the restore completes, we can set the field back to false, as sketched below.
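For illustration, a minimal sketch of pausing and resuming reconciliation with kubectl, assuming a workload cluster named my-cluster in namespace tkg-workloads (both placeholder names, not from this setup):

```sh
# Pause reconciliation of the workload Cluster (and its associated CAPI objects)
# before starting the restore. "my-cluster" and "tkg-workloads" are placeholders.
kubectl patch cluster my-cluster -n tkg-workloads \
  --type merge -p '{"spec":{"paused":true}}'

# ... run the Velero restore here ...

# Resume reconciliation once the restore has completed.
kubectl patch cluster my-cluster -n tkg-workloads \
  --type merge -p '{"spec":{"paused":false}}'
```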

Environment:

  • Velero version (use velero version): 1.3.2
  • Kubernetes version (use kubectl version): 1.17.3
  • Kubernetes installer & version: tkg 1.0 cli
  • Cloud provider or hardware configuration: capi v1alpha3 + capv
  • OS (e.g. from /etc/os-release): photon-3

ashish-amarnath commented May 21, 2020

@wjun Thanks for reporting this.
IMO this is an issue with the CAPV controllers and not Velero.
Velero is merely restoring the objects that were present in the backup.
CAPV controllers should, while reconciling the restored objects, detect that the infrastructure for those objects already exists and simply adopt the existing infrastructure resources rather than provisioning new ones.

More specifically, the issue seems to be in the kubeadm control plane controller, which fails to identify machines it owns because the UID in the owner reference is different.
This is the line of code I am referring to:
https://github.com/kubernetes-sigs/cluster-api/blob/master/controlplane/kubeadm/controllers/controller.go#L228
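For illustration, a hypothetical way to see the mismatch on a restored control plane Machine (object and namespace names are placeholders): the UID recorded in its owner reference still refers to the object from the old management cluster, while the restored owner gets a freshly generated UID.

```sh
# UID stored in the Machine's owner reference (carried over from the backup)
kubectl get machine my-cluster-control-plane-abcde -n tkg-workloads \
  -o jsonpath='{.metadata.ownerReferences[0].uid}'

# UID of the restored KubeadmControlPlane (regenerated on restore, so it differs)
kubectl get kubeadmcontrolplane my-cluster-control-plane -n tkg-workloads \
  -o jsonpath='{.metadata.uid}'
```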


wjun commented May 22, 2020

@ashish-amarnath I checked the owner references after the restore and noticed that the existing control plane node object's owner reference is set to the current workload Cluster object's UID, which is incorrect because its owner should be the KubeadmControlPlane, while all other CAPI objects' owner references have been updated to the new UIDs correctly. Do you know whether Velero updated the owner references during the restore, or CAPI did?
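For reference, one way to inspect this (the namespace name is a placeholder) is to list each Machine's owner kind and UID and compare them with the restored Cluster and KubeadmControlPlane objects:

```sh
# Show each restored Machine's first owner reference to spot owners that
# point at the wrong kind or a stale UID.
kubectl get machines -n tkg-workloads \
  -o custom-columns='NAME:.metadata.name,OWNER-KIND:.metadata.ownerReferences[0].kind,OWNER-UID:.metadata.ownerReferences[0].uid'
```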

@ashish-amarnath

@wjun
The OwnerRef on the machine objects is set by the kubeadm control plane controller. The UID is generated by the Kubernetes system to uniquely identify an object over its lifetime.
The UIDs for objects are regenerated when they are restored in the new management cluster.

ashish-amarnath self-assigned this May 22, 2020

wjun commented May 25, 2020

@ashish-amarnath My question is which component should take care of updating the owner references during restore, since the owners' UIDs have changed. This looks like a generic problem that Velero may want to handle.


nrb commented May 26, 2020

kubernetes-sigs/cluster-api#2489 should allow for control plane adoption. My understanding is that once it's in a CAPI release, Velero will be in a better position to restore CAPI objects so that they will be adopted.


skriss commented Jun 10, 2020

Sounds like the action item for now is simply a re-test to determine if the CAPI change fixes this or not.

Are we able to re-test now that the fix has merged, or do we have to wait for a release here?

skriss changed the title from "During the restore of capi workload cluster objects, a new control plane node is created" to "Re-test restore of CAPI clusters with upstream fix to allow machine adoption" Jun 10, 2020
skriss added this to the v1.5 milestone Jun 10, 2020

skriss commented Jun 10, 2020

Moved this into the 1.5 milestone for retest.

@ashish-amarnath

I can re-test with the fix. It's on my list. If there is no release with the fix by the time I get to it, I can build my own images to test it out.


wjun commented Jul 16, 2020

I have re-tested with Cluster API v0.3.7, and the issue has been fixed.
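For anyone repeating this re-test, a minimal check after the restore (resource and namespace names are placeholders) is to confirm that the existing control plane Machines were adopted rather than re-created:

```sh
# With Cluster API >= v0.3.7, the restored control plane Machines should be
# adopted by the KubeadmControlPlane instead of new ones being provisioned.
kubectl get machines -n tkg-workloads \
  -o custom-columns='NAME:.metadata.name,OWNER:.metadata.ownerReferences[0].kind,PHASE:.status.phase'
```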
