
🐛 clusterctl move preflight check can fail even if the target cluster is up and running #4870

Closed
dlipovetsky opened this issue Jun 30, 2021 · 10 comments
Labels
area/clusterctl Issues or PRs related to clusterctl lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@dlipovetsky
Contributor

What steps did you take and what happened:
I deployed a workload cluster. I then deployed the CAPI controllers on the workload cluster. Finally, I tried to move the workload cluster objects from the bootstrap cluster to the workload cluster. But clusterctl move failed one of its preflight checks:

Error: unable to pivot to the to-cluster: cannot start the move operation while "/, Kind=" default/example-control-plane-mpndl is still provisioning the node

The error is created here:

if machineObj.Status.NodeRef == nil {
errList = append(errList, errors.Errorf("cannot start the move operation while %q %s/%s is still provisioning the node", machineObj.GroupVersionKind(), machineObj.GetNamespace(), machineObj.GetName()))
}

In this case, the preflight check failed because KubeadmControlPlane controller was scaling up the control plane, and the newest Machine did not yet have a NodeRef.

It's worth noting that this preflight check can fail under other circumstances. For example, if a MachineDeployment is scaling up.

What did you expect to happen:

  1. I want to understand why this preflight check is required. The workload cluster is up and running. I think objects can be safely moved, even if some Machine is missing a NodeRef, but I'm not certain. (@randomvariable mentioned in the 06/30/2021 CAPI meeting that this check might have been introduced to make sure that all control plane replicas had finished deploying)

  2. This preflight check might fail any time a new Machine is created. So, if the preflight check is required, I want to understand how to proceed when it fails:

    • Should I run clusterctl move repeatedly until it passes?
    • Should I perform some steps to ensure the preflight check passes? For example, I could pause reconciliation of KubeadmControlPlane, MachineDeployment, etc., so that no new Machine objects are created before running clusterctl move.
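For context, Cluster API supports pausing reconciliation either through the Cluster's spec.paused field or through the cluster.x-k8s.io/paused annotation on individual objects. A minimal sketch of the annotation-based approach, using plain maps to stand in for object metadata (the helper name pauseAnnotations is hypothetical; a real implementation would patch the object through a Kubernetes client):

```go
package main

import "fmt"

// pausedAnnotation is the Cluster API annotation that tells controllers
// to skip reconciliation of the annotated object.
const pausedAnnotation = "cluster.x-k8s.io/paused"

// pauseAnnotations returns a copy of the given annotations with the
// paused annotation set. Plain maps keep this sketch self-contained.
func pauseAnnotations(anns map[string]string) map[string]string {
	out := map[string]string{}
	for k, v := range anns {
		out[k] = v
	}
	out[pausedAnnotation] = "true"
	return out
}

func main() {
	anns := pauseAnnotations(map[string]string{"cluster.x-k8s.io/cluster-name": "example"})
	fmt.Println(anns[pausedAnnotation]) // prints "true"
}
```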

Environment:

  • Cluster-api version: v0.3.19
  • Minikube/KIND version: v0.11.1
  • Kubernetes version: (use kubectl version): v1.21.1
  • OS (e.g. from /etc/os-release):

/kind bug
/area clusterctl

@fabriziopandini

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. area/clusterctl Issues or PRs related to clusterctl labels Jun 30, 2021
@CecileRobertMichon
Contributor

cc @arunmk

@fabriziopandini
Member

My two cents

  • technically, there is no clean way for clusterctl to determine if a provider in the source cluster is performing some long-term operation, e.g. provisioning a VM or some other piece of infrastructure.
  • if a move happens in the middle of one of those long-term operations, the operation will most probably fail because the Cluster API resources no longer exist in the source cluster; in the worst case, this might lead to orphaning bits of infrastructure.
  • the preflight checks provide the first part of the current solution to this problem: by checking that provisioning is completed, we can reasonably assume the most common long-term operations have completed. The second part is pausing controllers before moving, so that no new long-term operations start while moving.

WRT how to handle init & move, which is the primary use case this command has been designed for, the recommended procedure is described in https://cluster-api.sigs.k8s.io/clusterctl/commands/move.html#bootstrap--pivot , where the key step for this thread is most probably "Wait for the target management cluster to be up and running" (--> provisioned).

FYI, a recent attempt to redefine the scope of clusterctl, including identifying the different requirements for using move outside the initial bootstrap scenario, did not gain enough traction (see #3354).

@dlipovetsky
Contributor Author

Thank you for joining in the discussion, @fabriziopandini! 🙏

the key step for this thread is most probably "Wait for the target management cluster to be up and running" (--> provisioned).

To execute this step, I need some criteria that tell me that my cluster is "up and running." If I understand you correctly, you are saying that the criteria may be different for each cluster, and that the clusterctl preflight checks are a necessary subset of the criteria, but not the complete criteria:

  • technically, there is no clean way for clusterctl to determine if a provider in the source cluster is performing some long-term operation, e.g. provisioning a VM or some other piece of infrastructure.
  • if a move happens in the middle of one of those long-term operations, the operation will most probably fail because the Cluster API resources no longer exist in the source cluster; in the worst case, this might lead to orphaning bits of infrastructure.

I don't understand why this claim must be true.

AFAIK, we expect the controllers to be "re-entrant," i.e., to be able to resume no matter when they crash. (In practice, this means that the controllers' actions must be idempotent).

Is there a well-understood difference between the effect of "move" of all cluster objects, and the simultaneous crash (and restart) of all provider controllers?

(I do not see a difference, but I may be missing something important!)

If we can answer this question, I think we can make progress in this thread.

@fabriziopandini fabriziopandini removed the kind/bug Categorizes issue or PR as related to a bug. label Jul 6, 2021
@fabriziopandini
Member

fabriziopandini commented Jul 6, 2021

I need some criteria that tell me that my cluster is "up and running."

IMO having all the machines in a cluster provisioned is a good signal; if you want to stay on the safe side, you could also check that all the nodes are in the Ready state.

I don't understand why this claim must be true.

Let's start by saying that I'm more than happy to re-open the discussion about redefining the scope of move or changing its requirements if more people are now interested in it (my previous attempt failed).

WRT the previous comment, I was not really making a claim, just noting that we might end up with some orphaned infrastructure, and I fully agree with you that this should not happen if re-entrancy is properly implemented by all the providers. But why take the risk if we can avoid it?

Considering ^^, the fact that we are focused on the init & move use case only (which is typically a "controlled" sequence), and the intrinsic complexity of move (see e.g. recent changes for supporting global credentials), I personally think that the existing preflight checks are still reasonable, and the more we can avoid moving while things are changing, the better (even if this could be seen as overprotection); but as I said, happy to reconsider.

@dlipovetsky
Contributor Author

dlipovetsky commented Jul 7, 2021

IMO having all the machines in a cluster provisioned is a good signal; if you want to stay on the safe side, you could also check that all the nodes are in the Ready state.

Ok, let's say this is what we recommend. How do we expect a user to get this signal? Do we expect users to implement their own solutions? Can we provide a clusterctl command? Do we expect the user to run clusterctl move repeatedly until it passes?

Would adding Conditions in MachineDeployment (#3486) help?

@vincepri
Member

/milestone Next

@k8s-ci-robot k8s-ci-robot added this to the Next milestone Jul 28, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 26, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Nov 26, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
