Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

✨ OnDelete rollout strategy #4346

Merged

Conversation

relyt0925
Copy link
Contributor

What this PR does / why we need it:
This PR adds a new rollout strategy in which a user can fully control the rollout of machines to a new configuration. The strategy is called "OnDelete" and behaves as follows.

Users control upgrading to a new MachineSet Configuration by triggering deletion of machines in the old machine set(s) with kubectl delete machine MACHINENAME. The machineDeployment controller waits for the machine to get deleted completely and then will proceed to provision the new replica with the new configuration.

In this model: in order to complete the rollout the user is responsible for ultimately "replacing" every machine using old configuration. This allows the user to identify and execute any order of machine upgrades they choose. They control the velocity of the rollout as well by how quickly they remove machines in the old configuration. The rollout of a new machine does not occur until the old machine is fully removed in the current implementation.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes # #4344

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Mar 21, 2021
@k8s-ci-robot
Copy link
Contributor

Welcome @relyt0925!

It looks like this is your first PR to kubernetes-sigs/cluster-api 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/cluster-api has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot
Copy link
Contributor

Hi @relyt0925. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Mar 21, 2021
@relyt0925 relyt0925 changed the title ✨ OnDelete rollout strategy ✨ DNM: Prototype PR OnDelete rollout strategy Mar 21, 2021
@relyt0925 relyt0925 changed the title ✨ DNM: Prototype PR OnDelete rollout strategy ✨ DNM: Prototype PR: OnDelete rollout strategy Mar 21, 2021
@fabriziopandini
Copy link
Member

/milestone v0.4.0
/ok-to-test

@k8s-ci-robot k8s-ci-robot added the ok-to-test Indicates a non-member PR verified by an org member that is safe to test. label Mar 22, 2021
@k8s-ci-robot k8s-ci-robot added this to the v0.4.0 milestone Mar 22, 2021
@k8s-ci-robot k8s-ci-robot removed the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Mar 22, 2021
@relyt0925 relyt0925 force-pushed the ondelete-rollout-strategy branch 4 times, most recently from 0c0409d to 67421ec Compare March 23, 2021 02:05
@relyt0925
Copy link
Contributor Author

/retest

@relyt0925 relyt0925 changed the title ✨ DNM: Prototype PR: OnDelete rollout strategy ✨ OnDelete rollout strategy Mar 23, 2021
@relyt0925 relyt0925 force-pushed the ondelete-rollout-strategy branch 9 times, most recently from 671ad55 to cf9a163 Compare March 23, 2021 04:16
@@ -175,7 +175,7 @@ func TestReconcileUpdateObservedGeneration(t *testing.T) {
errGettingObject = testEnv.Get(ctx, util.ObjectKey(kcp), kcp)
g.Expect(errGettingObject).NotTo(HaveOccurred())
return kcp.Status.ObservedGeneration
}, 10*time.Second).Should(Equal(generation))
}, 20*time.Second).Should(Equal(generation))
Copy link
Contributor

@CecileRobertMichon CecileRobertMichon Apr 13, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@relyt0925 let's revert this given that #4466 has merged

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

got it!

@CecileRobertMichon
Copy link
Contributor

lgtm pending enum validation + revert of increased test timeout commit

// DisableMachineCreate is an annotation that can be used to signal a MachineSet to stop creating new machines.
// It is utilized in the OnDelete MachineDeploymentStrategy to allow the MachineDeployment controller to scale down
// older MachineSets when Machines are deleted and add the new replicas to the latest MachineSet.
DisableMachineCreate = "cluster.x-k8s.io/disable-machine-create"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might want to open an issue regarding pause and the update of statuses like @detiber suggested

@@ -199,6 +199,10 @@ func (r *MachineDeploymentReconciler) reconcile(ctx context.Context, cluster *cl
return ctrl.Result{}, r.rolloutRolling(ctx, d, msList)
}

if d.Spec.Strategy.Type == clusterv1.OnDeleteMachineDeploymentStrategyType {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm +1 adding the enum-based validation now as part of the v0.4 release

@@ -0,0 +1,182 @@
/*
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we name this file machinedeployment_rollout_ondelete?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can! and done

return nil
}

//reconcileOldMachineSetsOnDelete handles reconciliation of Old MachineSets associated with the MachineDeployment in the OnDelete MachineDeploymentStrategyType.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
//reconcileOldMachineSetsOnDelete handles reconciliation of Old MachineSets associated with the MachineDeployment in the OnDelete MachineDeploymentStrategyType.
// reconcileOldMachineSetsOnDelete handles reconciliation of Old MachineSets associated with the MachineDeployment in the OnDelete MachineDeploymentStrategyType.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

log.V(4).Error(err, "failed to convert MachineSet %q label selector to a map", oldMS.Name)
continue
}
log.V(4).Info("fetching Machines associated with MachineSet", "MachineSet", oldMS.Name)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
log.V(4).Info("fetching Machines associated with MachineSet", "MachineSet", oldMS.Name)
log.V(4).Info("Fetching Machines associated with MachineSet", "MachineSet", oldMS.Name)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment on lines 99 to 100
err = patchHelper.Patch(ctx, oldMS)
if err != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
err = patchHelper.Patch(ctx, oldMS)
if err != nil {
if err := patchHelper.Patch(ctx, oldMS); err != nil {

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment on lines 112 to 119
err = r.Client.List(ctx,
allMachinesInOldMS,
client.InNamespace(oldMS.Namespace),
client.MatchingLabels(selectorMap),
)
if err != nil {
return errors.Wrap(err, "failed to list machines")
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
err = r.Client.List(ctx,
allMachinesInOldMS,
client.InNamespace(oldMS.Namespace),
client.MatchingLabels(selectorMap),
)
if err != nil {
return errors.Wrap(err, "failed to list machines")
}
if err := r.Client.List(ctx,
allMachinesInOldMS,
client.InNamespace(oldMS.Namespace),
client.MatchingLabels(selectorMap),
); err != nil {
return errors.Wrap(err, "failed to list machines")
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

log.V(4).Error(errors.Errorf("unexpected negative scale down amount: %d", machineSetScaleDownAmountDueToMachineDeletion), fmt.Sprintf("Error reconciling MachineSet %s", oldMS.Name))
}
scaleDownAmount -= machineSetScaleDownAmountDueToMachineDeletion
log.V(4).Info("adjusting replica count for deleted machines", "replicaCount", oldMS.Name, "replicas", updatedReplicaCount)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
log.V(4).Info("adjusting replica count for deleted machines", "replicaCount", oldMS.Name, "replicas", updatedReplicaCount)
log.V(4).Info("Adjusting replica count for deleted machines", "replicaCount", oldMS.Name, "replicas", updatedReplicaCount)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done


if ms.Annotations != nil {
if _, ok := ms.Annotations[clusterv1.DisableMachineCreate]; ok {
log.Info("Automatic creation of new machines disabled for machine set")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
log.Info("Automatic creation of new machines disabled for machine set")
log.V(2).Info("Automatic creation of new machines disabled for machine set")

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if this should be a condition of some sorts.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

initial comment done.

@@ -175,7 +175,7 @@ func TestReconcileUpdateObservedGeneration(t *testing.T) {
errGettingObject = testEnv.Get(ctx, util.ObjectKey(kcp), kcp)
g.Expect(errGettingObject).NotTo(HaveOccurred())
return kcp.Status.ObservedGeneration
}, 10*time.Second).Should(Equal(generation))
}, 20*time.Second).Should(Equal(generation))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Revert this? #4466 disabled this test temporarily

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done!

@relyt0925
Copy link
Contributor Author

Issue opened for potential bug with paused annotation:
#4473

@relyt0925 relyt0925 force-pushed the ondelete-rollout-strategy branch 2 times, most recently from c9dcf9b to 105d0dc Compare April 13, 2021 20:35
@relyt0925
Copy link
Contributor Author

/test pull-cluster-api-test-main

1 similar comment
@relyt0925
Copy link
Contributor Author

/test pull-cluster-api-test-main

@relyt0925
Copy link
Contributor Author

@CecileRobertMichon @vincepri @detiber all comments should be addressed.

@CecileRobertMichon
Copy link
Contributor

did you rebase on top of the latest master branch to include #4466?

@relyt0925
Copy link
Contributor Author

/test pull-cluster-api-test-main

@relyt0925
Copy link
Contributor Author

I did!

return errors.Wrap(err, "failed to list machines")
}
totalMachineCount := int32(len(allMachinesInOldMS.Items))
log.V(4).Info("retrieved machines", "totalMachineCount", totalMachineCount)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: use Uppercase Retrieved

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

dont rely on status

update

check for nil

update

update strategy

double wait time for flakey kubeadm test

update with comments

update
Copy link
Member

@vincepri vincepri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 14, 2021
@relyt0925
Copy link
Contributor Author

/approve

@relyt0925
Copy link
Contributor Author

@fabriziopandini would you mind issuing a /approve when you have a moment. Thank you everyone for your time!!!!

@detiber
Copy link
Member

detiber commented Apr 14, 2021

/approve

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: detiber, relyt0925

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 14, 2021
@k8s-ci-robot k8s-ci-robot merged commit 3bc596e into kubernetes-sigs:master Apr 14, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants