✨ Implement MachineDeployment rolloutAfter support #7053

Closed
wants to merge 1 commit into from

Conversation

chrischdi
Member

What this PR does / why we need it:

If the reconciliation time is after spec.rolloutAfter, a rollout should happen or has already happened.
A new MachineSet is created the first time the reconciliation time is after spec.rolloutAfter.
Otherwise, the oldest MachineSet with creation timestamp > lastRolloutAfter annotation is picked.
If a new MachineSet is required because reconciliation time > spec.rolloutAfter, the rolloutAfter time is included when computing the hash for the MachineSet name.
As a result, the name of the new MachineSet does not clash with the existing MachineSet that has the same template, and the rollout can be orchestrated as usual.
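
The naming idea described above can be sketched roughly as follows. This is only an illustration, not the code from this PR: machineSetName, the string stand-in for the machine template, and the FNV-based hash are all assumptions made to keep the example self-contained.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"time"
)

// machineSetName is a hypothetical helper (not the PR's generateMachineSetName).
// It hashes a string stand-in for the machine template and, once the
// reconciliation time is after rolloutAfter, also hashes the rolloutAfter
// value so that the resulting name differs for an otherwise equal template.
func machineSetName(mdName, template string, rolloutAfter *time.Time, now time.Time) string {
	h := fnv.New32a()
	h.Write([]byte(template))
	if rolloutAfter != nil && now.After(*rolloutAfter) {
		h.Write([]byte(rolloutAfter.UTC().Format(time.RFC3339)))
	}
	return fmt.Sprintf("%s-%d", mdName, h.Sum32())
}

// exampleNames returns names for: rolloutAfter unset, rolloutAfter in the
// past (rollout due), and rolloutAfter in the future (rollout not yet due).
func exampleNames() (unset, due, notDue string) {
	now := time.Date(2022, time.August, 11, 12, 0, 0, 0, time.UTC)
	past, future := now.Add(-time.Hour), now.Add(time.Hour)
	return machineSetName("md", "template-v1", nil, now),
		machineSetName("md", "template-v1", &past, now),
		machineSetName("md", "template-v1", &future, now)
}

func main() {
	unset, due, notDue := exampleNames()
	fmt.Println(unset, due, notDue)
}
```

With the same template, the name only changes once rolloutAfter has passed, and the name length stays the same because no second hash is appended.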

Co-authored-by: Enxebre alberto.garcial@hotmail.com

Compared to the previous PR at #4596, I made the following changes:

  • Refactored the table tests and tried to cover all cases
  • Adjusted the generateMachineSetName func so that it does not append another hash to the name, because this would lengthen the machine object name, which could cause unexpected issues for providers / machines. Instead, I decided to recalculate the hash using the same information plus the rolloutAfter value.
  • The current value of MachineDeployment.Spec.RolloutAfter is now added to the MachineSet when it is created. This lets the sorting algorithm return the right MachineSet using the following sort criteria:
    1. New: > lastRolloutAnnotation
    2. < creationTimestamp
    3. < Name
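
The three sort criteria can be sketched like this. The machineSet struct is a simplified stand-in for the real MachineSet type, and placing MachineSets without the annotation after annotated ones is an assumption of this sketch, not something stated in the PR.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// machineSet is a simplified, hypothetical stand-in for a MachineSet.
type machineSet struct {
	Name         string
	Created      time.Time
	RolloutAfter *time.Time // parsed annotation value; nil if the annotation is absent
}

// sortMachineSets orders by: most recent rolloutAfter annotation first,
// then oldest creation timestamp, then name.
func sortMachineSets(list []machineSet) {
	sort.SliceStable(list, func(i, j int) bool {
		a, b := list[i], list[j]
		switch {
		case a.RolloutAfter != nil && b.RolloutAfter == nil:
			return true // assumption: annotated sets sort before unannotated ones
		case a.RolloutAfter == nil && b.RolloutAfter != nil:
			return false
		case a.RolloutAfter != nil && b.RolloutAfter != nil && !a.RolloutAfter.Equal(*b.RolloutAfter):
			return a.RolloutAfter.After(*b.RolloutAfter)
		}
		if !a.Created.Equal(b.Created) {
			return a.Created.Before(b.Created)
		}
		return a.Name < b.Name
	})
}

// exampleOrder sorts a small fixed list and returns the resulting names.
func exampleOrder() []string {
	t0 := time.Date(2022, time.August, 11, 12, 0, 0, 0, time.UTC)
	r1, r2 := t0.Add(30*time.Minute), t0.Add(90*time.Minute)
	list := []machineSet{
		{Name: "ms-b", Created: t0},
		{Name: "ms-a", Created: t0},
		{Name: "ms-c", Created: t0.Add(time.Hour), RolloutAfter: &r1},
		{Name: "ms-d", Created: t0.Add(2 * time.Hour), RolloutAfter: &r2},
	}
	sortMachineSets(list)
	names := make([]string, 0, len(list))
	for _, ms := range list {
		names = append(names, ms.Name)
	}
	return names
}

func main() {
	fmt.Println(exampleOrder())
}
```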

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #4536

Additional information

  • Current sorting algorithm:

    1. < creationTimestamp
    2. < Name
  • Table to determine all kinds of cases (I hope this does not cause more confusion than not having this info; it did help to find the correct implementation):

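    | # | Case | Equal(MS, TPL) | MD.RolloutAfter vs now | MD.RolloutAfter vs MS.CreationTimestamp | Result |
    |---|------|----------------|------------------------|------------------------------------------|--------|
    | 1 | A | no | < (irrelevant) | < (irrelevant) | create |
    | 2 | A | no | < (irrelevant) | > (irrelevant) | create |
    | 3 | A | no | > (irrelevant) | < (irrelevant) | create |
    | 4 | A | no | > (irrelevant) | > (irrelevant) | create |
    | 5 | B | yes | < | < | create |
    | 6 | C | yes | < | > | no-op |
    | 7 | D | yes | > | < (irrelevant) | no-op |
    | 8 | D | yes | > | > (irrelevant) | no-op |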

    Reduced table by Case:

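    | Case | Equal(MS, TPL) | MD.RolloutAfter vs now | MD.RolloutAfter vs MS.CreationTimestamp | Return Value |
    |------|----------------|------------------------|------------------------------------------|--------------|
    | A | false | - | - | nil / Create |
    | B | true | < | < | nil / Create |
    | C | true | < | > | MS / no-op |
    | D | true | > | - | MS / no-op |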

    Case description:

    • A: Create new MachineSet because there is no existing having an equivalent template
    • B: Create new MachineSet having the same template due to RolloutAfter
    • C: Keep old MachineSet which has an equal MachineTemplate because RolloutAfter was already done
    • D: Keep old MachineSet which has an equal MachineTemplate because RolloutAfter should be done in the future
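
The four cases can be sketched as a small decision function. This follows the case descriptions rather than the PR's code: decide and its parameters are hypothetical, and it assumes that case B means the matching MachineSet was created before rolloutAfter and case C means it was created afterwards.

```go
package main

import (
	"fmt"
	"time"
)

type action string

const (
	create action = "create"
	noOp   action = "no-op"
)

// decide is a hypothetical helper mapping the inputs of the table to a result.
// templateEqual corresponds to Equal(MS, TPL); a nil rolloutAfter means unset.
func decide(templateEqual bool, rolloutAfter *time.Time, msCreated, now time.Time) action {
	switch {
	case !templateEqual:
		return create // A: no MachineSet with an equivalent template
	case rolloutAfter == nil || !now.After(*rolloutAfter):
		return noOp // D: rollout is unset or should happen in the future
	case msCreated.Before(*rolloutAfter):
		return create // B: rollout is due and the MachineSet predates it
	default:
		return noOp // C: the rollout was already done
	}
}

// caseResults evaluates one representative input per case.
func caseResults() map[string]action {
	now := time.Date(2022, time.August, 11, 12, 0, 0, 0, time.UTC)
	past, future := now.Add(-time.Hour), now.Add(time.Hour)
	return map[string]action{
		"A": decide(false, nil, now.Add(-2*time.Hour), now),
		"B": decide(true, &past, now.Add(-2*time.Hour), now),
		"C": decide(true, &past, now.Add(-30*time.Minute), now),
		"D": decide(true, &future, now.Add(-2*time.Hour), now),
	}
}

func main() {
	r := caseResults()
	for _, c := range []string{"A", "B", "C", "D"} {
		fmt.Println(c, r[c])
	}
}
```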

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Aug 11, 2022
@k8s-ci-robot
Contributor

@chrischdi: This issue is currently awaiting triage.

If CAPI contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@chrischdi chrischdi changed the title ✨ Implement MachineDeployment rolloutAfter support ✨ [wip] Implement MachineDeployment rolloutAfter support Aug 11, 2022
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign neolit123 for approval by writing /assign @neolit123 in a comment. For more information see: The Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Aug 11, 2022
@chrischdi chrischdi force-pushed the pr-rollout-after branch 3 times, most recently from 246eeb0 to f4f2735 Compare August 11, 2022 16:05
@chrischdi chrischdi changed the title ✨ [wip] Implement MachineDeployment rolloutAfter support ✨ Implement MachineDeployment rolloutAfter support Aug 11, 2022
@chrischdi
Member Author

@chrischdi: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
|-----------|--------|---------|----------|---------------|
| pull-cluster-api-verify-main | f4f2735 | link | true | /test pull-cluster-api-verify-main |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Have to take a look at this :-)

@sbueringer
Member

sbueringer commented Aug 16, 2022

@vincepri @enxebre Given how long we spent on the previous PR, would be good to get a first opinion from your side.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 19, 2022
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 22, 2022
@chrischdi chrischdi force-pushed the pr-rollout-after branch 3 times, most recently from 55823ed to b94dc55 Compare August 27, 2022 18:42
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 20, 2022
@chrischdi
Member Author

/test help

@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 22, 2022
@k8s-ci-robot
Contributor

@chrischdi: The specified target(s) for /test were not found.
The following commands are available to trigger required jobs:

  • /test pull-cluster-api-build-main
  • /test pull-cluster-api-e2e-main
  • /test pull-cluster-api-test-main
  • /test pull-cluster-api-test-mink8s-main
  • /test pull-cluster-api-verify-main

The following commands are available to trigger optional jobs:

  • /test pull-cluster-api-apidiff-main
  • /test pull-cluster-api-e2e-full-main
  • /test pull-cluster-api-e2e-informing-ipv6-main
  • /test pull-cluster-api-e2e-informing-main
  • /test pull-cluster-api-e2e-workload-upgrade-1-25-latest-main

Use /test all to run the following jobs that were automatically triggered:

  • pull-cluster-api-apidiff-main
  • pull-cluster-api-build-main
  • pull-cluster-api-e2e-informing-ipv6-main
  • pull-cluster-api-e2e-informing-main
  • pull-cluster-api-e2e-main
  • pull-cluster-api-test-main
  • pull-cluster-api-test-mink8s-main
  • pull-cluster-api-verify-main

In response to this:

/test help


@chrischdi
Member Author

/test pull-cluster-api-apidiff-main
/test pull-cluster-api-e2e-full-main
/test pull-cluster-api-e2e-informing-ipv6-main
/test pull-cluster-api-e2e-informing-main
/test pull-cluster-api-e2e-workload-upgrade-1-25-latest-main

@vincepri
Member

From a quick glance, the current changes make sense to me, although these changes touch on the hashing code that @fabriziopandini was looking at for in place propagation of labels and annotations

@chrischdi
Member Author

> From a quick glance, the current changes make sense to me, although these changes touch on the hashing code that @fabriziopandini was looking at for in place propagation of labels and annotations

Fair 👍 so better hold this and adapt depending on what in place propagation may change.

@sbueringer
Member

sbueringer commented Sep 27, 2022

> > From a quick glance, the current changes make sense to me, although these changes touch on the hashing code that @fabriziopandini was looking at for in place propagation of labels and annotations
>
> Fair +1 so better hold this and adapt depending on what in place propagation may change.

Yup. +/- ideally consider what we want to do in this PR during implementation of in-place mutation so it fits nicely.

// see https://github.com/kubernetes/kubernetes/issues/40415
// Besides only considering MachineSets which have an equivalent MachineTemplateSpec, we choose the MachineSet
// which has the most recent RolloutAfter annotation set (if any) or as second criteria is the oldest one.
sort.Sort(MachineSetsByRolloutAfterAnnotationAndCreationTimestamp(msList))
Member

I was looking at this logic in the context of Node Label propagation / in-place upgrades, and I have noticed that this approach can cause turbulence in the Cluster because it picks one of the matching MS without taking into account where the machines are. So IMO the sort criteria should be modified to pick the MS with the most machines on it (*)

This could probably simplify the entire logic by dropping the annotation on the MS; the rollout would then be triggered by the if in the next for loop, which drops the MS if rolloutAfter is triggered.

(*) this could be a separate PR that we merge before this one
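
A rough sketch of the suggested selection rule, picking the matching MachineSet that currently holds the most machines. The machineSet struct, pickByReplicas, and the name-based tie-break are all assumptions made for illustration and are not part of this PR.

```go
package main

import "fmt"

// machineSet is a simplified, hypothetical stand-in for a MachineSet.
type machineSet struct {
	Name     string
	Replicas int32 // number of machines currently owned by the MachineSet
}

// pickByReplicas returns the MachineSet with the most machines among the
// candidates (all assumed to match the template). Ties are broken by name
// so that the choice is deterministic. The bool is false for an empty list.
func pickByReplicas(candidates []machineSet) (machineSet, bool) {
	if len(candidates) == 0 {
		return machineSet{}, false
	}
	best := candidates[0]
	for _, ms := range candidates[1:] {
		if ms.Replicas > best.Replicas || (ms.Replicas == best.Replicas && ms.Name < best.Name) {
			best = ms
		}
	}
	return best, true
}

func main() {
	ms, ok := pickByReplicas([]machineSet{
		{Name: "ms-a", Replicas: 1},
		{Name: "ms-c", Replicas: 3},
		{Name: "ms-b", Replicas: 3},
	})
	fmt.Println(ms.Name, ok)
}
```

Preferring the MachineSet that already owns most machines would keep the number of machines that need to move as small as possible, which is the turbulence concern raised above.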

@@ -254,6 +262,42 @@ func (r *Reconciler) getNewMachineSet(ctx context.Context, d *clusterv1.MachineD
return createdMS, err
}

func generateMachineSetName(d *clusterv1.MachineDeployment, now *metav1.Time) (string, string, error) {
Member

@fabriziopandini fabriziopandini Sep 29, 2022

Hash is currently used as:

  • UID to identify machines belonging to the MS
  • Adding a unique suffix to the MS set name

Given that, I'm really wondering if we should drop the current spew/hash logic and simply use a random string plus a check that verifies the random string is not already taken by an existing MS (for this MD). It seems the code could be re-entrant this way as well, and we could get rid of all this complex logic (*) ...

@vincepri @enxebre @sbueringer opinions?

(*) this could be a separate PR that we merge before this one
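
The random-suffix idea could look roughly like the sketch below. Everything here is hypothetical: uniqueName, randomSuffix, the suffix alphabet and length, and the retry limit are illustration choices, and the set of taken names stands in for the MachineSets that already exist for the MachineDeployment.

```go
package main

import (
	"fmt"
	"math/rand"
)

const suffixChars = "abcdefghijklmnopqrstuvwxyz0123456789"

// randomSuffix returns n random characters from suffixChars.
func randomSuffix(n int) string {
	b := make([]byte, n)
	for i := range b {
		b[i] = suffixChars[rand.Intn(len(suffixChars))]
	}
	return string(b)
}

// uniqueName generates "<mdName>-<suffix>" and retries while the name is
// already taken by an existing MachineSet. gen is injected so that the
// behaviour can be tested deterministically.
func uniqueName(mdName string, taken map[string]bool, gen func() string, maxTries int) (string, error) {
	for i := 0; i < maxTries; i++ {
		name := mdName + "-" + gen()
		if !taken[name] {
			return name, nil
		}
	}
	return "", fmt.Errorf("no free MachineSet name for %q after %d tries", mdName, maxTries)
}

func main() {
	taken := map[string]bool{"md-abcde": true}
	name, err := uniqueName("md", taken, func() string { return randomSuffix(5) }, 10)
	fmt.Println(name, err)
}
```

Because the name no longer depends on the template contents, a rollout triggered by rolloutAfter would not need any special handling to avoid name clashes.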

Member

@sbueringer sbueringer Nov 29, 2022

@fabriziopandini Sorry missed the mention somehow.

Sounds fine to me, assuming we can make this re-entrant (I didn't look at the code in detail to see how this would be achieved).

+100 to making this a separate PR independent of this work and the propagation work

Would be nice to get rid of the hash early in the v1.4 cycle to give us time to discover potential side effects

@k8s-ci-robot
Contributor

@chrischdi: PR needs rebase.


@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 28, 2022
@k8s-ci-robot
Contributor

@chrischdi: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
|-----------|--------|---------|----------|---------------|
| pull-cluster-api-test-main | 50ece3b | link | true | /test pull-cluster-api-test-main |
| pull-cluster-api-test-mink8s-main | 50ece3b | link | true | /test pull-cluster-api-test-mink8s-main |
| pull-cluster-api-e2e-main | 50ece3b | link | true | /test pull-cluster-api-e2e-main |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.


@chrischdi
Member Author

This is gonna be replaced by #7053 so closing in favor of it.

@chrischdi chrischdi closed this Mar 2, 2023
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add support for RolloutAfter to MachineDeployments
6 participants