Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Better Detect when Machines in MachinePool VMSS are not running on the latest model #2975

Conversation

primeroz
Copy link
Contributor

@primeroz primeroz commented Jan 2, 2023

What type of PR is this?

/kind feature

What this PR does / why we need it:
Uses the value for latestModelApplied as returned by Azure to determine if the Machine is up-to-date or not.

Capz will Delete and replace nodes that are not running the latest mode, fixing the issue of rolling out a change to the vmSize or Kubernetes Version to a set of nodes in a machinepool.
I think this is what we want based on the documentation here which at the moment is not true ( at least in my testing )

Which issue(s) this PR fixes :
Fixes # #2972

Special notes for your reviewer:

  • I updated some tests to include the new field but i did not add any new test.
  • I marked this as feature but maybe it should be a bug since at the moment the controller does not do what the docs says it should do when spec of machinepool change ?

TODOs:

  • squashed commits
  • includes documentation
  • adds unit tests

Release note:

Improve detecting when a `azureMachinePoolMachine` is not running the `latest version` of the spec to allow CAPZ to fully refresh the nodes in a `machinePool` when change is required.  

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/feature Categorizes issue or PR as related to a new feature. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Jan 2, 2023
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign jackfrancis for approval by writing /assign @jackfrancis in a comment. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot
Copy link
Contributor

Hi @primeroz. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jan 2, 2023
@primeroz primeroz force-pushed the chore/DetectWhenMachinesInVMSSSAreNotLatestModel branch from 9437d45 to 1506cce Compare January 2, 2023 17:22
@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Jan 2, 2023
@primeroz primeroz marked this pull request as ready for review January 2, 2023 17:25
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 2, 2023
Copy link
Contributor

@CecileRobertMichon CecileRobertMichon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/ok-to-test
/assign @mboersma

there is a good chance this conflicts with #999 so we might need to do some rebasing

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jan 3, 2023
@primeroz
Copy link
Contributor Author

primeroz commented Jan 4, 2023

there is a good chance this conflicts with #999 so we might need to do some rebasing

Yes , i see.

I will join the Office hours in case you want to talk about this ?

On one side this is kind of a blocker for us to be able to use MachinePools but on the other side if such a big refactor of MachinePool codebase is coming through that other issue it might be worth waiting for it ?

@mboersma
Copy link
Contributor

mboersma commented Jan 5, 2023

I think it will be possible to rebase this around #2813, which is a priority for the next release and should merge soon. I'm happy to help with that if it's tricky @primeroz.

@primeroz
Copy link
Contributor Author

primeroz commented Jan 5, 2023

@mboersma it sounds good, i still want to do some testing anyway , when #2813 is merged i will either rebase or rebranch/rewrite ( which might just be quicker :) )

@primeroz
Copy link
Contributor Author

primeroz commented Jan 5, 2023

@mboersma while looking at #2813 i came up with a question about the approach in this PR

This change i did is only making the detection of nodes that are not running the latestmodel more precise and relies on deleting AzureMachinePoolMachine instances in the MachinePoolScope to replace them one at the time . this does not take into account the surge settings of the rollout strategy

while testing before writing this PR i can see that when a VMSS is patched there is a surge that only replace the number of nodes defined in the rollout strategy instead of replacing all the nodes in the pool ( which is the original issue i opened ) and it seems to be handled here

To my understanding this surge control code does not change in #2813 and will do exactly the same thing.
Do you think it make more sense to change this surge control code to replace all nodes , using the rollout strategy, when a VMSS is patched rather than relying only on detecting when nodes are not running the latestmodel ?

@jackfrancis jackfrancis added this to the next milestone Jan 5, 2023
@primeroz
Copy link
Contributor Author

primeroz commented Jan 5, 2023

note to self, during Office Hour it was brought up this could clash with the join token issue . I will bring this up in slack for discussion to understand what is the best way to proceed with this

related to kubernetes-sigs/cluster-api#7717 and #2683 ?

@primeroz
Copy link
Contributor Author

I am closing this PR for now.

I am not so sure this is the right approach so i will write something up and post it into Slack for feedback before picking this up again.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants