
Add support for FailureDomains to AzureMachinePool #667

Conversation

fiunchinho
Contributor

@fiunchinho fiunchinho commented Jun 2, 2020

What this PR does / why we need it:
It was not possible to choose FailureDomains when creating a MachinePool because it used a custom AzureMachineTemplateSpec. This PR changes the code so that MachinePool uses the same AzureMachineTemplateSpec as the rest of the codebase.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #663

Special notes for your reviewer:

Please confirm that if this PR changes any image versions, then that's the sole change this PR makes.

Release note:

Added `FailureDomains` field to `AzureMachinePoolSpec`

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Jun 2, 2020
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: fiunchinho
To complete the pull request process, please assign justaugustus
You can assign the PR to them by writing /assign @justaugustus in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot
Contributor

Welcome @fiunchinho!

It looks like this is your first PR to kubernetes-sigs/cluster-api-provider-azure 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/cluster-api-provider-azure has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added area/provider/azure Issues or PRs related to azure provider needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. labels Jun 2, 2020
@k8s-ci-robot
Contributor

Hi @fiunchinho. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Jun 2, 2020
@fiunchinho fiunchinho force-pushed the machinepool-machinetemplate branch from 4380596 to cb9a19f Compare June 2, 2020 07:40
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 2, 2020
@fiunchinho fiunchinho force-pushed the machinepool-machinetemplate branch from cb9a19f to 238f3cd Compare June 2, 2020 08:00
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jun 2, 2020
@nader-ziada
Contributor

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jun 2, 2020
@fiunchinho
Contributor Author

Build is failing with

  Incompatible changes:
  - AzureMachinePoolSpec.Template: changed from AzureMachineTemplate to sigs.k8s.io/cluster-api-provider-azure/api/v1alpha3.AzureMachineSpec
  - AzureMachineTemplate: removed

Not sure how to solve it or what it means. I need directions, please.

@nader-ziada
Contributor

@fiunchinho I believe this is informational, to bring attention to the fact that there are breaking changes to the API.

@CecileRobertMichon
Contributor

/hold

@devigned @juan-lee I remember we had a discussion about this during machine pool implementation, what was the reasoning for using a different spec?

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 2, 2020
@devigned
Contributor

devigned commented Jun 2, 2020

I believe this was the choice since AzureMachineSpec is likely to have more settings in the VM-specific type. One example: in AzureMachinePool, FailureDomains (AZs) would likely be set on the AMP rather than on the AzureMachineSpec, since they are provider controlled.

Another problem would be OSDisk. The name of the OSDisk is not honored in VirtualMachineScaleSets. If you specify a name, the PUT to VMSS fails.

I believe there were other concerns where the template would differ.

Should there be some base structure that is shared? Perhaps. Though I would prefer a little copy-paste of data structures rather than more complex composition just to reduce lines of code. My 2¢.

@fiunchinho
Contributor Author

fiunchinho commented Jun 3, 2020

So instead of removing AzureMachineTemplate from azuremachinepools_types.go, should I add a FailureDomain field to it, or rather to AzureMachinePoolSpec?

@devigned
Contributor

devigned commented Jun 3, 2020

I think FailureDomains should be ~~[]string~~ clusterv1.FailureDomains on the AzureMachinePoolSpec, rather than the AzureMachineTemplate, since the underlying provider will decide how to balance machines across FDs.

FailureDomains on the AzureMachinePoolSpec would map to Zones: []string in the VMSS REST API.

Does this make sense to folks?
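A minimal Go sketch of the mapping described above, using simplified stand-in types as an assumption (the real AzureMachinePoolSpec and VMSS types live in the CAPZ codebase and carry many more fields):

```go
package main

import "fmt"

// AzureMachinePoolSpec is a simplified stand-in for the CAPZ type.
// FailureDomains, as proposed, holds the availability zones the
// scale set should span.
type AzureMachinePoolSpec struct {
	FailureDomains []string
}

// VMSSSpec is a simplified stand-in; Zones corresponds to the
// compute.VirtualMachineScaleSet.Zones property in the VMSS REST API.
type VMSSSpec struct {
	Zones []string
}

// toVMSSSpec copies the pool's failure domains into the scale set's
// zones, as the discussion suggests.
func toVMSSSpec(amp AzureMachinePoolSpec) VMSSSpec {
	zones := make([]string, len(amp.FailureDomains))
	copy(zones, amp.FailureDomains)
	return VMSSSpec{Zones: zones}
}

func main() {
	pool := AzureMachinePoolSpec{FailureDomains: []string{"1", "2", "3"}}
	fmt.Println(toVMSSSpec(pool).Zones) // [1 2 3]
}
```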

@CecileRobertMichon
Contributor

I think FailureDomains should be []string on the AzureMachinePoolSpec, rather than the AzureMachineTemplate since the underlying provider will decide how to balance machines across FDs.
FailureDomains on the AzureMachinePoolSpec would map to Zones: []string in the VMSS REST API.
Does this make sense to folks?

We definitely want to leverage Scale Set AZ placement rather than defining our own logic for placing individual machines in zones; that was one of the motivations for implementing MachinePools / VMSS in the first place. Should it be of type clusterv1.FailureDomains to match the existing failure domain type for AzureCluster, though?

I think the default behavior needs to try to spread instances on all the available failure domains (from the AzureCluster status), and only use the zones defined in the AzureMachinePoolSpec for explicit placement, as described in https://github.com/kubernetes-sigs/cluster-api-provider-azure/blob/master/docs/topics/failure-domains.md#L17.

We also had a discussion in the CAPZ office hours with @richardcase about changing the existing logic and not needing the field in AzureMachine, we need to follow up on that.
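The defaulting behavior described above can be sketched as follows; the defaultZones helper and its parameters are hypothetical names for illustration (the real reconciliation logic would read the failure domains from the AzureCluster status):

```go
package main

import "fmt"

// defaultZones sketches the proposed defaulting: if the
// AzureMachinePoolSpec names zones explicitly, honor them for explicit
// placement; otherwise spread across all failure domains the cluster
// reports as available.
func defaultZones(explicit, clusterFailureDomains []string) []string {
	if len(explicit) > 0 {
		return explicit
	}
	return clusterFailureDomains
}

func main() {
	// No explicit placement: spread across everything the cluster reports.
	fmt.Println(defaultZones(nil, []string{"1", "2", "3"})) // [1 2 3]
	// Explicit placement wins.
	fmt.Println(defaultZones([]string{"2"}, []string{"1", "2", "3"})) // [2]
}
```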

@devigned
Contributor

devigned commented Jun 3, 2020

@CecileRobertMichon thank you for the correction and further explanation!

@k8s-ci-robot k8s-ci-robot removed the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jun 4, 2020
@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jun 8, 2020
@fiunchinho fiunchinho force-pushed the machinepool-machinetemplate branch from ade8919 to d66a625 Compare June 8, 2020 15:41
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Jun 8, 2020
@fiunchinho fiunchinho force-pushed the machinepool-machinetemplate branch from d66a625 to 11dba83 Compare June 8, 2020 15:47
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 8, 2020
@fiunchinho fiunchinho force-pushed the machinepool-machinetemplate branch from 11dba83 to bd20079 Compare June 8, 2020 16:04
@fiunchinho
Contributor Author

Submitted kubernetes-sigs/cluster-api#3157

@fiunchinho
Contributor Author

Given that kubernetes-sigs/cluster-api#3157 gets merged, any feedback about the implementation in this PR?

@devigned
Contributor

failureDomain (string): the string identifier of the failure domain the instance is running in for the purposes of backwards compatibility and migrating to the v1alpha3 FailureDomain support (where FailureDomain is specified in Machine.Spec.FailureDomain). This field is meant to be temporary to aid in migration of data that was previously defined on the provider type and providers will be expected to remove the field in the next version that provides breaking API changes, favoring the value defined on Machine.Spec.FailureDomain instead. If supporting conversions from previous types, the provider will need to support a conversion from the provider-specific field that was previously used to the failureDomain field to support the automated migration path.

via Machine Infra Provider Spec

With the above guidance in mind, I would expect the following pending kubernetes-sigs/cluster-api#3157.

FailureDomains []string on MachinePool would be mapped into a Zones []string on the VMSSSpec, which would then be used to set compute.VirtualMachineScaleSet.Zones.

The compute.VirtualMachineScaleSet.Zones property is immutable after creation in Azure. If a change occurs to MachinePool.Zones after the AzureMachinePool has been created, then the AzureMachinePool and the MachinePool should go into a failed state with an error message indicating the reason for the failure.

@CecileRobertMichon, what do you think about having Zones on the status for AzureMachinePool since zones can only be specified on create?

Anyone have any other thoughts or fill in any blanks I missed?

@CecileRobertMichon
Contributor

CecileRobertMichon commented Jun 15, 2020

The compute.VirtualMachineScaleSet.Zones property is immutable after creation in Azure. If a change occurs to MachinePool.Zones after the AzureMachinePool has been created, then the AzureMachinePool and the MachinePool should go into a failed state with an error message indicating the reason for the failure.

There should be a webhook validation on Update() that doesn't allow changes to FailureDomains, if this applies to all providers (do we know of any cases where failure domains would be mutable?); that way the machine pool never goes into a failed state.

@CecileRobertMichon, what do you think about having Zones on the status for AzureMachinePool since zones can only be specified on create?

That makes sense to me, but we don't do this for Machines currently; should we be consistent and do it for both? Maybe as a follow-up? Keep in mind that AZs aren't supported in every region, so that field won't always be set.
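A minimal sketch of such an immutability check; validateUpdate is a hypothetical helper name, and a real CAPZ webhook would implement the controller-runtime validation interfaces on the AzureMachinePool type rather than take bare slices:

```go
package main

import (
	"errors"
	"fmt"
	"reflect"
)

// validateUpdate rejects any change to FailureDomains on update, so
// the pool never reaches a failed state because of the immutable
// compute.VirtualMachineScaleSet.Zones property.
func validateUpdate(oldFDs, newFDs []string) error {
	if !reflect.DeepEqual(oldFDs, newFDs) {
		return errors.New("spec.failureDomains is immutable after creation")
	}
	return nil
}

func main() {
	fmt.Println(validateUpdate([]string{"1", "2"}, []string{"1", "2"})) // <nil>
	fmt.Println(validateUpdate([]string{"1", "2"}, []string{"1"}))      // prints the immutability error
}
```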

@devigned
Contributor

That makes sense to me, but we don't do this for Machines currently; should we be consistent and do it for both? Maybe as a follow-up? Keep in mind that AZs aren't supported in every region, so that field won't always be set.

The more I think about this, the less value I think it provides. If the MachinePoolSpec says it is in those zones and the infrastructure is in a succeeded state, then I think we are right to assume the zones in the spec are reconciled to the infrastructure.

There should be a webhook validation on Update() that doesn't allow changes to FailureDomains if this applies to all providers (do we know of any cases where failure domains would be mutable?) that way it the machine pool never goes into failed state.

This would be great if we could say FailureDomains are immutable for all, but I don't think we can. For example, AWS Auto Scaling groups can add zones.

@CecileRobertMichon
Contributor

@fiunchinho would you like to move forward with this PR now that capi v0.3.7 is in? It looks like it's very close

@k8s-ci-robot
Contributor

@fiunchinho: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 6, 2020
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 4, 2021
@k8s-ci-robot
Contributor

@fiunchinho: The following tests failed, say /retest to rerun all failed tests:

Test name Commit Details Rerun command
pull-cluster-api-provider-azure-build bd20079 link /test pull-cluster-api-provider-azure-build
pull-cluster-api-provider-azure-test bd20079 link /test pull-cluster-api-provider-azure-test
pull-cluster-api-provider-azure-e2e bd20079 link /test pull-cluster-api-provider-azure-e2e
pull-cluster-api-provider-azure-e2e-windows bd20079 link /test pull-cluster-api-provider-azure-e2e-windows

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Feb 6, 2021
@fiunchinho
Contributor Author

Superseded by #1180

@fiunchinho fiunchinho closed this Feb 17, 2021
This pull request was closed.