
Adding temporary multizone-nodepool support for CAS #7013

Merged

Conversation

@aagusuab (Contributor) commented Jul 8, 2024

What type of PR is this?

What this PR does / why we need it:

This PR prevents errors when Cluster Autoscaler scales a multi-zone nodepool from zero.
This is necessary because, when we currently create a node template from a multi-zone nodepool, we set the zone label "topology.kubernetes.io/zone" to a composite value such as "eastus-1__eastus-2__eastus-3".
However, this composite label is not recognized by some affinities.
With this PR, Cluster Autoscaler creates node templates using a single zone picked at random from the zones the VMSS spans. This zone is not necessarily the zone the VMSS assigns to its VMs; the VMSS decides that separately. A sketch of the idea follows.
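A minimal, hypothetical sketch of the before/after labeling behavior (plain Go, not the actual cluster-autoscaler code; the function names are illustrative, while the `__` separator and the region-zone format come from the description above):

```go
// Hypothetical sketch of the label change; the real logic lives in
// cluster-autoscaler/cloudprovider/azure/azure_template.go.
package main

import (
	"fmt"
	"math/rand"
	"strings"
)

// compositeZoneLabel mimics the old behavior: every zone of the VMSS is
// joined into one label value, e.g. "eastus-1__eastus-2__eastus-3",
// which zone-based affinities cannot match.
func compositeZoneLabel(region string, zones []string) string {
	qualified := make([]string, 0, len(zones))
	for _, z := range zones {
		qualified = append(qualified, fmt.Sprintf("%s-%s", region, z))
	}
	return strings.Join(qualified, "__")
}

// randomZoneLabel mimics the new behavior: the template carries a single
// zone picked at random from the zones the VMSS spans. The VMSS itself
// still decides where real VMs land.
func randomZoneLabel(region string, zones []string) string {
	return fmt.Sprintf("%s-%s", region, zones[rand.Intn(len(zones))])
}

func main() {
	zones := []string{"1", "2", "3"}
	fmt.Println("old template label:", compositeZoneLabel("eastus", zones))
	fmt.Println("new template label:", randomZoneLabel("eastus", zones))
}
```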

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot added labels: do-not-merge/work-in-progress, area/cluster-autoscaler (Jul 8, 2024)
@k8s-ci-robot added labels: area/provider/azure, cncf-cla: yes, needs-ok-to-test (Jul 8, 2024)
@k8s-ci-robot (Contributor) commented

Hi @aagusuab. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot added label: size/M (Jul 8, 2024)
@aagusuab marked this pull request as ready for review (Jul 8, 2024 16:23)
@k8s-ci-robot removed label: do-not-merge/work-in-progress (Jul 8, 2024)
@nojnhuh (Contributor) commented Jul 9, 2024

/ok-to-test

@k8s-ci-robot added label: ok-to-test; removed label: needs-ok-to-test (Jul 9, 2024)
@willie-yao left a comment

Looks good, just a minor nit.

Comment on lines 63 to 67
```go
//Picks random zones for Multi-zone nodepool when scaling from zero.
//This random zone will not be the same as the zone of the VMSS that is being created, the purpose of creating
//the node template with random zone is to initiate scaling from zero on the multi-zone nodepool.
//Note that the if the customer is to have some pod affinity picking exact zone, this logic won't work.
//For now, discourage the customers from using podAffinity to pick the availability zones.
```


Suggested change

```diff
-//Picks random zones for Multi-zone nodepool when scaling from zero.
-//This random zone will not be the same as the zone of the VMSS that is being created, the purpose of creating
-//the node template with random zone is to initiate scaling from zero on the multi-zone nodepool.
-//Note that the if the customer is to have some pod affinity picking exact zone, this logic won't work.
-//For now, discourage the customers from using podAffinity to pick the availability zones.
+// Picks random zones for Multi-zone nodepool when scaling from zero.
+// This random zone will not be the same as the zone of the VMSS that is being created, the purpose of creating
+// the node template with random zone is to initiate scaling from zero on the multi-zone nodepool.
+// Note that the if the customer is to have some pod affinity picking exact zone, this logic won't work.
+// For now, discourage the customers from using podAffinity to pick the availability zones.
```

A Member commented:

lol

@nojnhuh (Contributor) commented Jul 10, 2024

> This PR prevents errors when Cluster Autoscaler scales a multi-zone nodepool from zero.

Just for my own understanding, can you give an example of one of these errors?

> This is necessary because, when we currently create a node template from a multi-zone nodepool, we set the zone label "topology.kubernetes.io/zone" to a composite value such as "eastus-1__eastus-2__eastus-3". However, this composite label is not recognized by some affinities.

Also to help me understand, which affinities exactly are you referring to here?

@aagusuab (Contributor, Author) commented Jul 22, 2024

> > This PR prevents errors when Cluster Autoscaler scales a multi-zone nodepool from zero.
>
> Just for my own understanding, can you give an example of one of these errors?
>
> > This is necessary because, when we currently create a node template from a multi-zone nodepool, we set the zone label "topology.kubernetes.io/zone" to a composite value such as "eastus-1__eastus-2__eastus-3". However, this composite label is not recognized by some affinities.
>
> Also to help me understand, which affinities exactly are you referring to here?

Sure, @nojnhuh!
One of the customers tried to pin a node to a specific availability zone by setting:
topology.disk.csi.azure.com/zone=zone1

This caused an error because Cluster Autoscaler currently folds all zones into a single label value:
topology.disk.csi.azure.com/zone=zone1__zone2__zone3 (to account for multiple zones)

and the requested value "zone1" never matches that composite multi-zone value.

Therefore, we are removing this zone1__zone2__zone3 value (and any similar availability-zone setup) and setting topology.disk.csi.azure.com/zone to a single random zone when the node template is created.
This does not affect the actual node label: the cloud provider, not the node template, determines the zone a real node is labeled with. A sketch of the mismatch follows.
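To make the mismatch concrete, here is a small sketch using k8s.io/apimachinery's label matching. The label key and values are taken from the comment above; this is an illustration, not the autoscaler's actual code:

```go
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/labels"
)

func main() {
	// What the customer effectively asked for via their zone setting.
	want := labels.SelectorFromSet(labels.Set{
		"topology.disk.csi.azure.com/zone": "zone1",
	})

	// Old node template: all zones folded into one label value.
	composite := labels.Set{"topology.disk.csi.azure.com/zone": "zone1__zone2__zone3"}

	// New node template: a single zone picked from the VMSS zones.
	single := labels.Set{"topology.disk.csi.azure.com/zone": "zone1"}

	fmt.Println(want.Matches(composite)) // false: "zone1" never equals the composite value
	fmt.Println(want.Matches(single))    // true: scale-from-zero can be simulated
}
```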

@k8s-ci-robot added label: needs-rebase (Jul 31, 2024)
@aagusuab force-pushed the multi-nodepool-upstream-fix branch from cfeeea7 to c1d30e7 (July 31, 2024 18:41)
@k8s-ci-robot removed label: needs-rebase (Jul 31, 2024)
```go
//This random zone will not be the same as the zone of the VMSS that is being created, the purpose of creating
//the node template with random zone is to initiate scaling from zero on the multi-zone nodepool.
//Note that the if the customer is to have some pod affinity picking exact zone, this logic won't work.
//For now, discourage the customers from using podAffinity to pick the availability zones.
```
A Contributor commented:

> For now, discourage the customers from using podAffinity to pick the availability zones

"... if using multi-zone nodepool" (but probably clear from context)

@k8s-ci-robot (Contributor) commented

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: aagusuab, Bryce-Soghigian, comtalyst, tallaxes

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added label: approved (Aug 1, 2024)
@tallaxes (Contributor) commented Aug 1, 2024

/lgtm

@k8s-ci-robot added label: lgtm (Aug 1, 2024)
@k8s-ci-robot merged commit 3e09d37 into kubernetes:master (Aug 1, 2024)
7 checks passed
@aagusuab (Contributor, Author) commented

/cherry-pick cluster-autoscaler-release-1.27
/cherry-pick cluster-autoscaler-release-1.28
/cherry-pick cluster-autoscaler-release-1.29
/cherry-pick cluster-autoscaler-release-1.30

@k8s-infra-cherrypick-robot

@aagusuab: only kubernetes org members may request cherry picks. If you are already part of the org, make sure to change your membership to public. Otherwise you can still do the cherry-pick manually.

In response to this:

> /cherry-pick cluster-autoscaler-release-1.27
> /cherry-pick cluster-autoscaler-release-1.28
> /cherry-pick cluster-autoscaler-release-1.29
> /cherry-pick cluster-autoscaler-release-1.30

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@comtalyst (Contributor) commented

/cherry-pick cluster-autoscaler-release-1.27
/cherry-pick cluster-autoscaler-release-1.28
/cherry-pick cluster-autoscaler-release-1.29
/cherry-pick cluster-autoscaler-release-1.30

@k8s-infra-cherrypick-robot

@comtalyst: #7013 failed to apply on top of branch "cluster-autoscaler-release-1.27":

```
Applying: Adding temporary multizone-nodepool support for CAS
Using index info to reconstruct a base tree...
M	cluster-autoscaler/cloudprovider/azure/azure_template.go
M	cluster-autoscaler/cloudprovider/azure/azure_template_test.go
Falling back to patching base and 3-way merge...
Auto-merging cluster-autoscaler/cloudprovider/azure/azure_template_test.go
CONFLICT (content): Merge conflict in cluster-autoscaler/cloudprovider/azure/azure_template_test.go
Auto-merging cluster-autoscaler/cloudprovider/azure/azure_template.go
CONFLICT (content): Merge conflict in cluster-autoscaler/cloudprovider/azure/azure_template.go
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0001 Adding temporary multizone-nodepool support for CAS
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".
```

In response to this:

> /cherry-pick cluster-autoscaler-release-1.27
> /cherry-pick cluster-autoscaler-release-1.28
> /cherry-pick cluster-autoscaler-release-1.29
> /cherry-pick cluster-autoscaler-release-1.30

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@comtalyst (Contributor) commented

/cherry-pick cluster-autoscaler-release-1.30
/cherry-pick cluster-autoscaler-release-1.29
/cherry-pick cluster-autoscaler-release-1.28

@k8s-infra-cherrypick-robot

@comtalyst: new pull request created: #7190

In response to this:

> /cherry-pick cluster-autoscaler-release-1.30
> /cherry-pick cluster-autoscaler-release-1.29
> /cherry-pick cluster-autoscaler-release-1.28

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@comtalyst (Contributor) commented

/cherry-pick cluster-autoscaler-release-1.29

@comtalyst (Contributor) commented

/cherry-pick cluster-autoscaler-release-1.28

@k8s-infra-cherrypick-robot

@comtalyst: new pull request created: #7192

In response to this:

> /cherry-pick cluster-autoscaler-release-1.29

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-infra-cherrypick-robot

@comtalyst: #7013 failed to apply on top of branch "cluster-autoscaler-release-1.28":

```
Applying: Adding temporary multizone-nodepool support for CAS
Using index info to reconstruct a base tree...
M	cluster-autoscaler/cloudprovider/azure/azure_template.go
M	cluster-autoscaler/cloudprovider/azure/azure_template_test.go
Falling back to patching base and 3-way merge...
Auto-merging cluster-autoscaler/cloudprovider/azure/azure_template_test.go
CONFLICT (content): Merge conflict in cluster-autoscaler/cloudprovider/azure/azure_template_test.go
Auto-merging cluster-autoscaler/cloudprovider/azure/azure_template.go
CONFLICT (content): Merge conflict in cluster-autoscaler/cloudprovider/azure/azure_template.go
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0001 Adding temporary multizone-nodepool support for CAS
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".
```

In response to this:

> /cherry-pick cluster-autoscaler-release-1.28

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
