Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Azure Provider HasInstance implementation #6956

Open
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

Bryce-Soghigian
Copy link
Member

What type of PR is this?

/kind bug
/kind regression

What this PR does / why we need it:

CA fails to scale up or cancel in progress schaledown when there are unschedulable pods. Stealing this description from the aws provider implementation.

I think the description of #5054 (comment) explains it well:
...original intent of determining the deleted nodes was incorrect, which led to the issues reported by other users. The nodes tainted with ToBeDeleted were misidentified as Deleted instead of Ready/Unready, which caused a miscalculation of the node being included as Upcoming. This caused problems described in #3949 and #4456.

Which issue(s) this PR fixes:

Special notes for your reviewer:

Does this PR introduce a user-facing change?


Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot
Copy link
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/bug Categorizes issue or PR as related to a bug. kind/regression Categorizes issue or PR as related to a regression from a prior release. labels Jun 21, 2024
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: Bryce-Soghigian
Once this PR has been reviewed and has the lgtm label, please assign nilo19 for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jun 21, 2024
@k8s-ci-robot k8s-ci-robot requested a review from nilo19 June 21, 2024 16:43
@k8s-ci-robot k8s-ci-robot added area/provider/azure Issues or PRs related to azure provider size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jun 21, 2024
@Bryce-Soghigian
Copy link
Member Author

/test all

@Bryce-Soghigian Bryce-Soghigian marked this pull request as ready for review June 21, 2024 18:38
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 21, 2024
Copy link

@tallaxes tallaxes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall LGTM, with minor feedback, and some comments re implication of using cache.
How was this tested? Can we add unit tests? E2E tests?

func (m *azureCache) HasInstance(providerID string) bool {
m.mutex.Lock()
defer m.mutex.Unlock()
_, ok := m.instanceToNodeGroup[azureRef{Name: providerID}]

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are potential issues with case sensitivity here; should use getInstanceFromCache which ignores case?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using cache to respond here makes sense to me in general. Is there anything else we can say about the consequences relying on (two-level?) cache? Would be good to think through the implications of temporarily falsely reporting instance does not exist (if it is not in cache yet), temporarily falsely reporting instance does exist (if it is in cache still), and potential max delay (based on our cache refresh configuration, I presume ...)

Comment on lines 135 to 137
// NOTE: ng.Nodes() gets its values from a direct list call against vmss
// this does not represent nodes that have joined the apiserver just instances
// the nodegroup owns.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is somewhat misleading because a) it could be a non-VMSS nodepool and b) it caches listed VMs (so no quite direct). So maybe something like "ng.Nodes() returns cloud provider instances, not Kubernetes Nodes"? (Though to some extent this is captured in the Nodes function description).

Comment on lines -177 to -178
// fetch all the resources since CAS may be operating on mixed nodepools
// including both VMSS and VMs pools

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something like this comment does belong in front of the function. Maybe "Fetch all resources (VMSS, VMSS VMs, standalone VMs) since autoscaler may be operating on mixed nodepools, including both VMSS and VMs pools."

func (azure *AzureCloudProvider) HasInstance(*apiv1.Node) (bool, error) {
return true, cloudprovider.ErrNotImplemented
func (azure *AzureCloudProvider) HasInstance(node *apiv1.Node) (bool, error) {
ng, err := azure.azureManager.azureCache.FindForInstance(&azureRef{Name: node.Spec.ProviderID}, azure.azureManager.azureCache.vmType)
Copy link

@tallaxes tallaxes Jun 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be reduced to

Suggested change
ng, err := azure.azureManager.azureCache.FindForInstance(&azureRef{Name: node.Spec.ProviderID}, azure.azureManager.azureCache.vmType)
ng, err := azure.azureManager.GetNodeGroupForInstance(&azureRef{Name: node.Spec.ProviderID})

or even further to just

Suggested change
ng, err := azure.azureManager.azureCache.FindForInstance(&azureRef{Name: node.Spec.ProviderID}, azure.azureManager.azureCache.vmType)
ng, err := azure.NodeGroupForNode(node)

But this brings up several questions. What exactly are the semantics of HasInstance and NodeGroupForNode? If HasInstance can be correctly and reliably implemented by NodeGroupForNode (or is it a special property of Azure provider?), why would HasInstance be needed in the CP interface to begin with? NodeGroupForNode returning nil is supposed to mean "instance should not be handled by autoscaler", not that it does not exist. Using it (or equivalent logic) for implementing HasInstance would mean NG node will always be reported as existing, non-NG node will always be reported as deleted; which seems wrong, need to think this through.

@Bryce-Soghigian Bryce-Soghigian force-pushed the bsoghigian/azure/has-instance-impl branch from f0d3407 to ea410de Compare July 12, 2024 23:37
@k8s-ci-robot
Copy link
Contributor

@Bryce-Soghigian: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-cluster-autoscaler-e2e-azure ea410de link false /test pull-cluster-autoscaler-e2e-azure

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/cluster-autoscaler area/provider/azure Issues or PRs related to azure provider cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. kind/regression Categorizes issue or PR as related to a regression from a prior release. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants