[BUG] Cluster autoscaler bug requires Azure specific implementation to resolve #4286

Open
claassen opened this issue May 14, 2024 · 1 comment
Assignees
Labels
action-required bug cluster-autoscaler Scale and Performance Use this for any AKS scale or performance related issue

Comments

@claassen

Describe the bug

There is an issue in cluster-autoscaler, described in kubernetes/autoscaler#4456, that has been fixed for some cloud providers. Fixing it on AKS requires implementing the HasInstance method of the AzureCloudProvider.

The gist of the issue is that when cluster-autoscaler scales down a node, pods can prevent the node from being completely drained and removed (e.g. due to long termination grace periods). The node is then left in a state where cluster-autoscaler still counts it towards the number of available nodes and so does not scale up a new node. New pods cannot be scheduled on the old node, however, because it is tainted with ToBeDeletedByClusterAutoscaler. As a result, pods get stuck in Pending, and cluster-autoscaler neither scales up a new node for them nor cancels the scale-down of the tainted node.
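
To make the mismatch concrete, here is a toy model of the two ways of counting nodes described above. It is only an illustration: the `node` type and the `countRegistered`, `countSchedulable`, and `needsScaleUp` helpers are hypothetical names made up for this sketch and are not cluster-autoscaler code.

```go
package main

// Toy model only: simplified stand-ins, not cluster-autoscaler's real types.
const toBeDeletedTaint = "ToBeDeletedByClusterAutoscaler"

// node is a hypothetical, cut-down view of a Kubernetes node.
type node struct {
	name   string
	taints []string
}

// hasTaint reports whether the node carries the given taint key.
func hasTaint(n node, key string) bool {
	for _, t := range n.taints {
		if t == key {
			return true
		}
	}
	return false
}

// countRegistered counts every node, including ones stuck mid-deletion.
func countRegistered(nodes []node) int {
	return len(nodes)
}

// countSchedulable counts only nodes that new pods could actually land on,
// i.e. nodes without the ToBeDeletedByClusterAutoscaler taint.
func countSchedulable(nodes []node) int {
	count := 0
	for _, n := range nodes {
		if !hasTaint(n, toBeDeletedTaint) {
			count++
		}
	}
	return count
}

// needsScaleUp compares an available-node count with the number required.
func needsScaleUp(available, required int) bool {
	return available < required
}
```

With a single node that is stuck with the taint and one node required, `needsScaleUp(countRegistered(nodes), 1)` is false while `needsScaleUp(countSchedulable(nodes), 1)` is true, which is the gap that leaves pods in Pending.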

For some more background:

Fixes for this issue were attempted in kubernetes/autoscaler#4211 and kubernetes/autoscaler#4896, then reverted in kubernetes/autoscaler#5023, and the issue was fixed again in kubernetes/autoscaler#5054. That fix does not work on AKS, because it relies on a cloud-provider-specific implementation that tells cluster-autoscaler whether a node actually exists, based on this comment: kubernetes/autoscaler#5054 (comment)

In order for this fix to work correctly on AKS the following needs to be implemented:

https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/azure/azure_cloud_provider.go#L125

similar to how this was implemented on AWS here: kubernetes/autoscaler#5632
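
For anyone picking this up, here is a rough sketch of the shape of the change as I understand it. Everything in it is a simplified assumption for illustration: the `node` and `provider` types, `errNotImplemented`, `cachedProvider` (which answers from a made-up set of known provider IDs), and `nodeExists` are hypothetical and are not the real cluster-autoscaler or Azure provider API. The fallback branch reflects my reading of the linked discussion, and returning other errors to the caller is a choice made for this sketch.

```go
package main

import "errors"

// errNotImplemented is a stand-in for a provider's "not implemented" error.
var errNotImplemented = errors.New("not implemented")

const toBeDeletedTaint = "ToBeDeletedByClusterAutoscaler"

// node is a hypothetical, cut-down view of a Kubernetes node.
type node struct {
	name       string
	providerID string
	taints     []string
}

// provider is a hypothetical, cut-down cloud provider interface.
type provider interface {
	HasInstance(n node) (bool, error)
}

// unimplementedProvider models a provider that has no HasInstance logic yet.
type unimplementedProvider struct{}

func (unimplementedProvider) HasInstance(node) (bool, error) {
	return false, errNotImplemented
}

// cachedProvider answers from a hypothetical set of known provider IDs.
type cachedProvider struct {
	instances map[string]bool
}

func (p cachedProvider) HasInstance(n node) (bool, error) {
	if n.providerID == "" {
		return false, errors.New("node has no provider ID")
	}
	return p.instances[n.providerID], nil
}

// hasTaint reports whether the node carries the given taint key.
func hasTaint(n node, key string) bool {
	for _, t := range n.taints {
		if t == key {
			return true
		}
	}
	return false
}

// nodeExists prefers the provider's answer and only falls back to the
// taint-based guess when the provider reports "not implemented".
func nodeExists(p provider, n node) (bool, error) {
	exists, err := p.HasInstance(n)
	if err == nil {
		return exists, nil
	}
	if errors.Is(err, errNotImplemented) {
		// Assumed fallback: infer existence from the taint alone.
		return !hasTaint(n, toBeDeletedTaint), nil
	}
	return false, err
}
```

In this sketch, a tainted node whose instance is still present is reported as existing by `cachedProvider`, whereas `unimplementedProvider` forces the taint-based guess and reports it as gone.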

To Reproduce
See linked cluster-autoscaler issues

Expected behavior
Cluster autoscaler should be able to use the HasInstance method to determine whether the node exists on AKS, rather than falling back to the broken logic that relies on the ToBeDeletedByClusterAutoscaler taint.

Environment (please complete the following information):
Affects all recent AKS versions as far as I am aware. We are seeing this on 1.27 specifically

Additional context
N/A

@claassen claassen added the bug label May 14, 2024
@pavneeta pavneeta added cluster-autoscaler Scale and Performance Use this for any AKS scale or performance related issue labels May 16, 2024
@pavneeta
Contributor

Hi @claassen, thanks for creating this issue to report the bug. The AKS team will look into it and reply here.


4 participants