[BUG] Cluster autoscaler bug requires Azure specific implementation to resolve #4286

claassen · 2024-05-14T19:44:35Z

Describe the bug

There is an issue in cluster-autoscaler, described in kubernetes/autoscaler#4456, that was fixed for some cloud providers. Fixing it on AKS requires an implementation of the HasInstance method of the AzureCloudProvider.

The gist of the issue is that when cluster-autoscaler scales down a node, pods can prevent the node from being completely drained and removed (e.g. due to long termination grace periods). The node is left in a state where cluster-autoscaler still counts it towards the number of available nodes, so it does not scale up a new node. New pods cannot be scheduled on the old node, because it is tainted with ToBeDeletedByClusterAutoscaler. As a result, pods get stuck in Pending, and cluster-autoscaler neither scales up a new node for them nor cancels the scale-down of the tainted node.

For some more background:

Fixes were attempted in kubernetes/autoscaler#4211 and kubernetes/autoscaler#4896, then reverted in kubernetes/autoscaler#5023, and the issue was fixed again in kubernetes/autoscaler#5054. That fix does not work for AKS, since it relies on a cloud-provider-specific implementation for cluster-autoscaler to know whether a node actually exists, based on this comment: kubernetes/autoscaler#5054 (comment)

In order for this fix to work correctly on AKS the following needs to be implemented:

https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/azure/azure_cloud_provider.go#L125

similar to how this was implemented on AWS here: kubernetes/autoscaler#5632

To Reproduce
See linked cluster-autoscaler issues

Expected behavior
Cluster autoscaler should be able to use the HasInstance method to determine whether the node exists on AKS, rather than falling back to the broken logic that relies on the ToBeDeletedByClusterAutoscaler taint.
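
To illustrate the difference, here is a rough sketch of the two code paths as I understand them. It uses simplified, hypothetical types (Node, InstanceChecker, cachedProvider, unimplementedProvider, nodeStillExists) and not the real cluster-autoscaler or Azure provider code. It assumes the taint fallback treats a tainted node as already gone, which is my reading of the linked issues rather than something I have verified line by line.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrNotImplemented is a local, hypothetical stand-in for the error a cloud
// provider returns when it has no HasInstance implementation.
var ErrNotImplemented = errors.New("not implemented")

const toBeDeletedTaint = "ToBeDeletedByClusterAutoscaler"

// Node is a simplified stand-in for a Kubernetes node object.
type Node struct {
	Name       string
	ProviderID string
	Taints     []string
}

// InstanceChecker is a hypothetical one-method view of a cloud provider.
type InstanceChecker interface {
	HasInstance(node Node) (bool, error)
}

// unimplementedProvider models a provider without HasInstance support.
type unimplementedProvider struct{}

func (unimplementedProvider) HasInstance(Node) (bool, error) {
	return false, ErrNotImplemented
}

// cachedProvider models a provider that answers from a set of known
// instance IDs (an assumption; a real provider would query its own cache).
type cachedProvider struct {
	instances map[string]bool
}

func (p cachedProvider) HasInstance(node Node) (bool, error) {
	return p.instances[node.ProviderID], nil
}

// hasTaint reports whether the node carries a taint with the given key.
func hasTaint(node Node, key string) bool {
	for _, t := range node.Taints {
		if t == key {
			return true
		}
	}
	return false
}

// nodeStillExists prefers the provider's answer. On any error, including
// ErrNotImplemented, it falls back to the taint heuristic and assumes a
// tainted node is already gone.
func nodeStillExists(p InstanceChecker, node Node) bool {
	exists, err := p.HasInstance(node)
	if err == nil {
		return exists
	}
	return !hasTaint(node, toBeDeletedTaint)
}

func main() {
	// A node stuck mid-drain: tainted, but its VM is still running.
	stuck := Node{
		Name:       "node-a",
		ProviderID: "azure:///example/vm/0",
		Taints:     []string{toBeDeletedTaint},
	}
	withImpl := cachedProvider{instances: map[string]bool{stuck.ProviderID: true}}

	fmt.Println("taint fallback says exists:", nodeStillExists(unimplementedProvider{}, stuck))
	fmt.Println("HasInstance says exists:", nodeStillExists(withImpl, stuck))
}
```

In this sketch the taint fallback gives the wrong answer for a node that is tainted but still running, while a provider that implements HasInstance reports it correctly.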

Environment (please complete the following information):
Affects all recent AKS versions as far as I am aware. We are seeing this on 1.27 specifically.

Additional context
N/A

pavneeta · 2024-05-16T17:40:52Z

Hi @claassen, thanks for creating this issue to report the bug. The AKS team will look into it and report back here.

claassen added the bug label May 14, 2024

pavneeta assigned pavneeta and kevinkrp93 May 16, 2024

pavneeta added the cluster-autoscaler and Scale and Performance labels May 16, 2024

microsoft-github-policy-service bot added the action-required label Jun 10, 2024

tallaxes assigned Bryce-Soghigian Jun 13, 2024

microsoft-github-policy-service bot removed the action-required label Jun 13, 2024

Bryce-Soghigian mentioned this issue Jun 21, 2024

feat: Azure Provider HasInstance implementation kubernetes/autoscaler#6956

Merged

microsoft-github-policy-service bot added the action-required label Jul 8, 2024

This was referenced Aug 22, 2024

chore: backport latest VMs pool and HasInstance() implementations kubernetes/autoscaler#7201

Open

chore: backport latest VMs pool and HasInstance() implementations and resolve inconsistencies in config and unit tests kubernetes/autoscaler#7202

Open
