feat: Azure Provider HasInstance implementation #6956

Bryce-Soghigian · 2024-06-21T16:42:58Z

What type of PR is this?

/kind bug
/kind regression

What this PR does / why we need it:

CA fails to scale up or to cancel an in-progress scale-down when there are unschedulable pods. Stealing this description from the AWS provider implementation.

I think the description of #5054 (comment) explains it well:
...original intent of determining the deleted nodes was incorrect, which led to the issues reported by other users. The nodes tainted with ToBeDeleted were misidentified as Deleted instead of Ready/Unready, which caused a miscalculation of the node being included as Upcoming. This caused problems described in #3949 and #4456.

Which issue(s) this PR fixes:

Special notes for your reviewer:

This PR introduces the HasInstance method to the Azure provider for Cluster Autoscaler. The primary purpose of this method is to ascertain whether a given node has a corresponding instance in the Azure cloud provider. This implementation helps to prevent the undercount of existing VMs and addresses issues related to the taint-based overcount of deleted VMs.

• The HasInstance method ensures that if it is uncertain whether an instance exists, it returns an error instead of false, nil. This approach enforces a fallback to the taint-based determination method, providing a more reliable count of existing VMs.
• If the instance exists: return true, nil
• If the instance does not exist: return *, ErrNotImplemented (consider using a custom error for autoscaled nodes)
• For unimplemented cases: return *, ErrNotImplemented
• For any other errors: return *, error
• ErrNotImplemented is used for silent fallback, while any other errors will be logged for further investigation.

Does this PR introduce a user-facing change?

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

k8s-ci-robot · 2024-06-21T16:43:00Z

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

Bryce-Soghigian · 2024-06-21T18:35:45Z

/test all

tallaxes

Overall LGTM, with minor feedback and some comments on the implications of using the cache.
How was this tested? Can we add unit tests? E2E tests?

cluster-autoscaler/cloudprovider/azure/azure_cache.go

cluster-autoscaler/cloudprovider/azure/azure_cloud_provider.go

cluster-autoscaler/cloudprovider/azure/azure_util.go

cluster-autoscaler/cloudprovider/azure/azure_scale_set_test.go

cluster-autoscaler/cloudprovider/azure/azure_manager.go

Co-authored-by: Alex Leites <18728999+tallaxes@users.noreply.github.com>

…nodes

… didnt make any sense in the first place

…py path

tallaxes · 2024-07-31T16:59:07Z

/lgtm

k8s-ci-robot requested a review from jackfrancis June 21, 2024 16:43

k8s-ci-robot added the cncf-cla: yes label Jun 21, 2024

k8s-ci-robot requested a review from nilo19 June 21, 2024 16:43

k8s-ci-robot added the area/provider/azure and size/M labels Jun 21, 2024

Bryce-Soghigian marked this pull request as ready for review June 21, 2024 18:38

k8s-ci-robot removed the do-not-merge/work-in-progress label Jun 21, 2024

tallaxes reviewed Jun 22, 2024

cluster-autoscaler/cloudprovider/azure/azure_cache.go (outdated)

cluster-autoscaler/cloudprovider/azure/azure_cache.go (outdated)

cluster-autoscaler/cloudprovider/azure/azure_cache.go

tallaxes reviewed Jun 25, 2024

cluster-autoscaler/cloudprovider/azure/azure_cloud_provider.go (outdated)

Bryce-Soghigian force-pushed the bsoghigian/azure/has-instance-impl branch from f0d3407 to ea410de Compare July 12, 2024 23:37

k8s-ci-robot added the size/L label and removed the size/M label Jul 16, 2024

rakechill reviewed Jul 24, 2024
