-
Notifications
You must be signed in to change notification settings - Fork 3.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Azure Provider HasInstance implementation #6956
base: master
Are you sure you want to change the base?
feat: Azure Provider HasInstance implementation #6956
Conversation
Skipping CI for Draft Pull Request. |
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: Bryce-Soghigian The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/test all |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall LGTM, with minor feedback, and some comments re implication of using cache.
How was this tested? Can we add unit tests? E2E tests?
func (m *azureCache) HasInstance(providerID string) bool { | ||
m.mutex.Lock() | ||
defer m.mutex.Unlock() | ||
_, ok := m.instanceToNodeGroup[azureRef{Name: providerID}] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are potential issues with case sensitivity here; should use getInstanceFromCache
which ignores case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using cache to respond here makes sense to me in general. Is there anything else we can say about the consequences relying on (two-level?) cache? Would be good to think through the implications of temporarily falsely reporting instance does not exist (if it is not in cache yet), temporarily falsely reporting instance does exist (if it is in cache still), and potential max delay (based on our cache refresh configuration, I presume ...)
// NOTE: ng.Nodes() gets its values from a direct list call against vmss | ||
// this does not represent nodes that have joined the apiserver just instances | ||
// the nodegroup owns. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is somewhat misleading because a) it could be a non-VMSS nodepool and b) it caches listed VMs (so no quite direct). So maybe something like "ng.Nodes() returns cloud provider instances, not Kubernetes Nodes"? (Though to some extent this is captured in the Nodes function description).
// fetch all the resources since CAS may be operating on mixed nodepools | ||
// including both VMSS and VMs pools |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Something like this comment does belong in front of the function. Maybe "Fetch all resources (VMSS, VMSS VMs, standalone VMs) since autoscaler may be operating on mixed nodepools, including both VMSS and VMs pools."
func (azure *AzureCloudProvider) HasInstance(*apiv1.Node) (bool, error) { | ||
return true, cloudprovider.ErrNotImplemented | ||
func (azure *AzureCloudProvider) HasInstance(node *apiv1.Node) (bool, error) { | ||
ng, err := azure.azureManager.azureCache.FindForInstance(&azureRef{Name: node.Spec.ProviderID}, azure.azureManager.azureCache.vmType) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can be reduced to
ng, err := azure.azureManager.azureCache.FindForInstance(&azureRef{Name: node.Spec.ProviderID}, azure.azureManager.azureCache.vmType) | |
ng, err := azure.azureManager.GetNodeGroupForInstance(&azureRef{Name: node.Spec.ProviderID}) |
or even further to just
ng, err := azure.azureManager.azureCache.FindForInstance(&azureRef{Name: node.Spec.ProviderID}, azure.azureManager.azureCache.vmType) | |
ng, err := azure.NodeGroupForNode(node) |
But this brings up several questions. What exactly are the semantics of HasInstance
and NodeGroupForNode
? If HasInstance
can be correctly and reliably implemented by NodeGroupForNode
(or is it a special property of Azure provider?), why would HasInstance
be needed in the CP interface to begin with? NodeGroupForNode
returning nil is supposed to mean "instance should not be handled by autoscaler", not that it does not exist. Using it (or equivalent logic) for implementing HasInstance
would mean NG node will always be reported as existing, non-NG node will always be reported as deleted; which seems wrong, need to think this through.
f0d3407
to
ea410de
Compare
@Bryce-Soghigian: The following test failed, say
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
What type of PR is this?
/kind bug
/kind regression
What this PR does / why we need it:
CA fails to scale up or cancel in progress schaledown when there are unschedulable pods. Stealing this description from the aws provider implementation.
I think the description of #5054 (comment) explains it well:
...original intent of determining the deleted nodes was incorrect, which led to the issues reported by other users. The nodes tainted with ToBeDeleted were misidentified as Deleted instead of Ready/Unready, which caused a miscalculation of the node being included as Upcoming. This caused problems described in #3949 and #4456.
Which issue(s) this PR fixes:
Special notes for your reviewer:
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: