Identifying cloud provider deleted nodes #5054

Merged

9 commits merged into kubernetes:master on Dec 16, 2022

Conversation


@fookenc commented Jul 27, 2022

Which component this PR applies to?

cluster-autoscaler

What type of PR is this?

/kind bug

What this PR does / why we need it:

Attempts to address the issue raised in #5022 from the previously approved PR. The original implementation (previous work was reverted in #4896 & #5023) changed how Deleted nodes are determined in the cluster state. The new iteration uses the previously cached Cloud Provider nodes in the ClusterStateRegistry to perform a diff. Cloud Provider node instances are retrieved by Node Group, which should prevent non-autoscaled nodes from being flagged erroneously. The test case has been modified to include this scenario, and an additional scenario has been added to ensure that previously identified Cloud Provider nodes will still be tracked until they are no longer registered in Kubernetes.
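In rough terms, the diff works like the sketch below; the helper name, map keys, and data shapes are assumptions for illustration, not the actual ClusterStateRegistry code:

```go
package example

import (
	apiv1 "k8s.io/api/core/v1"
	"k8s.io/autoscaler/cluster-autoscaler/cloudprovider"
)

// findCloudDeletedNodes is an illustrative sketch of the diff described above.
// Instances cached on the previous update (grouped by node group) are compared
// with the instances reported now; an instance that has disappeared while its
// Node object is still registered in Kubernetes is treated as deleted. Because
// the cache is built per node group, non-autoscaled nodes never enter it and
// are never flagged.
func findCloudDeletedNodes(
	previousInstances map[string][]cloudprovider.Instance, // keyed by node group id (assumed)
	currentInstanceIDs map[string]bool, // provider IDs reported on this update
	registeredByProviderID map[string]*apiv1.Node, // nodes currently registered in Kubernetes
) []*apiv1.Node {
	var deleted []*apiv1.Node
	for _, instances := range previousInstances {
		for _, instance := range instances {
			if currentInstanceIDs[instance.Id] {
				continue // still present in the cloud provider
			}
			if node, ok := registeredByProviderID[instance.Id]; ok {
				deleted = append(deleted, node)
			}
		}
	}
	return deleted
}
```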

This PR includes the initial work first introduced in #4211. Feedback from that PR indicated that the original approach to determining deleted nodes was incorrect, which led to the issues reported by other users: nodes tainted with ToBeDeleted were misidentified as Deleted instead of Ready/Unready, causing them to be miscounted as Upcoming. This caused the problems described in #3949 and #4456.

Which issue(s) this PR fixes:

Fixes #3949, #4456, #5022

Special notes for your reviewer:

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

@k8s-ci-robot added the kind/bug, cncf-cla: yes, and size/L labels on Jul 27, 2022

fookenc commented Jul 27, 2022

/assign @x13n


fookenc commented Jul 27, 2022

/assign @MaciekPytel

// Seek nodes that may have been deleted since the last update.
// cloudProviderNodeInstances are retrieved by node group,
// so non-autoscaled nodes will not be included.
for _, instances := range csr.cloudProviderNodeInstances {
Member

Won't relying on previous state cause issues in case of CA restart? IIUC some nodes will be incorrectly considered as Ready/Unready/NotStarted, potentially leading to bad scaling decisions.

Contributor Author

That would definitely occur on a restart. Would it make sense to create a new taint and add it onto the nodes? There might be some additional overhead for the implementation, but it should prevent that scenario.

I also thought about adding a check for each node in the deletedNodes (lines 1001 - 1006) to confirm it doesn't appear in the cloud provider nodes. I don't have enough insight to know whether that is a case that could actually happen, though. I wouldn't want to keep flagging a node as deleted if the cloud provider node exists.
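A minimal sketch of that re-check, assuming both deletedNodes and the current cloud provider instances are keyed by node name (hypothetical names, not the actual clusterstate code):

```go
package example

// reconcileDeletedNodes keeps a node in deletedNodes only while its instance is
// still missing from the cloud provider; if the instance shows up again, the
// node is no longer treated as deleted.
func reconcileDeletedNodes(deletedNodes map[string]struct{}, cloudInstances map[string]struct{}) {
	for name := range deletedNodes {
		if _, exists := cloudInstances[name]; exists {
			delete(deletedNodes, name)
		}
	}
}
```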

Member

I don't particularly like the idea of introducing more taints to carry over CA internal state between restarts. That feels brittle and complex. I think maybe it is ok to just leave it as is: if the instance was deleted right before CA restart, there will be a slight risk of not scaling up until k8s API catches up with the deletion. This should be fine.

Re-checking deletedNodes would catch the case in which an instance disappears from the cloud provider due to some temporary problem and then comes back. Indeed, right now we would treat it as deleted forever, and it would be better to avoid code that cannot recover from such a scenario (even though it is unlikely).

Contributor Author

That makes sense to me. I agree that the API should eventually correct itself, and the scaling should only momentarily be impacted. I've created a new commit that includes the backfill safety check mentioned and some other minor cosmetic code changes.

@x13n left a comment

Thanks for the changes!

I just realized one more problem with this approach though. Since we cannot tell the difference between nodes that were deleted and ones that are not autoscaled, this implementation is going to treat nodes as deleted whenever someone opts a node group out of autoscaling. As a result, CA may trigger large scale-ups without a good reason. One way to handle this would be to add a TTL to deletedNodes to avoid keeping nodes there forever in this scenario. (Perhaps an even better option would be some extension of the cloud provider API to distinguish deleted from not autoscaled, but I'm not sure that's feasible.) WDYT?
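For illustration, such a TTL could be a pruning pass like the one below, assuming deletedNodes records when each node was first considered deleted (hypothetical names; the thread below ultimately favors extending the cloud provider interface instead):

```go
package example

import "time"

// pruneDeletedNodes drops entries older than the TTL so that nodes from a node
// group that was opted out of autoscaling don't stay "deleted" forever.
func pruneDeletedNodes(deletedNodes map[string]time.Time, ttl time.Duration, now time.Time) {
	for name, firstSeen := range deletedNodes {
		if now.Sub(firstSeen) > ttl {
			delete(deletedNodes, name)
		}
	}
}
```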


fookenc commented Aug 18, 2022

That is definitely a concern. I can add a TTL to the deletedNodes entries. Should I add a configurable value for better control or just a common default value?

I had a similar thought about extending the cloud provider implementation. After a brief investigation, it looks like most of the interactions are handled using information in the node (node.Spec.ProviderID specifically), and there doesn't appear to be any validation of whether the node still exists in the cloud. A possible change would be to provide more context about deleted versus not-autoscaled nodes within the NodeGroupForNode function.


x13n commented Aug 19, 2022

I think extending the API would be much cleaner, but the need to implement it for all cloud providers calls for a broader discussion. I added this topic to the SIG meeting agenda so we can discuss it on Monday.


x13n commented Aug 23, 2022

So I guess the conclusion after yesterday's SIG meeting is that we should extend cloud provider interface and preserve the existing (taint-based) behavior in case a specific cloud provider doesn't implement the new function. That way each cloud provider will be able to fix the bug by implementing a function distinguishing deleted from non-autoscaled nodes.


fookenc commented Aug 24, 2022

Should this PR be closed until the cloud provider interface can be extended? What would be the next steps for fixing the bug? Please let me know if there's anything I can do to help. Thanks!


x13n commented Aug 24, 2022

The cloud provider interface can be extended right now; it's just that all implementations would have to get a default implementation that returns a NotImplemented error. I thought this PR could be re-purposed into extending the interface and using it to detect deleted nodes. Each cloud provider would then need to implement their part in a separate PR. Are you willing to make that first, cloud-provider-agnostic, change?
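For reference, the default stub pattern described here looks roughly like the snippet below. The method was first added as isNodeDeleted and later renamed to HasInstance (see the commits further down), so the name and signature should be read as an approximation rather than the final API:

```go
package examplecloudprovider

import (
	apiv1 "k8s.io/api/core/v1"
	"k8s.io/autoscaler/cluster-autoscaler/cloudprovider"
)

type exampleCloudProvider struct{}

// HasInstance reports whether a Kubernetes node still has a backing cloud
// instance. Providers that haven't implemented the check yet return
// ErrNotImplemented, so the caller can fall back to the existing taint-based
// clusterstate logic.
func (p *exampleCloudProvider) HasInstance(node *apiv1.Node) (bool, error) {
	return true, cloudprovider.ErrNotImplemented
}
```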


fookenc commented Aug 27, 2022

Got it! I'll start working on a pass at the implementation and submit it as soon as I can. Thanks again!


x13n commented Sep 16, 2022

If this requires more time, maybe it would make sense to first submit the change to taint cleanup on startup (ready -> all nodes) separately?


fookenc commented Sep 19, 2022

Sorry for the delay here. I've split the taint removal into another PR #5200, and will continue working on the effort here. Please let me know if there's anything else needed. Thanks!


x13n commented Sep 20, 2022

Thanks! The other PR already got merged.

@k8s-ci-robot added the needs-rebase label on Sep 20, 2022
@fookenc closed this on Oct 17, 2022
@fookenc force-pushed the fix-autoscaler-node-deletion branch from 48b6328 to f445a6a on October 17, 2022 21:39
…d provider that are still registered within Kubernetes. Avoids misidentifying not autoscaled nodes as deleted. Simplified implementation to use apiv1.Node instead of new struct. Expanded test cases to include not autoscaled nodes and tracking deleted nodes over multiple updates.

Adding check to backfill loop to confirm cloud provider node no longer exists before flagging the node as deleted. Modifying some comments to be more accurate. Replacing erroneous line deletion.
@k8s-ci-robot added the size/XS label and removed the size/L label on Oct 17, 2022
* Adding isNodeDeleted method to CloudProvider interface. Supports detecting whether nodes are fully deleted or are not-autoscaled. Updated cloud providers to provide initial implementation of new method that will return an ErrNotImplemented to maintain existing taint-based deletion clusterstate calculation.

x13n commented Oct 19, 2022

Thanks for the changes! My main concern is whether simplifying the logic is feasible - I think some of the previous approach got carried over, but it isn't really necessary with the new interface.

…eRegistry, and remove old complicated logic. Adjust the naming of the method for cloud instance deletion from NodeExists to HasInstance.
…tance. Changing deletedNodes to store empty struct instead of node values, and modifying the helper function to utilize that information for tests.
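Taken together, the resulting clusterstate check is roughly the sketch below (illustrative names, not the merged code): HasInstance decides whether a registered node still has a backing instance, the old taint-based check is the fallback when a provider returns an error such as ErrNotImplemented, and deletedNodes keeps only node names as empty-struct values:

```go
package example

import (
	apiv1 "k8s.io/api/core/v1"
	"k8s.io/autoscaler/cluster-autoscaler/cloudprovider"
)

// collectDeletedNodes flags registered nodes whose backing cloud instance is gone.
func collectDeletedNodes(provider cloudprovider.CloudProvider, registered []*apiv1.Node) map[string]struct{} {
	deletedNodes := make(map[string]struct{})
	for _, node := range registered {
		hasInstance, err := provider.HasInstance(node)
		if err != nil {
			// Assumed fallback (e.g. cloudprovider.ErrNotImplemented): preserve the
			// pre-existing taint-based behavior.
			hasInstance = !hasToBeDeletedTaint(node)
		}
		if !hasInstance {
			deletedNodes[node.Name] = struct{}{}
		}
	}
	return deletedNodes
}

// hasToBeDeletedTaint is a stand-in for the autoscaler's existing ToBeDeleted taint check.
func hasToBeDeletedTaint(node *apiv1.Node) bool {
	for _, taint := range node.Spec.Taints {
		if taint.Key == "ToBeDeletedByClusterAutoscaler" {
			return true
		}
	}
	return false
}
```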

x13n commented Dec 9, 2022

Thanks for following up on this!

/lgtm

@k8s-ci-robot added the lgtm label on Dec 9, 2022

x13n commented Dec 16, 2022

Hm, looks like I should be able to approve this now as well, let's see:

/approve

@k8s-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: fookenc, x13n

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the approved label on Dec 16, 2022
@k8s-ci-robot merged commit ba3b244 into kubernetes:master on Dec 16, 2022

ohnoitsjmo commented Jan 12, 2023

@fookenc Will this release fix be backported to previous CA versions (including 1.22)?


fookenc commented Jan 13, 2023

Unfortunately, this release alone won't fix the issues until the cloud providers have implemented the new interface method. Once they have, the change could be backported, but I don't know how soon that will occur. @x13n do you know if this could be backported to earlier CA versions?


x13n commented Jan 16, 2023

1.22 is past the k8s support window at this point, so unless there's a critical bug to fix, I don't think any backport is going to be released.

vadasambar added a commit to vadasambar/autoscaler that referenced this pull request Mar 29, 2023
- this is a follow-up to kubernetes#5054
- this might fix kubernetes#4456

Signed-off-by: vadasambar <suraj.bankar@acquia.com>
vadasambar added a commit to vadasambar/autoscaler that referenced this pull request Mar 31, 2023
- this is a follow-up to kubernetes#5054
- this might fix kubernetes#4456

Signed-off-by: vadasambar <suraj.bankar@acquia.com>

fix: make `HasInstance` in aws provider thread-safe

Signed-off-by: vadasambar <suraj.bankar@acquia.com>
vadasambar added a commit to vadasambar/autoscaler that referenced this pull request May 22, 2023
- this is a follow-up to kubernetes#5054
- this might fix kubernetes#4456

fix: make `HasInstance` in aws provider thread-safe

Signed-off-by: vadasambar <suraj.bankar@acquia.com>
(cherry picked from commit 1cb55fe)
vadasambar added a commit to vadasambar/autoscaler that referenced this pull request May 22, 2023
- this is a follow-up to kubernetes#5054
- this might fix kubernetes#4456

Signed-off-by: vadasambar <suraj.bankar@acquia.com>

fix: make `HasInstance` in aws provider thread-safe

Signed-off-by: vadasambar <suraj.bankar@acquia.com>
(cherry picked from commit 1cb55fe)
James-QiuHaoran pushed a commit to James-QiuHaoran/autoscaler that referenced this pull request Jul 29, 2023
- this is a follow-up to kubernetes#5054
- this might fix kubernetes#4456

Signed-off-by: vadasambar <suraj.bankar@acquia.com>

fix: make `HasInstance` in aws provider thread-safe

Signed-off-by: vadasambar <suraj.bankar@acquia.com>
jigish pushed a commit to airbnb/autoscaler that referenced this pull request Feb 6, 2024
- this is a follow-up to kubernetes#5054
- this might fix kubernetes#4456

Signed-off-by: vadasambar <suraj.bankar@acquia.com>

fix: make `HasInstance` in aws provider thread-safe

Signed-off-by: vadasambar <suraj.bankar@acquia.com>
(cherry picked from commit 1cb55fe)
Labels: approved, area/cluster-autoscaler, cncf-cla: yes, kind/bug, lgtm, size/L

Successfully merging this pull request may close these issues: Node groups get "stuck" during node deletion