Identifying cloud provider deleted nodes #5054

Merged

9 commits merged into kubernetes:master on Dec 16, 2022

Conversation


@fookenc commented Jul 27, 2022

Which component this PR applies to?

cluster-autoscaler

What type of PR is this?

/kind bug

What this PR does / why we need it:

Attempts to address the issue raised in #5022 from the previously approved PR. The original implementation (previous work was reverted in #4896 & #5023) changed how Deleted nodes are determined in the cluster state. The new iteration uses the previously cached Cloud Provider nodes in the ClusterStateRegistry to perform a diff. Cloud Provider node instances are retrieved by Node Group, which should prevent non-autoscaled nodes from being flagged erroneously. The test case has been modified to include this scenario, and an additional scenario has been added to ensure that previously identified Cloud Provider nodes will still be tracked until they are no longer registered in Kubernetes.
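In rough terms, the diff works like the sketch below; the helper name, map keys, and data shapes are assumptions for illustration, not the actual ClusterStateRegistry code:

```go
package example

import (
	apiv1 "k8s.io/api/core/v1"
	"k8s.io/autoscaler/cluster-autoscaler/cloudprovider"
)

// findCloudDeletedNodes is an illustrative sketch of the diff described above.
// Instances cached on the previous update (grouped by node group) are compared
// with the instances reported now; an instance that has disappeared while its
// Node object is still registered in Kubernetes is treated as deleted. Because
// the cache is built per node group, non-autoscaled nodes never enter it and
// are never flagged.
func findCloudDeletedNodes(
	previousInstances map[string][]cloudprovider.Instance, // keyed by node group id (assumed)
	currentInstanceIDs map[string]bool, // provider IDs reported on this update
	registeredByProviderID map[string]*apiv1.Node, // nodes currently registered in Kubernetes
) []*apiv1.Node {
	var deleted []*apiv1.Node
	for _, instances := range previousInstances {
		for _, instance := range instances {
			if currentInstanceIDs[instance.Id] {
				continue // still present in the cloud provider
			}
			if node, ok := registeredByProviderID[instance.Id]; ok {
				deleted = append(deleted, node)
			}
		}
	}
	return deleted
}
```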

This PR includes the initial work first introduced in #4211. Feedback from that PR indicated that the original approach to determining deleted nodes was incorrect, which led to the issues reported by other users: nodes tainted with ToBeDeleted were misidentified as Deleted instead of Ready/Unready, causing them to be miscounted as Upcoming. This caused the problems described in #3949 and #4456.

Which issue(s) this PR fixes:

Fixes #3949, #4456, #5022

Special notes for your reviewer:

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

@k8s-ci-robot added the kind/bug, cncf-cla: yes, and size/L labels on Jul 27, 2022

fookenc commented Jul 27, 2022

/assign @x13n


fookenc commented Jul 27, 2022

/assign @MaciekPytel

// Seek nodes that may have been deleted since the last update.
// cloudProviderNodeInstances are retrieved by node group,
// so non-autoscaled nodes will not be included.
for _, instances := range csr.cloudProviderNodeInstances {
Member

Won't relying on previous state cause issues in case of CA restart? IIUC some nodes will be incorrectly considered as Ready/Unready/NotStarted, potentially leading to bad scaling decisions.

Contributor Author

That would definitely occur on a restart. Would it make sense to create a new taint and add it onto the nodes? There might be some additional overhead for the implementation, but it should prevent that scenario.

I also thought about adding a check for each node in the deletedNodes (lines 1001 - 1006) to confirm it doesn't appear in the cloud provider nodes. I don't have enough insight to know whether that is a case that could actually happen, though. I wouldn't want to keep flagging a node as deleted if the cloud provider node exists.
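A minimal sketch of that re-check, assuming both deletedNodes and the current cloud provider instances are keyed by node name (hypothetical names, not the actual clusterstate code):

```go
package example

// reconcileDeletedNodes keeps a node in deletedNodes only while its instance is
// still missing from the cloud provider; if the instance shows up again, the
// node is no longer treated as deleted.
func reconcileDeletedNodes(deletedNodes map[string]struct{}, cloudInstances map[string]struct{}) {
	for name := range deletedNodes {
		if _, exists := cloudInstances[name]; exists {
			delete(deletedNodes, name)
		}
	}
}
```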

Member

I don't particularly like the idea of introducing more taints to carry over CA internal state between restarts. That feels brittle and complex. I think maybe it is ok to just leave it as is: if the instance was deleted right before CA restart, there will be a slight risk of not scaling up until k8s API catches up with the deletion. This should be fine.

Re-checking deletedNodes would catch the case in which an instance disappears from the cloud provider due to some temporary problem and then comes back. Indeed, right now we would treat it as deleted forever, and it would be better to avoid code that cannot recover from such a scenario (even though it is unlikely).

Contributor Author

That makes sense to me. I agree that the API should eventually correct itself, and the scaling should only momentarily be impacted. I've created a new commit that includes the backfill safety check mentioned and some other minor cosmetic code changes.

@x13n left a comment

Thanks for the changes!

I just realized one more problem with this approach though. Since we cannot tell the difference between nodes that were deleted and ones that are not autoscaled, this implementation is going to treat nodes as deleted whenever someone opts a node group out of autoscaling. As a result, CA may trigger large scale-ups without a good reason. One way to handle this would be to add a TTL to deletedNodes to avoid keeping nodes there forever in this scenario. (Perhaps an even better option would be some extension of the cloud provider API to distinguish deleted from not autoscaled, but I'm not sure that's feasible.) WDYT?
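For illustration, such a TTL could be a pruning pass like the one below, assuming deletedNodes records when each node was first considered deleted (hypothetical names; the thread below ultimately favors extending the cloud provider interface instead):

```go
package example

import "time"

// pruneDeletedNodes drops entries older than the TTL so that nodes from a node
// group that was opted out of autoscaling don't stay "deleted" forever.
func pruneDeletedNodes(deletedNodes map[string]time.Time, ttl time.Duration, now time.Time) {
	for name, firstSeen := range deletedNodes {
		if now.Sub(firstSeen) > ttl {
			delete(deletedNodes, name)
		}
	}
}
```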


fookenc commented Aug 18, 2022

That is definitely a concern. I can add a TTL to the deletedNodes entries. Should I add a configurable value for better control or just a common default value?

I had a similar thought about extending the cloud provider implementation. After a brief investigation, it looks like most of the interactions are handled using information in the node (node.Spec.ProviderID specifically), and there doesn't appear to be any validation of whether the node still exists in the cloud. A possible change would be to provide more context about deleted versus not-autoscaled nodes within the NodeGroupForNode function.


x13n commented Aug 19, 2022

I think extending the API would be much cleaner, but the need to implement it for all cloud providers calls for a broader discussion. I added this topic to the SIG meeting agenda so we can discuss it on Monday.


x13n commented Aug 23, 2022

So I guess the conclusion after yesterday's SIG meeting is that we should extend cloud provider interface and preserve the existing (taint-based) behavior in case a specific cloud provider doesn't implement the new function. That way each cloud provider will be able to fix the bug by implementing a function distinguishing deleted from non-autoscaled nodes.


fookenc commented Aug 24, 2022

Should this PR be closed until the cloud provider interface can be extended? What would be the next steps for fixing the bug? Please let me know if there's anything I can do to help. Thanks!


x13n commented Aug 24, 2022

The cloud provider interface can be extended right now; it's just that all implementations would have to get a default implementation that returns a NotImplemented error. I thought this PR could be re-purposed into extending the interface and using it to detect deleted nodes. Each cloud provider would then need to implement their part in a separate PR. Are you willing to make that first, cloud-provider-agnostic, change?
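For reference, the default stub pattern described here looks roughly like the snippet below. The method was first added as isNodeDeleted and later renamed to HasInstance (see the commits further down), so the name and signature should be read as an approximation rather than the final API:

```go
package examplecloudprovider

import (
	apiv1 "k8s.io/api/core/v1"
	"k8s.io/autoscaler/cluster-autoscaler/cloudprovider"
)

type exampleCloudProvider struct{}

// HasInstance reports whether a Kubernetes node still has a backing cloud
// instance. Providers that haven't implemented the check yet return
// ErrNotImplemented, so the caller can fall back to the existing taint-based
// clusterstate logic.
func (p *exampleCloudProvider) HasInstance(node *apiv1.Node) (bool, error) {
	return true, cloudprovider.ErrNotImplemented
}
```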


fookenc commented Aug 27, 2022

Got it! I'll start working on a pass at the implementation and submit it as soon as I can. Thanks again!


x13n commented Sep 16, 2022

If this requires more time, maybe it would make sense to first submit the change to taint cleanup on startup (ready -> all nodes) separately?


fookenc commented Sep 19, 2022

Sorry for the delay here. I've split the taint removal into another PR #5200, and will continue working on the effort here. Please let me know if there's anything else needed. Thanks!


x13n commented Sep 20, 2022

Thanks! The other PR already got merged.

@k8s-ci-robot added the needs-rebase label on Sep 20, 2022
@fookenc closed this on Oct 17, 2022
@fookenc force-pushed the fix-autoscaler-node-deletion branch from 48b6328 to f445a6a on October 17, 2022 21:39
…d provider that are still registered within Kubernetes. Avoids misidentifying not autoscaled nodes as deleted. Simplified implementation to use apiv1.Node instead of new struct. Expanded test cases to include not autoscaled nodes and tracking deleted nodes over multiple updates.

Adding check to backfill loop to confirm cloud provider node no longer exists before flagging the node as deleted. Modifying some comments to be more accurate. Replacing erroneous line deletion.
@k8s-ci-robot added the size/XS label and removed the size/L label on Oct 17, 2022
* Adding isNodeDeleted method to CloudProvider interface. Supports detecting whether nodes are fully deleted or are not-autoscaled. Updated cloud providers to provide initial implementation of new method that will return an ErrNotImplemented to maintain existing taint-based deletion clusterstate calculation.

x13n commented Oct 19, 2022

Thanks for the changes! My main concern is whether simplifying the logic is feasible - I think some of the previous approach got carried over, but it isn't really necessary with the new interface.

…eRegistry, and remove old complicated logic. Adjust the naming of the method for cloud instance deletion from NodeExists to HasInstance.
…tance. Changing deletedNodes to store empty struct instead of node values, and modifying the helper function to utilize that information for tests.
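Taken together, the resulting clusterstate check is roughly the sketch below (illustrative names, not the merged code): HasInstance decides whether a registered node still has a backing instance, the old taint-based check is the fallback when a provider returns an error such as ErrNotImplemented, and deletedNodes keeps only node names as empty-struct values:

```go
package example

import (
	apiv1 "k8s.io/api/core/v1"
	"k8s.io/autoscaler/cluster-autoscaler/cloudprovider"
)

// collectDeletedNodes flags registered nodes whose backing cloud instance is gone.
func collectDeletedNodes(provider cloudprovider.CloudProvider, registered []*apiv1.Node) map[string]struct{} {
	deletedNodes := make(map[string]struct{})
	for _, node := range registered {
		hasInstance, err := provider.HasInstance(node)
		if err != nil {
			// Assumed fallback (e.g. cloudprovider.ErrNotImplemented): preserve the
			// pre-existing taint-based behavior.
			hasInstance = !hasToBeDeletedTaint(node)
		}
		if !hasInstance {
			deletedNodes[node.Name] = struct{}{}
		}
	}
	return deletedNodes
}

// hasToBeDeletedTaint is a stand-in for the autoscaler's existing ToBeDeleted taint check.
func hasToBeDeletedTaint(node *apiv1.Node) bool {
	for _, taint := range node.Spec.Taints {
		if taint.Key == "ToBeDeletedByClusterAutoscaler" {
			return true
		}
	}
	return false
}
```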

x13n commented Dec 9, 2022

Thanks for following up on this!

/lgtm

@k8s-ci-robot added the lgtm label on Dec 9, 2022

x13n commented Dec 16, 2022

Hm, looks like I should be able to approve this now as well, let's see:

/approve

@k8s-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: fookenc, x13n

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the approved label on Dec 16, 2022
@k8s-ci-robot merged commit ba3b244 into kubernetes:master on Dec 16, 2022

ohnoitsjmo commented Jan 12, 2023

@fookenc Will this release fix be backported to previous CA versions (including 1.22)?


fookenc commented Jan 13, 2023

Unfortunately, this release alone won't fix the issues until the cloud providers have implemented the new interface method. Once they have, the change could be backported, but I don't know how soon that will occur. @x13n do you know if this could be backported to earlier CA versions?


x13n commented Jan 16, 2023

1.22 is past the k8s support window at this point, so unless there's a critical bug to fix, I don't think any backport is going to be released.

vadasambar added a commit to vadasambar/autoscaler that referenced this pull request Mar 29, 2023
- this is a follow-up to kubernetes#5054
- this might fix kubernetes#4456

Signed-off-by: vadasambar <suraj.bankar@acquia.com>
vadasambar added a commit to vadasambar/autoscaler that referenced this pull request Mar 31, 2023
- this is a follow-up to kubernetes#5054
- this might fix kubernetes#4456

Signed-off-by: vadasambar <suraj.bankar@acquia.com>

fix: make `HasInstance` in aws provider thread-safe

Signed-off-by: vadasambar <suraj.bankar@acquia.com>
vadasambar added a commit to vadasambar/autoscaler that referenced this pull request May 22, 2023
- this is a follow-up to kubernetes#5054
- this might fix kubernetes#4456

fix: make `HasInstance` in aws provider thread-safe

Signed-off-by: vadasambar <suraj.bankar@acquia.com>
(cherry picked from commit 1cb55fe)
vadasambar added a commit to vadasambar/autoscaler that referenced this pull request May 22, 2023
- this is a follow-up to kubernetes#5054
- this might fix kubernetes#4456

Signed-off-by: vadasambar <suraj.bankar@acquia.com>

fix: make `HasInstance` in aws provider thread-safe

Signed-off-by: vadasambar <suraj.bankar@acquia.com>
(cherry picked from commit 1cb55fe)
James-QiuHaoran pushed a commit to James-QiuHaoran/autoscaler that referenced this pull request Jul 29, 2023
- this is a follow-up to kubernetes#5054
- this might fix kubernetes#4456

Signed-off-by: vadasambar <suraj.bankar@acquia.com>

fix: make `HasInstance` in aws provider thread-safe

Signed-off-by: vadasambar <suraj.bankar@acquia.com>
jigish pushed a commit to airbnb/autoscaler that referenced this pull request Feb 6, 2024
- this is a follow-up to kubernetes#5054
- this might fix kubernetes#4456

Signed-off-by: vadasambar <suraj.bankar@acquia.com>

fix: make `HasInstance` in aws provider thread-safe

Signed-off-by: vadasambar <suraj.bankar@acquia.com>
(cherry picked from commit 1cb55fe)
Labels: approved, area/cluster-autoscaler, cncf-cla: yes, kind/bug, lgtm, size/L

Successfully merging this pull request may close these issues: Node groups get "stuck" during node deletion