
Not autoscaled node groups are treated as deleted #5022

Closed · x13n opened this issue Jul 13, 2022 · 8 comments
Labels
area/cluster-autoscaler, kind/bug, lifecycle/stale

Comments

x13n (Member) commented Jul 13, 2022

Which component are you using?:

cluster-autoscaler

What version of the component are you using?:

Component version: running from current HEAD

What k8s version are you using (kubectl version)?: 1.24

What environment is this in?: GKE

What did you expect to happen?: CA should treat upcoming nodes in non-autoscaled node groups as upcoming, not deleted, which would prevent an unnecessary scale-up.

What happened instead?: CA triggers a scale-up even though there are nodes in non-autoscaled node groups that could run the pods.

How to reproduce it (as minimally and precisely as possible): In GKE, create a cluster with a default node pool that is not autoscaled and enable NAP. Observe NAP create a new node pool. Sometimes it won't happen, if the scheduler manages to schedule all the pods before CA kicks in.

Anything else we need to know?: The way deleted nodes are detected changed in #4896; we should probably roll it back and figure this problem out before reapplying the change.

/cc @fookenc @MaciekPytel

x13n (Member, Author) commented Jul 13, 2022

Chatted with @MaciekPytel about this. It should be sufficient to detect not-autoscaled node groups (and stop marking them as deleted) the same way scale down does: by checking whether NodeGroupForNode(node) returns nil:

nodeGroup, err := sd.context.CloudProvider.NodeGroupForNode(node)
if err != nil {
    return simulator.UnexpectedError, nil
}
if nodeGroup == nil || reflect.ValueOf(nodeGroup).IsNil() {
    // We should never get here as non-autoscaled nodes should not be included in scaleDownCandidates list
    // (and the default PreFilteringScaleDownNodeProcessor would indeed filter them out).
    klog.Warningf("Skipped %s from delete consideration - the node is not autoscaled", node.Name)
    return simulator.NotAutoscaled, nil
}

Perhaps worth wrapping this check into a function and using it in both places.
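A minimal sketch of what such a shared helper might look like, assuming the cloud provider is available in both code paths (the function name and placement are illustrative, not actual autoscaler code):

import (
    "reflect"

    apiv1 "k8s.io/api/core/v1"
    "k8s.io/autoscaler/cluster-autoscaler/cloudprovider"
)

// belongsToAutoscaledNodeGroup reports whether the node is managed by an
// autoscaled node group, i.e. whether NodeGroupForNode returns a non-nil group.
func belongsToAutoscaledNodeGroup(provider cloudprovider.CloudProvider, node *apiv1.Node) (bool, error) {
    nodeGroup, err := provider.NodeGroupForNode(node)
    if err != nil {
        return false, err
    }
    // NodeGroupForNode can return a typed nil interface, hence the reflection
    // check mirrored from the scale down snippet above.
    if nodeGroup == nil || reflect.ValueOf(nodeGroup).IsNil() {
        return false, nil
    }
    return true, nil
}

The scale down path effectively performs this check already (see the snippet above); the idea would be to reuse the same predicate when classifying nodes as deleted.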

fookenc (Contributor) commented Jul 18, 2022

Hi @x13n & @MaciekPytel,

From my local testing, I'm not sure that NodeGroupForNode will solve the issue. I found that after a node is deleted from the cloud provider, NodeGroupForNode also returns nil. Unfortunately, this makes deleted nodes and not-autoscaled nodes appear the same.
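To illustrate the ambiguity (this is just a sketch of the observed behavior, not autoscaler code): a check based solely on NodeGroupForNode ends up in the same branch for both cases:

nodeGroup, err := provider.NodeGroupForNode(node)
if err != nil {
    return err
}
if nodeGroup == nil || reflect.ValueOf(nodeGroup).IsNil() {
    // Reached both when the node's cloud instance has already been deleted and
    // when the node simply belongs to a non-autoscaled node group - the nil
    // result alone cannot distinguish the two cases.
}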

Is there another way to determine not autoscaled nodes?

fookenc (Contributor) commented Jul 28, 2022

I've submitted a new PR #5054 to reintroduce the code changes that were reverted and to address the issue detailed here. The changes also add new scenarios to the test case for this functionality, including not-autoscaled nodes (nodes without a node group), to ensure they are no longer incorrectly flagged as deleted.

Please review the new changes and let me know if there are areas of concern or improvement.
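For context, a hypothetical table-driven sketch of the kind of scenario described (the helper and test names are illustrative, not the actual test code in #5054): a node without a node group must not be counted as deleted, while an autoscaled node whose cloud instance is gone still is.

package example

import "testing"

// isDeleted is a stand-in for the real detection logic: a node should be
// considered deleted only if it is autoscaled (has a node group) and its
// backing cloud provider instance no longer exists.
func isDeleted(hasNodeGroup, instanceExists bool) bool {
    return hasNodeGroup && !instanceExists
}

func TestNotAutoscaledNodesAreNotDeleted(t *testing.T) {
    cases := []struct {
        name           string
        hasNodeGroup   bool
        instanceExists bool
        want           bool
    }{
        {"autoscaled node, instance present", true, true, false},
        {"autoscaled node, instance gone", true, false, true},
        {"not autoscaled node (no node group)", false, true, false},
    }
    for _, tc := range cases {
        t.Run(tc.name, func(t *testing.T) {
            if got := isDeleted(tc.hasNodeGroup, tc.instanceExists); got != tc.want {
                t.Errorf("isDeleted() = %v, want %v", got, tc.want)
            }
        })
    }
}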

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label Oct 26, 2022
x13n (Member, Author) commented Oct 26, 2022

/remove-lifecycle stale

@k8s-ci-robot removed the lifecycle/stale label Oct 26, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label Jan 24, 2023
x13n (Member, Author) commented Jan 27, 2023

This was fixed.

/close

@k8s-ci-robot (Contributor)

@x13n: Closing this issue.

In response to this:

This was fixed.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
