Cluster autoscaler: Backoff is not persisted after partial scale up failure #2730

Closed
tghartland opened this issue Jan 13, 2020 · 6 comments · Fixed by #5756
Labels
area/cluster-autoscaler, area/core-autoscaler, lifecycle/rotten

Comments

@tghartland
Contributor

Maybe this is intended, and in any case isn't too much of an issue as far as I can tell.

This will always happen during a scale up operation where some but not all new nodes fail to be created, for example when attempting to add two new nodes but the cloud quota only allows for one.

When the autoscaler sees the failed node (from NodeGroup.Nodes()), it applies a backoff to the node group and removes the node. Once the failed node has been removed, only the healthy node remains, and when that node finishes creating the autoscaler considers the scale up to have completed successfully, so it resets the backoff on the node group.

The autoscaler will then immediately attempt another scale up, which will of course fail again because the node group is already at the quota limit. This is where I think the backoff should still be in effect.

The second scale up fails completely, which does properly apply a backoff.

I can't think of a situation where a cluster could repeatedly partially fail a scale up, which is why I don't think this is too serious.

In the code, this is because the only places where a scale up request can be completed are here

if scaleUpRequest.Increase+delta <= 0 {
	// increase <= 0 means that there is no scale-up intent really
	delete(csr.scaleUpRequests, nodeGroup.Id())
	return
}
which requires that all nodes failed to be created, in which case the backoff is kept.
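
To make the accounting concrete, here is a minimal, self-contained sketch (field and function names are simplified assumptions, not the real clusterstate API) of why this branch only fires when the whole request has failed: failed nodes subtract from the outstanding increase, and the request is only deleted, with the backoff kept, once nothing is left of it.

package main

import "fmt"

// Hypothetical, simplified model of the scale-up request accounting;
// names are illustrative, not the actual cluster-autoscaler types.
type scaleUpRequest struct {
	Increase int // nodes still expected from this request
}

// applyDelta mirrors the branch quoted above: a negative delta is applied
// when nodes fail to create, and the request is only dropped (keeping the
// node group backoff) once the whole intent is gone.
func applyDelta(requests map[string]*scaleUpRequest, nodeGroup string, delta int) {
	req, ok := requests[nodeGroup]
	if !ok {
		return
	}
	if req.Increase+delta <= 0 {
		// All requested nodes failed: drop the request, keep the backoff.
		delete(requests, nodeGroup)
		return
	}
	// Partial failure: the request shrinks but survives, and is later
	// treated as a successful scale-up once the remaining nodes register.
	req.Increase += delta
}

func main() {
	requests := map[string]*scaleUpRequest{"ng-1": {Increase: 2}}
	applyDelta(requests, "ng-1", -1)       // one of two nodes failed, e.g. on quota
	fmt.Println(requests["ng-1"].Increase) // 1: the request survives, and the backoff will later be reset
}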

Or here

if !csr.areThereUpcomingNodesInNodeGroup(nodeGroupName) {
	// scale-out finished successfully
	// remove it and reset node group backoff
	delete(csr.scaleUpRequests, nodeGroupName)
	csr.backoff.RemoveBackoff(scaleUpRequest.NodeGroup, csr.nodeInfosForGroups[scaleUpRequest.NodeGroup.Id()])
	klog.V(4).Infof("Scale up in group %v finished successfully in %v",
		nodeGroupName, currentTime.Sub(scaleUpRequest.Time))
	continue
}

which requires that all (remaining) nodes were successfully created, in which case the backoff is removed. This is the path taken when only some of the nodes failed and were removed from the scale up. Perhaps the scale up request could track whether some nodes had failed and then act differently here; a rough sketch of that is below.
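
A minimal sketch of that idea, using a hypothetical FailedNodes counter that the real ScaleUpRequest does not have: the backoff would only be cleared when the request finishes without having seen any failures.

package main

import "fmt"

// Illustrative only: FailedNodes is a hypothetical field, not part of the
// real ScaleUpRequest struct in clusterstate.go.
type scaleUpRequest struct {
	Increase    int
	FailedNodes int // incremented whenever a node from this request fails to create
}

// shouldResetBackoff decides what to do once the request has no upcoming
// nodes left: only a scale-up that saw no failures resets the backoff, so a
// partially failed one keeps it and the next loop does not immediately retry
// against, for example, an exhausted quota.
func shouldResetBackoff(req *scaleUpRequest) bool {
	return req.FailedNodes == 0
}

func main() {
	fmt.Println(shouldResetBackoff(&scaleUpRequest{Increase: 1}))                 // true: clean scale-up
	fmt.Println(shouldResetBackoff(&scaleUpRequest{Increase: 1, FailedNodes: 1})) // false: keep backoff
}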

Thoughts?

@MaciekPytel
Contributor

I'm not convinced we should always back off as soon as at least one node fails to create. I consider the cost of a failed scale-up attempt to be lower than a workload staying pending because CA decided not to scale up. It's possible that a VM failed to boot or register in Kubernetes for some random reason, and retrying the scale-up may still work out just fine. Without any knowledge of the error I would err on the side of not backing off after a partially failed scale-up.
That being said, in the case of quota errors retrying is very unlikely to succeed and it seems reasonable to back off immediately. We already have a notion of an OutOfResources error, and maybe we could back off more aggressively on those?
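
As a rough illustration of that idea only: the error classes and durations below are made-up assumptions (the autoscaler does classify cloud provider errors, but not with these names or numbers), showing a backoff that is more aggressive for out-of-resources/quota failures than for other errors.

package main

import (
	"fmt"
	"time"
)

// errorClass loosely mirrors the idea of classifying cloud provider errors;
// the type, values and durations here are illustrative, not the real API.
type errorClass int

const (
	otherError          errorClass = iota
	outOfResourcesError // quota exhausted, stockout, etc.
)

// backoffFor picks a backoff duration for a node group after a failed node:
// quota/stockout failures back off immediately and for longer, since
// retrying them is very unlikely to succeed.
func backoffFor(class errorClass) time.Duration {
	switch class {
	case outOfResourcesError:
		return 30 * time.Minute
	default:
		return 5 * time.Minute
	}
}

func main() {
	fmt.Println(backoffFor(outOfResourcesError)) // 30m0s
	fmt.Println(backoffFor(otherError))          // 5m0s
}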

On a more technical side, the problem is that we don't have a mechanism that would attribute a particular node to a particular scaleUpRequest (there may be multiple scale-ups on the same node group, nodes could have been added by users or by some other controller such as node upgrade, etc.). My gut feeling is that doing this properly would require some tricky bookkeeping, but maybe I'm wrong on this.

Either way, this doesn't sound like a terribly large problem, and it's unlikely any member of the core team will pick this up anytime soon. If you want to give it a shot, I'm happy to discuss in more detail, review, etc.

cc: @losipiuk who worked on backoff more recently than I did and may know this code better

MaciekPytel added the area/core-autoscaler label on Jan 17, 2020
@tghartland
Contributor Author

All the information is there, but there isn't a direct mapping between a ScaleUpRequest and a ScaleUpFailure to get the failure reason. I agree that identifying scale ups that cause OutOfResources errors is about the only improvement that could be made here.

Regarding simultaneous scale up requests, in the code it looks like these are merged into one scale up request, since the requests are stored as a map from node group ID to a single request. I tried triggering multiple scale ups in a cluster, and what happened to the backoff was the same as if they had all been part of the same request to begin with.
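
For illustration, a minimal sketch of that merging behaviour, under the assumption (suggested by the map-keyed storage) that a new scale-up on the same node group simply bumps the existing request; the names are simplified, not the real clusterstate API.

package main

import "fmt"

// Simplified model of concurrent scale-ups on one node group: requests are
// keyed by node group ID, so a second scale-up increases the existing entry
// rather than creating a separate request.
type scaleUpRequest struct {
	Increase int
}

func registerScaleUp(requests map[string]*scaleUpRequest, nodeGroupID string, delta int) {
	if req, ok := requests[nodeGroupID]; ok {
		req.Increase += delta // merged into the existing request
		return
	}
	requests[nodeGroupID] = &scaleUpRequest{Increase: delta}
}

func main() {
	requests := map[string]*scaleUpRequest{}
	registerScaleUp(requests, "ng-1", 2)
	registerScaleUp(requests, "ng-1", 3)
	fmt.Println(requests["ng-1"].Increase) // 5: one merged request, not two
}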

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

k8s-ci-robot added the lifecycle/stale label on Apr 16, 2020
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label on May 16, 2020
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
