Cluster autoscaler: Backoff is not persisted after partial scale up failure #2730

Closed
tghartland opened this issue Jan 13, 2020 · 6 comments · Fixed by #5756
Labels
area/cluster-autoscaler, area/core-autoscaler, lifecycle/rotten

Comments

@tghartland
Contributor

Maybe this is intended, and in any case isn't too much of an issue as far as I can tell.

This will always happen during a scale up operation where some but not all new nodes fail to be created, for example when attempting to add two new nodes but the cloud quota only allows for one.

When the autoscaler sees the failed node (from NodeGroup.Nodes()), it applies a backoff to the node group and removes the node. Once the failed node has been removed, only the healthy node remains, and when that node finishes creating the autoscaler considers the scale up to have completed successfully, so it resets the backoff on the node group.

The autoscaler will then immediately attempt another scale up, which will of course fail again because the node group is already at the quota limit. This is where I think the backoff should still be in effect.

The second scale up fails completely, which does properly apply a backoff.

I can't think of a situation where a cluster could repeatedly partially fail a scale up, which is why I don't think this is too serious.

In the code, this is because the only places where a scale up request can be completed are here

if scaleUpRequest.Increase+delta <= 0 {
	// increase <= 0 means that there is no scale-up intent really
	delete(csr.scaleUpRequests, nodeGroup.Id())
	return
}
which requires that all nodes failed to be created, in which case the backoff is kept.
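
To make the accounting concrete, here is a minimal, self-contained sketch (field and function names are simplified assumptions, not the real clusterstate API) of why this branch only fires when the whole request has failed: failed nodes subtract from the outstanding increase, and the request is only deleted, with the backoff kept, once nothing is left of it.

package main

import "fmt"

// Hypothetical, simplified model of the scale-up request accounting;
// names are illustrative, not the actual cluster-autoscaler types.
type scaleUpRequest struct {
	Increase int // nodes still expected from this request
}

// applyDelta mirrors the branch quoted above: a negative delta is applied
// when nodes fail to create, and the request is only dropped (keeping the
// node group backoff) once the whole intent is gone.
func applyDelta(requests map[string]*scaleUpRequest, nodeGroup string, delta int) {
	req, ok := requests[nodeGroup]
	if !ok {
		return
	}
	if req.Increase+delta <= 0 {
		// All requested nodes failed: drop the request, keep the backoff.
		delete(requests, nodeGroup)
		return
	}
	// Partial failure: the request shrinks but survives, and is later
	// treated as a successful scale-up once the remaining nodes register.
	req.Increase += delta
}

func main() {
	requests := map[string]*scaleUpRequest{"ng-1": {Increase: 2}}
	applyDelta(requests, "ng-1", -1)       // one of two nodes failed, e.g. on quota
	fmt.Println(requests["ng-1"].Increase) // 1: the request survives, and the backoff will later be reset
}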

Or here

if !csr.areThereUpcomingNodesInNodeGroup(nodeGroupName) {
	// scale-out finished successfully
	// remove it and reset node group backoff
	delete(csr.scaleUpRequests, nodeGroupName)
	csr.backoff.RemoveBackoff(scaleUpRequest.NodeGroup, csr.nodeInfosForGroups[scaleUpRequest.NodeGroup.Id()])
	klog.V(4).Infof("Scale up in group %v finished successfully in %v",
		nodeGroupName, currentTime.Sub(scaleUpRequest.Time))
	continue
}

which requires that all (remaining) nodes were successfully created, in which case the backoff is removed. This is the path taken when only some of the nodes failed and were removed from the scale up. Perhaps the scale up request could track whether some nodes had failed and then act differently here; a rough sketch of that is below.
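
A minimal sketch of that idea, using a hypothetical FailedNodes counter that the real ScaleUpRequest does not have: the backoff would only be cleared when the request finishes without having seen any failures.

package main

import "fmt"

// Illustrative only: FailedNodes is a hypothetical field, not part of the
// real ScaleUpRequest struct in clusterstate.go.
type scaleUpRequest struct {
	Increase    int
	FailedNodes int // incremented whenever a node from this request fails to create
}

// shouldResetBackoff decides what to do once the request has no upcoming
// nodes left: only a scale-up that saw no failures resets the backoff, so a
// partially failed one keeps it and the next loop does not immediately retry
// against, for example, an exhausted quota.
func shouldResetBackoff(req *scaleUpRequest) bool {
	return req.FailedNodes == 0
}

func main() {
	fmt.Println(shouldResetBackoff(&scaleUpRequest{Increase: 1}))                 // true: clean scale-up
	fmt.Println(shouldResetBackoff(&scaleUpRequest{Increase: 1, FailedNodes: 1})) // false: keep backoff
}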

Thoughts?

@MaciekPytel
Contributor

I'm not convinced we should always back off as soon as at least one node fails to create. I consider the cost of a failed scale-up attempt to be lower than a workload staying pending because CA decided not to scale up. It's possible that a VM failed to boot or register in Kubernetes for some random reason, and retrying the scale-up may still work out just fine. Without any knowledge of the error I would err on the side of not backing off after a partially failed scale-up.
That being said, in the case of quota errors retrying is very unlikely to succeed and it seems reasonable to back off immediately. We already have a notion of an OutOfResources error, and maybe we could back off more aggressively on those?
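
As a rough illustration of that idea only: the error classes and durations below are made-up assumptions (the autoscaler does classify cloud provider errors, but not with these names or numbers), showing a backoff that is more aggressive for out-of-resources/quota failures than for other errors.

package main

import (
	"fmt"
	"time"
)

// errorClass loosely mirrors the idea of classifying cloud provider errors;
// the type, values and durations here are illustrative, not the real API.
type errorClass int

const (
	otherError          errorClass = iota
	outOfResourcesError // quota exhausted, stockout, etc.
)

// backoffFor picks a backoff duration for a node group after a failed node:
// quota/stockout failures back off immediately and for longer, since
// retrying them is very unlikely to succeed.
func backoffFor(class errorClass) time.Duration {
	switch class {
	case outOfResourcesError:
		return 30 * time.Minute
	default:
		return 5 * time.Minute
	}
}

func main() {
	fmt.Println(backoffFor(outOfResourcesError)) // 30m0s
	fmt.Println(backoffFor(otherError))          // 5m0s
}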

On a more technical side, the problem is that we don't have a mechanism that would attribute a particular node to a particular scaleUpRequest (there may be multiple scale-ups on the same node group, nodes could have been added by users or by some other controller such as node upgrade, etc.). My gut feeling is that doing this properly would require some tricky bookkeeping, but maybe I'm wrong on this.

Either way, this doesn't sound like a terribly large problem, and it's unlikely any member of the core team will pick this up anytime soon. If you want to give it a shot, I'm happy to discuss in more detail, review, etc.

cc: @losipiuk who worked on backoff more recently than I did and may know this code better

MaciekPytel added the area/core-autoscaler label on Jan 17, 2020
@tghartland
Contributor Author

All the information is there, but there isn't a direct mapping between a ScaleUpRequest and a ScaleUpFailure to get the failure reason. I agree that identifying scale ups that cause OutOfResources errors is about the only improvement that could be made here.

Regarding simultaneous scale up requests, in the code it looks like these are merged into one scale up request, since the requests are stored as a map from node group ID to a single request. I tried triggering multiple scale ups in a cluster, and what happened to the backoff was the same as if they had all been part of the same request to begin with.
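
For illustration, a minimal sketch of that merging behaviour, under the assumption (suggested by the map-keyed storage) that a new scale-up on the same node group simply bumps the existing request; the names are simplified, not the real clusterstate API.

package main

import "fmt"

// Simplified model of concurrent scale-ups on one node group: requests are
// keyed by node group ID, so a second scale-up increases the existing entry
// rather than creating a separate request.
type scaleUpRequest struct {
	Increase int
}

func registerScaleUp(requests map[string]*scaleUpRequest, nodeGroupID string, delta int) {
	if req, ok := requests[nodeGroupID]; ok {
		req.Increase += delta // merged into the existing request
		return
	}
	requests[nodeGroupID] = &scaleUpRequest{Increase: delta}
}

func main() {
	requests := map[string]*scaleUpRequest{}
	registerScaleUp(requests, "ng-1", 2)
	registerScaleUp(requests, "ng-1", 3)
	fmt.Println(requests["ng-1"].Increase) // 5: one merged request, not two
}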

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

k8s-ci-robot added the lifecycle/stale label on Apr 16, 2020
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label on May 16, 2020
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
