Cluster autoscaler: Backoff is not persisted after partial scale up failure #2730
I'm not convinced we should always back off as soon as at least one node fails to create. I consider the cost of a failed scale-up attempt to be lower than that of a workload staying pending because CA decided not to scale up. It's possible that a VM failed to boot or register in Kubernetes for some random reason, and retrying the scale-up may still work out just fine. Without any knowledge of the error, I would err on the side of not backing off after a partially failed scale-up.

On the more technical side, the problem is that we don't have a mechanism that would attribute a particular node to a particular scaleUpRequest (there may be multiple scale-ups on the same node group, nodes could have been added by users or some other controller such as node upgrade, etc.). My gut feeling is that doing this properly would require some tricky bookkeeping, but maybe I'm wrong on this.

Either way, this doesn't sound like a terribly large problem, and it's unlikely any member of the core team will pick this up anytime soon. If you want to give it a shot, I'm happy to discuss in more detail, review, etc.

cc: @losipiuk, who worked on backoff more recently than I did and may know this code better.
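To make the bookkeeping problem concrete, here is a purely hypothetical sketch of what attributing new instances to scale-up requests could look like. None of these types or functions exist in the autoscaler today; every name below is invented for illustration.

```go
// Hypothetical sketch only: the autoscaler has no such attribution mechanism.
// All names here are invented for illustration.
package sketch

import "time"

type trackedScaleUp struct {
	NodeGroup string
	Requested int
	Deadline  time.Time
	// Instance IDs believed to belong to this request. Filling this in is the
	// tricky part: nodes may also be added by users or other controllers, and
	// several requests can target the same node group.
	Instances map[string]bool
}

// attributeInstance tries to assign a newly observed instance to one of the
// outstanding requests for its node group, or returns nil if it cannot.
func attributeInstance(requests []*trackedScaleUp, nodeGroup, instanceID string, seen time.Time) *trackedScaleUp {
	for _, r := range requests {
		if r.NodeGroup != nodeGroup {
			continue
		}
		if seen.After(r.Deadline) || len(r.Instances) >= r.Requested {
			// Too late for this request, or it already has all its nodes.
			continue
		}
		r.Instances[instanceID] = true
		return r
	}
	// Likely added by a user or another controller (e.g. node upgrade),
	// or attribution simply failed.
	return nil
}
```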
All the information is there, but there isn't a direct mapping between a ScaleUpRequest and a ScaleUpFailure to get the failure reason. I agree that identifying scale-ups that cause OutOfResources errors is about the only improvement that could be made here.

Regarding simultaneous scale-up requests, in the code it looks like these are merged into one scale-up request, since the requests are stored as a map from node group ID to a single request. I tried triggering multiple scale-ups in a cluster, and what happened to the backoff was the same as if they had all been part of the same request to begin with.
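For what it's worth, here is a condensed sketch of how I read that merging behaviour. The field and function names only approximate the real clusterstate code, so treat this as an illustration rather than the upstream implementation.

```go
// Approximate sketch of how simultaneous scale-ups on one node group appear
// to be folded together; not the exact upstream code.
package sketch

import "time"

type scaleUpRequest struct {
	nodeGroupName   string
	time            time.Time
	expectedAddTime time.Time
	increase        int
}

type registry struct {
	// One entry per node group: a second scale-up on the same group is merged
	// into the existing request rather than tracked separately.
	scaleUpRequests map[string]*scaleUpRequest
}

func (r *registry) registerOrUpdateScaleUp(nodeGroupName string, delta int, now time.Time, maxNodeProvisionTime time.Duration) {
	if req, found := r.scaleUpRequests[nodeGroupName]; found {
		// Merge: bump the expected size and push out the deadline, so two
		// requests become indistinguishable from one larger request.
		req.increase += delta
		req.time = now
		req.expectedAddTime = now.Add(maxNodeProvisionTime)
		return
	}
	r.scaleUpRequests[nodeGroupName] = &scaleUpRequest{
		nodeGroupName:   nodeGroupName,
		time:            now,
		expectedAddTime: now.Add(maxNodeProvisionTime),
		increase:        delta,
	}
}
```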
Maybe this is intended, and in any case it isn't too much of an issue as far as I can tell.
This will always happen during a scale up operation where some but not all new nodes fail to be created, for example when attempting to add two new nodes but the cloud quota only allows for one.
When the autoscaler sees the failed node (from NodeGroup.Nodes()) it will apply a backoff to the node group and remove the node. After the node has been removed, only the healthy node remains, and when this node finishes creating the autoscaler considers the scale-up to have completed successfully, so it resets the backoff on the node group. The autoscaler will then immediately attempt another scale-up, which will of course fail again because the node group is already at the quota limit. This is where I think the backoff should still be in effect.
The second scale up completely fails which does properly apply a backoff.
I can't think of a situation where a cluster could repeatedly partially fail a scale up, which is why I don't think this is too serious.
In the code, this is because the only places where a scale-up request can be completed are here (autoscaler/cluster-autoscaler/clusterstate/clusterstate.go, lines 220 to 223 at 866c27c), which requires that all nodes failed to be created, in which case the backoff is kept;
or here (autoscaler/cluster-autoscaler/clusterstate/clusterstate.go, lines 247 to 255 at 866c27c), which requires that all (remaining) nodes were successfully created, in which case the backoff is removed. This is where the scale-up completes when only some of the nodes failed and were removed from the scale-up. Perhaps the scale-up request could track whether some nodes had failed and then act differently here (see the sketch below).
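As a purely hypothetical illustration of that idea, the request could carry a flag recording that some of its nodes failed, and the completion path could skip the backoff reset when the flag is set. None of these names exist in clusterstate.go; they are invented here.

```go
// Hypothetical sketch: let the request remember that some of its nodes
// failed, and keep the backoff when such a request completes.
package sketch

import "time"

type scaleUpRequestWithFailures struct {
	nodeGroupName string
	increase      int
	// partialFailure would be set whenever a node belonging to this request
	// fails to create while other nodes are still being added.
	partialFailure bool
}

type backoffRegistry struct {
	scaleUpRequests map[string]*scaleUpRequestWithFailures
	backedOffUntil  map[string]time.Time
}

// completeScaleUp models the second code path referenced above: all remaining
// nodes registered, so the request is considered fulfilled.
func (r *backoffRegistry) completeScaleUp(req *scaleUpRequestWithFailures) {
	delete(r.scaleUpRequests, req.nodeGroupName)
	if req.partialFailure {
		// Keep the existing backoff: part of the request never materialized,
		// so an immediate retry is likely to fail the same way (e.g. quota).
		return
	}
	// Only a fully clean scale-up clears the backoff.
	delete(r.backedOffUntil, req.nodeGroupName)
}
```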
Thoughts?