add option to keep node group backoff on OutOfResource error #5756
Conversation
Welcome @wllbo!
Hi @wllbo, please sign the CLA before the PR can be reviewed.
@wllbo Excellent job on this! I'd really like to see this merged into main so I can use the feature.
/assign @towca |
I fully get the motivation for this change, but IMO the implementation goes in the wrong direction in terms of readability. Unfortunately CSR is both very convoluted and critical to CA's core functionality - we should be extra careful not to make it less readable than it already is.

The current behavior is IMO surprising while reading the code (I also don't fully get why we need it in the first place, but that's another story). The initial feel when reading the code is that if any node for a given scale-up fails to come up, the node group is backed off. You have to parse a lot of logic to get to the "oh, so a partially failed scale-up actually resets the backoff very quickly" realization. This change would add another mental step on top of that - "oh, unless it's a resource-related error, then the backoff is kept". WDYT about something like below instead? I think it'd increase the readability slightly, while still achieving the same result.

Given CSR's nature explained above, this change should also be well-covered with unit tests.
@towca thank you for the detailed feedback on the PR. Based on your suggestions, I've made the following changes:
Please let me know if there are any further changes or clarifications I should make.
@wllbo Please rebase the PR so that the merge conflicts can be resolved.
@wllbo Can you please rebase the PR to resolve the merge conflicts? Thanks in advance.
Hey @wllbo, sorry for the delay, this PR fell off my radar completely :(

I've discussed this with the author of the original code, and it seems that the current behavior (clearing the backoff whenever a scale-up ends) doesn't make sense anymore. Back in the early days of CA, when it was written, scale-ups were done 1 node at a time, and there weren't many different node configs available in cloud providers. Clearing the backoff early was the safe choice - there was likely no other node group to fall back to anyway, and so the only cost was potentially 1 more scale-up that would fail. Now, many more node configurations are available, and CA has to fall back between them - so the cost of clearing the backoff early is less effective fallback.

Obviously there can be multiple nodes in a single scale-up now, and we'll probably always see the errors from the VMs that failed before we see the other VMs come up. So we'll just always cut the backoff short in the event of a partial scale-up failure, which wasn't the original intention at all.

All this to say: neither I nor the original author sees any clear advantage to resetting the backoff early anymore. Could you just remove that part from
Hey @towca, thank you for getting back to me on this, and no worries about the delay.
Thank you!
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: towca, wllbo.
What type of PR is this?
/kind feature
What this PR does / why we need it:
This PR introduces a new flag, `node-group-keep-backoff-out-of-resources`, aimed at giving users control over the cluster autoscaler's (CA) backoff strategy when a node group scale-up operation fails due to an out-of-resources error from the cloud provider. With this flag enabled, CA will respect the entire duration of the backoff period for a node group that has hit an out-of-resources error during scale-up. This differs from the current default behavior, which can silently and prematurely remove the backoff, leading CA to repeatedly attempt to scale up an exhausted node group. This can significantly delay scale-up, especially for larger scale-ups. The change is useful for clusters where node groups use scarce instance types or where the cloud provider has long wait times for provisioning additional capacity.
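Conceptually, the flag gates the backoff-clearing decision on the class of the scale-up error. A minimal Go sketch of that decision (illustrative only — `keepBackoff` and `errorClass` are hypothetical names, not the actual cluster-autoscaler identifiers):

```go
package main

import "fmt"

// errorClass distinguishes resource-exhaustion failures from other
// scale-up errors. This is a sketch, not the real CA error taxonomy.
type errorClass int

const (
	otherError errorClass = iota
	outOfResourcesError
)

// keepBackoff reports whether a node group's backoff should be kept for
// its full duration after a failed scale-up. With the flag disabled (the
// default), the backoff is cleared once the scale-up finishes; with it
// enabled, out-of-resources failures keep the backoff so the expander
// can fall back to other node groups in the meantime.
func keepBackoff(keepOnOutOfResources bool, class errorClass) bool {
	return keepOnOutOfResources && class == outOfResourcesError
}

func main() {
	fmt.Println(keepBackoff(true, outOfResourcesError))  // true
	fmt.Println(keepBackoff(true, otherError))           // false
	fmt.Println(keepBackoff(false, outOfResourcesError)) // false
}
```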
By configuring `node-group-keep-backoff-out-of-resources` in tandem with `initial-node-group-backoff-duration` and `max-node-group-backoff-duration`, users can optimize the scale-up strategy to minimize attempts and delays by allowing CA to try scaling up different node groups while the cloud provider provisions additional instances.

Example
priority-expander config:
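(The config itself was not captured in this transcript; a hypothetical priority-expander ConfigMap consistent with the ASG names in the walkthrough below — priority values and name patterns are illustrative assumptions, with a higher number meaning higher priority — might look like:)

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-autoscaler-priority-expander
  namespace: kube-system
data:
  priorities: |-
    50:
      - .*compute-1.*
    40:
      - .*compute-2.*
    30:
      - .*compute-3.*
    20:
      - .*compute-4.*
    10:
      - .*compute-5.*
```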
CA made ASG scale-up attempts in this order: compute-1 -> compute-2 -> compute-3 -> compute-1 -> compute-2. ASGs compute-4 and compute-5 were never attempted before CA recycled through the out-of-capacity ASGs. The second attempts to scale compute-1 and compute-2 were made prematurely because the backoffs were removed when the scale-up requests finished rather than at expiration. With the proposed feature, CA would respect the entire backoff duration for compute-1, compute-2, and compute-3, potentially allowing scale-up attempts on compute-4 and compute-5.
Which issue(s) this PR fixes:
Fixes #2730
Special notes for your reviewer:
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: