
add option to keep node group backoff on OutOfResource error #5756

Merged: 5 commits merged into kubernetes:master on Feb 13, 2024

Conversation

@wllbo (Contributor) commented May 12, 2023

What type of PR is this?

/kind feature

What this PR does / why we need it:

This PR introduces a new flag, node-group-keep-backoff-out-of-resources, which gives users control over the Cluster Autoscaler's (CA) backoff strategy when a node group scale-up fails because the cloud provider reports an out-of-resources error.

With this flag enabled, CA respects the entire backoff period for a node group that has hit an out-of-resources error during scale-up. This differs from the current default behavior, which can silently and prematurely remove the backoff, leading CA to repeatedly attempt to scale up an exhausted node group. That can significantly delay scale-up, especially for larger scale-ups. The change is useful for clusters whose node groups use scarce instance types or whose cloud provider takes a long time to provision additional capacity.

By configuring node-group-keep-backoff-out-of-resources in tandem with initial-node-group-backoff-duration and max-node-group-backoff-duration, users can tune the scale-up strategy to minimize retries and delays, letting CA try other node groups while the cloud provider provisions additional capacity.
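
For illustration, the proposed flag could be set alongside the existing backoff-duration flags as sketched below; the duration values are arbitrary examples, not recommendations. (As the review discussion later in this thread concludes, the flag was ultimately dropped in favor of simply removing the early backoff reset.)

--node-group-keep-backoff-out-of-resources=true
--initial-node-group-backoff-duration=5m
--max-node-group-backoff-duration=30m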

Example

priority-expander config:

priorities:
  50:
    - .*compute-1.*
  40:
    - .*compute-2.*
  30:
    - .*compute-3.*
  20:
    - .*compute-4.*
  10:
    - .*compute-5.*

CA attempted ASG scale-ups in this order: compute-1 -> compute-2 -> compute-3 -> compute-1 -> compute-2. ASGs compute-4 and compute-5 were never attempted before CA cycled back through the out-of-capacity ASGs. The second attempts on compute-1 and compute-2 were made prematurely because their backoffs were removed as soon as the scale-up requests finished, rather than at expiration. With the proposed feature, CA would respect the full backoff duration for compute-1, compute-2 and compute-3, potentially allowing scale-up attempts on compute-4 and compute-5.

I0511 21:28:03.431934       1 priority.go:163] priority expander: compute-1-AutoScalingGroup chosen as the highest available
W0511 21:29:05.806767       1 auto_scaling_groups.go:453] Instance group compute-1-AutoScalingGroup cannot provision any more nodes!
I0511 21:29:05.854410       1 clusterstate.go:1084] Failed adding 68 nodes (68 unseen previously) to group compute-1-AutoScalingGroup due to OutOfResource.placeholder-cannot-be-fulfilled; errorMessages=[]string{"AWS cannot provision any more instances for this node group"}
W0511 21:29:05.854451       1 clusterstate.go:294] Disabling scale-up for node group compute-1-AutoScalingGroup until 2023-05-11 21:34:04.970892339 +0000 UTC m=+448.035594703; errorClass=OutOfResource; errorCode=placeholder-cannot-be-fulfilled


W0511 21:29:30.133720       1 orchestrator.go:511] Node group compute-1-AutoScalingGroup is not ready for scaleup - backoff
I0511 21:29:30.875841       1 priority.go:163] priority expander: compute-2-AutoScalingGroup chosen as the highest available
W0511 21:30:33.170140       1 auto_scaling_groups.go:453] Instance group compute-2-AutoScalingGroup cannot provision any more nodes!
I0511 21:30:33.215512       1 clusterstate.go:1084] Failed adding 60 nodes (60 unseen previously) to group compute-2-AutoScalingGroup due to OutOfResource.placeholder-cannot-be-fulfilled; errorMessages=[]string{"AWS cannot provision any more instances for this node group"}
W0511 21:30:33.215547       1 clusterstate.go:294] Disabling scale-up for node group compute-2-AutoScalingGroup until 2023-05-11 21:35:32.568995506 +0000 UTC m=+535.633697859; errorClass=OutOfResource; errorCode=placeholder-cannot-be-fulfilled


W0511 21:30:54.139919       1 orchestrator.go:511] Node group compute-2-AutoScalingGroup is not ready for scaleup - backoff
W0511 21:30:54.469904       1 orchestrator.go:511] Node group compute-1-AutoScalingGroup is not ready for scaleup - backoff
I0511 21:30:55.164562       1 priority.go:163] priority expander: compute-3-AutoScalingGroup chosen as the highest available
# Removing compute-1-AutoScalingGroup backoff before expiration at 21:34:04.970892339
I0511 21:31:36.935048       1 clusterstate.go:265] Scale up in group compute-1-AutoScalingGroup finished successfully in 3m23.087469244s
W0511 21:31:58.677023       1 auto_scaling_groups.go:453] Instance group compute-3-AutoScalingGroup cannot provision any more nodes!
I0511 21:31:58.724134       1 clusterstate.go:1084] Failed adding 118 nodes (118 unseen previously) to group compute-3-AutoScalingGroup due to OutOfResource.placeholder-cannot-be-fulfilled; errorMessages=[]string{"AWS cannot provision any more instances for this node group"}
W0511 21:31:58.724163       1 clusterstate.go:294] Disabling scale-up for node group compute-3-AutoScalingGroup until 2023-05-11 21:36:57.970873708 +0000 UTC m=+621.035576071; errorClass=OutOfResource; errorCode=placeholder-cannot-be-fulfilled

W0511 21:32:29.184151       1 orchestrator.go:511] Node group compute-3-AutoScalingGroup is not ready for scaleup - backoff
W0511 21:32:29.635896       1 orchestrator.go:511] Node group compute-2-AutoScalingGroup is not ready for scaleup - backoff
I0511 21:32:30.634765       1 priority.go:163] priority expander: compute-1-AutoScalingGroup chosen as the highest available
# Removing compute-2-AutoScalingGroup backoff before expiration at 21:35:32.568995506
I0511 21:33:23.049938       1 clusterstate.go:265] Scale up in group compute-2-AutoScalingGroup finished successfully in 3m51.568357773s
W0511 21:33:33.928385       1 auto_scaling_groups.go:453] Instance group compute-1-AutoScalingGroup cannot provision any more nodes!
I0511 21:33:33.979834       1 clusterstate.go:1084] Failed adding 40 nodes (40 unseen previously) to group compute-1-AutoScalingGroup due to OutOfResource.placeholder-cannot-be-fulfilled; errorMessages=[]string{"AWS cannot provision any more instances for this node group"}
W0511 21:33:33.979877       1 clusterstate.go:294] Disabling scale-up for node group compute-1-AutoScalingGroup until 2023-05-11 21:38:33.206839958 +0000 UTC m=+716.271542331; errorClass=OutOfResource; errorCode=placeholder-cannot-be-fulfilled


W0511 21:33:50.765746       1 orchestrator.go:511] Node group compute-3-AutoScalingGroup is not ready for scaleup - backoff
W0511 21:33:51.696754       1 orchestrator.go:511] Node group compute-1-AutoScalingGroup is not ready for scaleup - backoff
I0511 21:33:52.256395       1 priority.go:163] priority expander: compute-2-AutoScalingGroup chosen as the highest available
# Removing compute-3-AutoScalingGroup backoff before expiration at 21:36:57.970873708
I0511 21:34:02.528498       1 clusterstate.go:265] Scale up in group compute-3-AutoScalingGroup finished successfully in 2m56.526591463s
W0511 21:34:54.470085       1 auto_scaling_groups.go:453] Instance group compute-2-AutoScalingGroup cannot provision any more nodes!
I0511 21:34:54.520000       1 clusterstate.go:1084] Failed adding 47 nodes (47 unseen previously) to group compute-2-AutoScalingGroup due to OutOfResource.placeholder-cannot-be-fulfilled; errorMessages=[]string{"AWS cannot provision any more instances for this node group"}
W0511 21:34:54.520033       1 clusterstate.go:294] Disabling scale-up for node group compute-2-AutoScalingGroup until 2023-05-11 21:39:53.877496735 +0000 UTC m=+796.942199088; errorClass=OutOfResource; errorCode=placeholder-cannot-be-fulfilled

Which issue(s) this PR fixes:

Fixes #2730

Special notes for your reviewer:

  • All existing unit tests pass
  • Tested on AWS EKS cluster

Does this PR introduce a user-facing change?

Add flag `node-group-keep-backoff-out-of-resources` which instructs CA to respect the entire duration of the backoff period for a node group that has hit an out-of-resources error during scale up.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot added the do-not-merge/work-in-progress and kind/feature labels on May 12, 2023
@linux-foundation-easycla bot commented May 12, 2023

CLA Signed

The committers listed above are authorized under a signed CLA.

@k8s-ci-robot added the cncf-cla: no label on May 12, 2023
@k8s-ci-robot (Contributor):

Welcome @wllbo!

It looks like this is your first PR to kubernetes/autoscaler 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/autoscaler has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot added the size/M label on May 12, 2023
@Shubham82 (Contributor):

Hi @wllbo, please sign the CLA so the PR can be reviewed.
See the following document for instructions: Signing Contributor License Agreements (CLA)

@k8s-ci-robot added the cncf-cla: yes label and removed the cncf-cla: no label on May 13, 2023
@wllbo marked this pull request as ready for review on May 13, 2023 02:03
@k8s-ci-robot removed the do-not-merge/work-in-progress label on May 13, 2023
@k8s-ci-robot requested a review from x13n on May 13, 2023 02:04
@k8s-ci-robot added the needs-rebase label on Jul 6, 2023
@gyanesh-mishra:

@wllbo Excellent job on this! I'd really like to see this merged into main so I can use the feature.

@towca (Collaborator) commented Sep 1, 2023

/assign @towca

@towca (Collaborator) commented Sep 4, 2023

I fully get the motivation for this change, but IMO the implementation goes in the wrong direction in terms of readability. Unfortunately CSR (the ClusterStateRegistry) is both very convoluted and critical to CA's core functionality - we should be extra careful not to make it less readable than it already is.

The current behavior is IMO surprising while reading the code (I also don't fully get why we need it in the first place, but that's another story). The initial impression when reading the code is that if any node for a given scale-up fails to come up, the node group is backed off. You have to parse a lot of logic to get to the "oh, so a partially failed scale-up actually resets the backoff very quickly" realization. This change would add another mental step on top of that - "oh, unless it's a resource-related error, in which case the backoff is kept".

WDYT about something like below instead? I think it'd increase the readability slightly, while still achieving the same result.

  1. Attach the encountered errors to ScaleUpRequest instead of the backoff. This has the added benefit of not changing the Backoff interface just for this one purpose.
  2. In the part after the scale-up finishes, rename things so that it's not "keeping backoff sometimes", but rather "removing the backoff early sometimes, based on the encountered errors (if there haven't been any likely-to-be-persistent (e.g. OutOfResources) errors encountered)".
  3. I'd also try to explain what's happening in that part in the first place - e.g. a comment like "A node group can be backed off in the middle of an ongoing scale-up because of instance creation errors. If the errors are not likely to be persistent, and the last instance we see comes up successfully, the backoff is removed quickly. This allows CA to retry scaling up the failed instances in the same node group. If we have pods targeting only a particular node group, and the instance creation errors are temporary, this minimizes the time the pods stay pending - at the cost of potentially making one more scale-up that fails.".

Given CSR's nature explained above, this change should also be well-covered with unit tests.
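
A rough Go sketch of this suggested direction follows, for illustration only: the ErrorClasses field and removeBackoffEarly helper are hypothetical names, and the struct is heavily simplified compared with the real ScaleUpRequest in clusterstate.

package clusterstate

import "time"

// ScaleUpRequest is heavily simplified here; the real struct carries more fields.
type ScaleUpRequest struct {
    NodeGroupName   string
    ExpectedAddTime time.Time
    // ErrorClasses records the classes of errors (e.g. "OutOfResource")
    // observed while this scale-up was in flight (hypothetical field).
    ErrorClasses []string
}

// removeBackoffEarly reports whether a finished scale-up should clear the node
// group's backoff before it expires: only when no likely-to-be-persistent
// errors (such as OutOfResource) were observed (hypothetical helper).
func removeBackoffEarly(r *ScaleUpRequest) bool {
    for _, class := range r.ErrorClasses {
        if class == "OutOfResource" {
            return false
        }
    }
    return true
}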

@wllbo force-pushed the keep-backoff-out-of-resources branch from 70e1991 to e85eeae on October 18, 2023 17:59
@k8s-ci-robot added the size/L label and removed the size/M label on Oct 18, 2023
@wllbo (Contributor, Author) commented Oct 18, 2023

@towca thank you for the detailed feedback on the PR. Based on your suggestions, I've made the following changes:

  • attached the encountered errors to ScaleUpRequest instead of modifying the backoff itself.
  • refactored logic to emphasize the early removal of a backoff based on the encountered errors, instead of sometimes "keeping" it.
  • added a comment explaining the rationale for the behavior, based on the example you provided. This should provide clearer context to future readers of the code.
  • added additional unit tests to cover the new behavior and changes introduced.

Please let me know if there are any further changes or clarifications I should make.

@k8s-ci-robot removed the needs-rebase label on Oct 18, 2023
@k8s-ci-robot added the needs-rebase label on Nov 23, 2023
@Shubham82 (Contributor) commented Nov 24, 2023

@wllbo Please rebase the PR so that the merge conflicts can be resolved.

@kmsarabu commented Feb 5, 2024

@wllbo Can you please rebase the PR to resolve the merge conflicts? Thanks in advance.

@towca (Collaborator) commented Feb 9, 2024

Hey @wllbo, sorry for the delay, this PR fell off my radar completely :(

I've discussed this with the author of the original code, and it seems that the current behavior (clearing the backoff whenever a scale-up ends) doesn't make sense anymore. Back in the early days of CA, when that code was written, scale-ups were done one node at a time and cloud providers didn't offer many different node configurations. Clearing the backoff early was the safe choice - there was likely no other node group to fall back to anyway, so the only cost was potentially one more scale-up that would fail.

Now many more node configurations are available and CA has to fall back between them, so the cost of clearing the backoff early is a less effective fallback. There can also be multiple nodes in a single scale-up, and we'll almost always see the errors from the VMs that failed before we see the remaining VMs come up. So we end up always cutting the backoff short on a partial scale-up failure, which wasn't the original intention at all.

All this to say: neither I nor the original author sees any clear advantage to resetting the backoff early anymore. Could you just remove that part from updateScaleRequests, instead of making it configurable? This also has the benefit of not adding another flag/config option, of which CA already has way too many. Sorry for wasting your time with the previous suggestion.
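
A paraphrased sketch of what the requested change amounts to is shown below. The types are minimal stand-ins and the function body is reconstructed from the log messages quoted earlier in this thread, not copied from clusterstate.go.

package clusterstate

import (
    "time"

    "k8s.io/klog/v2"
)

// Minimal stand-ins for the real types; the actual ClusterStateRegistry and
// ScaleUpRequest carry many more fields.
type scaleUpRequest struct {
    increase int       // nodes still expected to come up
    started  time.Time // when the scale-up was requested
}

type clusterStateRegistry struct {
    scaleUpRequests map[string]*scaleUpRequest
}

// updateScaleRequests, paraphrased: once a scale-up has finished, its request
// is dropped. Before this PR the node group's backoff was also removed at this
// point, which cut OutOfResource backoffs short; the fix is to delete that
// call and let the backoff expire on its own schedule.
func (csr *clusterStateRegistry) updateScaleRequests(now time.Time) {
    for id, sur := range csr.scaleUpRequests {
        if sur.increase <= 0 {
            delete(csr.scaleUpRequests, id)
            // csr.backoff.RemoveBackoff(...)  // the call removed by this PR
            klog.V(4).Infof("Scale up in group %v finished successfully in %v", id, now.Sub(sur.started))
        }
    }
}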

@wllbo force-pushed the keep-backoff-out-of-resources branch from e85eeae to 4477707 on February 13, 2024 15:21
@k8s-ci-robot added the size/S label and removed the needs-rebase and size/L labels on Feb 13, 2024
@wllbo (Contributor, Author) commented Feb 13, 2024

Hey @towca, thank you for getting back to me on this, and no worries about the delay.
I completely agree with the benefits of just removing that one line instead. I was hesitant to remove it outright because of its longstanding presence in the code, but your discussion with the original author and the explanation you provided have addressed that concern; thanks for doing that. I've reverted my previous changes, removed the RemoveBackoff line from updateScaleRequests, and updated the test case to reflect the change in logic.

@towca (Collaborator) commented Feb 13, 2024

Thank you!
/lgtm
/approve

@k8s-ci-robot added the lgtm label on Feb 13, 2024
@k8s-ci-robot (Contributor):

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: towca, wllbo


@k8s-ci-robot added the approved label on Feb 13, 2024
@k8s-ci-robot merged commit 7031519 into kubernetes:master on Feb 13, 2024 (6 checks passed)
Successfully merging this pull request may close these issues.

Cluster autoscaler: Backoff is not persisted after partial scale up failure