
add option to keep node group backoff on OutOfResource error #5756

Merged: 5 commits merged into kubernetes:master on Feb 13, 2024

Conversation

@wllbo (Contributor) commented May 12, 2023

What type of PR is this?

/kind feature

What this PR does / why we need it:

This PR introduces a new flag, node-group-keep-backoff-out-of-resources, which gives users control over the Cluster Autoscaler's (CA) backoff strategy when a node group scale-up fails because the cloud provider reports an out-of-resources error.

With this flag enabled, CA respects the entire backoff period for a node group that has hit an out-of-resources error during scale-up. This differs from the current default behavior, which can silently and prematurely remove the backoff, leading CA to repeatedly attempt to scale up an exhausted node group. That can significantly delay scale-up, especially for larger scale-ups. The change is useful for clusters whose node groups use scarce instance types or whose cloud provider takes a long time to provision additional capacity.

By configuring node-group-keep-backoff-out-of-resources in tandem with initial-node-group-backoff-duration and max-node-group-backoff-duration, users can tune the scale-up strategy to minimize retries and delays, letting CA try other node groups while the cloud provider provisions additional capacity.
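
For illustration, the proposed flag could be set alongside the existing backoff-duration flags as sketched below; the duration values are arbitrary examples, not recommendations. (As the review discussion later in this thread concludes, the flag was ultimately dropped in favor of simply removing the early backoff reset.)

--node-group-keep-backoff-out-of-resources=true
--initial-node-group-backoff-duration=5m
--max-node-group-backoff-duration=30m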

Example

priority-expander config:

priorities:
  50:
    - .*compute-1.*
  40:
    - .*compute-2.*
  30:
    - .*compute-3.*
  20:
    - .*compute-4.*
  10:
    - .*compute-5.*

CA attempted ASG scale-ups in this order: compute-1 -> compute-2 -> compute-3 -> compute-1 -> compute-2. ASGs compute-4 and compute-5 were never attempted before CA cycled back through the out-of-capacity ASGs. The second attempts on compute-1 and compute-2 were made prematurely because their backoffs were removed as soon as the scale-up requests finished, rather than at expiration. With the proposed feature, CA would respect the full backoff duration for compute-1, compute-2 and compute-3, potentially allowing scale-up attempts on compute-4 and compute-5.

I0511 21:28:03.431934       1 priority.go:163] priority expander: compute-1-AutoScalingGroup chosen as the highest available
W0511 21:29:05.806767       1 auto_scaling_groups.go:453] Instance group compute-1-AutoScalingGroup cannot provision any more nodes!
I0511 21:29:05.854410       1 clusterstate.go:1084] Failed adding 68 nodes (68 unseen previously) to group compute-1-AutoScalingGroup due to OutOfResource.placeholder-cannot-be-fulfilled; errorMessages=[]string{"AWS cannot provision any more instances for this node group"}
W0511 21:29:05.854451       1 clusterstate.go:294] Disabling scale-up for node group compute-1-AutoScalingGroup until 2023-05-11 21:34:04.970892339 +0000 UTC m=+448.035594703; errorClass=OutOfResource; errorCode=placeholder-cannot-be-fulfilled


W0511 21:29:30.133720       1 orchestrator.go:511] Node group compute-1-AutoScalingGroup is not ready for scaleup - backoff
I0511 21:29:30.875841       1 priority.go:163] priority expander: compute-2-AutoScalingGroup chosen as the highest available
W0511 21:30:33.170140       1 auto_scaling_groups.go:453] Instance group compute-2-AutoScalingGroup cannot provision any more nodes!
I0511 21:30:33.215512       1 clusterstate.go:1084] Failed adding 60 nodes (60 unseen previously) to group compute-2-AutoScalingGroup due to OutOfResource.placeholder-cannot-be-fulfilled; errorMessages=[]string{"AWS cannot provision any more instances for this node group"}
W0511 21:30:33.215547       1 clusterstate.go:294] Disabling scale-up for node group compute-2-AutoScalingGroup until 2023-05-11 21:35:32.568995506 +0000 UTC m=+535.633697859; errorClass=OutOfResource; errorCode=placeholder-cannot-be-fulfilled


W0511 21:30:54.139919       1 orchestrator.go:511] Node group compute-2-AutoScalingGroup is not ready for scaleup - backoff
W0511 21:30:54.469904       1 orchestrator.go:511] Node group compute-1-AutoScalingGroup is not ready for scaleup - backoff
I0511 21:30:55.164562       1 priority.go:163] priority expander: compute-3-AutoScalingGroup chosen as the highest available
# Removing compute-1-AutoScalingGroup backoff before expiration at 21:34:04.970892339
I0511 21:31:36.935048       1 clusterstate.go:265] Scale up in group compute-1-AutoScalingGroup finished successfully in 3m23.087469244s
W0511 21:31:58.677023       1 auto_scaling_groups.go:453] Instance group compute-3-AutoScalingGroup cannot provision any more nodes!
I0511 21:31:58.724134       1 clusterstate.go:1084] Failed adding 118 nodes (118 unseen previously) to group compute-3-AutoScalingGroup due to OutOfResource.placeholder-cannot-be-fulfilled; errorMessages=[]string{"AWS cannot provision any more instances for this node group"}
W0511 21:31:58.724163       1 clusterstate.go:294] Disabling scale-up for node group compute-3-AutoScalingGroup until 2023-05-11 21:36:57.970873708 +0000 UTC m=+621.035576071; errorClass=OutOfResource; errorCode=placeholder-cannot-be-fulfilled

W0511 21:32:29.184151       1 orchestrator.go:511] Node group compute-3-AutoScalingGroup is not ready for scaleup - backoff
W0511 21:32:29.635896       1 orchestrator.go:511] Node group compute-2-AutoScalingGroup is not ready for scaleup - backoff
I0511 21:32:30.634765       1 priority.go:163] priority expander: compute-1-AutoScalingGroup chosen as the highest available
# Removing compute-2-AutoScalingGroup backoff before expiration at 21:35:32.568995506
I0511 21:33:23.049938       1 clusterstate.go:265] Scale up in group compute-2-AutoScalingGroup finished successfully in 3m51.568357773s
W0511 21:33:33.928385       1 auto_scaling_groups.go:453] Instance group compute-1-AutoScalingGroup cannot provision any more nodes!
I0511 21:33:33.979834       1 clusterstate.go:1084] Failed adding 40 nodes (40 unseen previously) to group compute-1-AutoScalingGroup due to OutOfResource.placeholder-cannot-be-fulfilled; errorMessages=[]string{"AWS cannot provision any more instances for this node group"}
W0511 21:33:33.979877       1 clusterstate.go:294] Disabling scale-up for node group compute-1-AutoScalingGroup until 2023-05-11 21:38:33.206839958 +0000 UTC m=+716.271542331; errorClass=OutOfResource; errorCode=placeholder-cannot-be-fulfilled


W0511 21:33:50.765746       1 orchestrator.go:511] Node group compute-3-AutoScalingGroup is not ready for scaleup - backoff
W0511 21:33:51.696754       1 orchestrator.go:511] Node group compute-1-AutoScalingGroup is not ready for scaleup - backoff
I0511 21:33:52.256395       1 priority.go:163] priority expander: compute-2-AutoScalingGroup chosen as the highest available
# Removing compute-3-AutoScalingGroup backoff before expiration at 21:36:57.970873708
I0511 21:34:02.528498       1 clusterstate.go:265] Scale up in group compute-3-AutoScalingGroup finished successfully in 2m56.526591463s
W0511 21:34:54.470085       1 auto_scaling_groups.go:453] Instance group compute-2-AutoScalingGroup cannot provision any more nodes!
I0511 21:34:54.520000       1 clusterstate.go:1084] Failed adding 47 nodes (47 unseen previously) to group compute-2-AutoScalingGroup due to OutOfResource.placeholder-cannot-be-fulfilled; errorMessages=[]string{"AWS cannot provision any more instances for this node group"}
W0511 21:34:54.520033       1 clusterstate.go:294] Disabling scale-up for node group compute-2-AutoScalingGroup until 2023-05-11 21:39:53.877496735 +0000 UTC m=+796.942199088; errorClass=OutOfResource; errorCode=placeholder-cannot-be-fulfilled

Which issue(s) this PR fixes:

Fixes #2730

Special notes for your reviewer:

  • All existing unit tests pass
  • Tested on AWS EKS cluster

Does this PR introduce a user-facing change?

Add flag `node-group-keep-backoff-out-of-resources` which instructs CA to respect the entire duration of the backoff period for a node group that has hit an out-of-resources error during scale up.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot added the do-not-merge/work-in-progress and kind/feature labels on May 12, 2023
@linux-foundation-easycla bot commented May 12, 2023

CLA Signed

The committers listed above are authorized under a signed CLA.

@k8s-ci-robot added the cncf-cla: no label on May 12, 2023
@k8s-ci-robot (Contributor):

Welcome @wllbo!

It looks like this is your first PR to kubernetes/autoscaler 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/autoscaler has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot added the size/M label on May 12, 2023
@Shubham82 (Contributor):

Hi @wllbo, please sign the CLA so the PR can be reviewed.
See the following document for instructions: Signing Contributor License Agreements (CLA)

@k8s-ci-robot added the cncf-cla: yes label and removed the cncf-cla: no label on May 13, 2023
@wllbo marked this pull request as ready for review on May 13, 2023 02:03
@k8s-ci-robot removed the do-not-merge/work-in-progress label on May 13, 2023
@k8s-ci-robot requested a review from x13n on May 13, 2023 02:04
@k8s-ci-robot added the needs-rebase label on Jul 6, 2023
@gyanesh-mishra:

@wllbo Excellent job on this! I'd really like to see this merged into main so I can use the feature.

@towca (Collaborator) commented Sep 1, 2023

/assign @towca

@towca (Collaborator) commented Sep 4, 2023

I fully get the motivation for this change, but IMO the implementation goes in the wrong direction in terms of readability. Unfortunately CSR (the ClusterStateRegistry) is both very convoluted and critical to CA's core functionality - we should be extra careful not to make it less readable than it already is.

The current behavior is IMO surprising while reading the code (I also don't fully get why we need it in the first place, but that's another story). The initial impression when reading the code is that if any node for a given scale-up fails to come up, the node group is backed off. You have to parse a lot of logic to get to the "oh, so a partially failed scale-up actually resets the backoff very quickly" realization. This change would add another mental step on top of that - "oh, unless it's a resource-related error, in which case the backoff is kept".

WDYT about something like below instead? I think it'd increase the readability slightly, while still achieving the same result.

  1. Attach the encountered errors to ScaleUpRequest instead of the backoff. This has the added benefit of not changing the Backoff interface just for this one purpose.
  2. In the part after the scale-up finishes, rename things so that it's not "keeping backoff sometimes", but rather "removing the backoff early sometimes, based on the encountered errors (if there haven't been any likely-to-be-persistent (e.g. OutOfResources) errors encountered)".
  3. I'd also try to explain what's happening in that part in the first place - e.g. a comment like "A node group can be backed off in the middle of an ongoing scale-up because of instance creation errors. If the errors are not likely to be persistent, and the last instance we see comes up successfully, the backoff is removed quickly. This allows CA to retry scaling up the failed instances in the same node group. If we have pods targeting only a particular node group, and the instance creation errors are temporary, this minimizes the time the pods stay pending - at the cost of potentially making one more scale-up that fails.".

Given CSR's nature explained above, this change should also be well-covered with unit tests.
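
A rough Go sketch of this suggested direction follows, for illustration only: the ErrorClasses field and removeBackoffEarly helper are hypothetical names, and the struct is heavily simplified compared with the real ScaleUpRequest in clusterstate.

package clusterstate

import "time"

// ScaleUpRequest is heavily simplified here; the real struct carries more fields.
type ScaleUpRequest struct {
    NodeGroupName   string
    ExpectedAddTime time.Time
    // ErrorClasses records the classes of errors (e.g. "OutOfResource")
    // observed while this scale-up was in flight (hypothetical field).
    ErrorClasses []string
}

// removeBackoffEarly reports whether a finished scale-up should clear the node
// group's backoff before it expires: only when no likely-to-be-persistent
// errors (such as OutOfResource) were observed (hypothetical helper).
func removeBackoffEarly(r *ScaleUpRequest) bool {
    for _, class := range r.ErrorClasses {
        if class == "OutOfResource" {
            return false
        }
    }
    return true
}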

@wllbo force-pushed the keep-backoff-out-of-resources branch from 70e1991 to e85eeae on October 18, 2023 17:59
@k8s-ci-robot added the size/L label and removed the size/M label on Oct 18, 2023
@wllbo (Contributor, Author) commented Oct 18, 2023

@towca thank you for the detailed feedback on the PR. Based on your suggestions, I've made the following changes:

  • attached the encountered errors to ScaleUpRequest instead of modifying the backoff itself.
  • refactored logic to emphasize the early removal of a backoff based on the encountered errors, instead of sometimes "keeping" it.
  • added a comment explaining the rationale for the behavior, based on the example you provided. This should provide clearer context to future readers of the code.
  • added additional unit tests to cover the new behavior and changes introduced.

Please let me know if there are any further changes or clarifications I should make.

@k8s-ci-robot removed the needs-rebase label on Oct 18, 2023
@k8s-ci-robot added the needs-rebase label on Nov 23, 2023
@Shubham82 (Contributor) commented Nov 24, 2023

@wllbo Please rebase the PR so that the merge conflicts can be resolved.

@kmsarabu commented Feb 5, 2024

@wllbo Can you please rebase the PR to resolve the merge conflicts? Thanks in advance.

@towca (Collaborator) commented Feb 9, 2024

Hey @wllbo, sorry for the delay, this PR fell off my radar completely :(

I've discussed this with the author of the original code, and it seems that the current behavior (clearing the backoff whenever a scale-up ends) doesn't make sense anymore. Back in the early days of CA, when that code was written, scale-ups were done one node at a time and cloud providers didn't offer many different node configurations. Clearing the backoff early was the safe choice - there was likely no other node group to fall back to anyway, so the only cost was potentially one more scale-up that would fail.

Now many more node configurations are available and CA has to fall back between them, so the cost of clearing the backoff early is a less effective fallback. There can also be multiple nodes in a single scale-up, and we'll almost always see the errors from the VMs that failed before we see the remaining VMs come up. So we end up always cutting the backoff short on a partial scale-up failure, which wasn't the original intention at all.

All this to say: neither I nor the original author sees any clear advantage to resetting the backoff early anymore. Could you just remove that part from updateScaleRequests, instead of making it configurable? This also has the benefit of not adding another flag/config option, of which CA already has way too many. Sorry for wasting your time with the previous suggestion.
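
A paraphrased sketch of what the requested change amounts to is shown below. The types are minimal stand-ins and the function body is reconstructed from the log messages quoted earlier in this thread, not copied from clusterstate.go.

package clusterstate

import (
    "time"

    "k8s.io/klog/v2"
)

// Minimal stand-ins for the real types; the actual ClusterStateRegistry and
// ScaleUpRequest carry many more fields.
type scaleUpRequest struct {
    increase int       // nodes still expected to come up
    started  time.Time // when the scale-up was requested
}

type clusterStateRegistry struct {
    scaleUpRequests map[string]*scaleUpRequest
}

// updateScaleRequests, paraphrased: once a scale-up has finished, its request
// is dropped. Before this PR the node group's backoff was also removed at this
// point, which cut OutOfResource backoffs short; the fix is to delete that
// call and let the backoff expire on its own schedule.
func (csr *clusterStateRegistry) updateScaleRequests(now time.Time) {
    for id, sur := range csr.scaleUpRequests {
        if sur.increase <= 0 {
            delete(csr.scaleUpRequests, id)
            // csr.backoff.RemoveBackoff(...)  // the call removed by this PR
            klog.V(4).Infof("Scale up in group %v finished successfully in %v", id, now.Sub(sur.started))
        }
    }
}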

@wllbo force-pushed the keep-backoff-out-of-resources branch from e85eeae to 4477707 on February 13, 2024 15:21
@k8s-ci-robot added the size/S label and removed the needs-rebase and size/L labels on Feb 13, 2024
@wllbo (Contributor, Author) commented Feb 13, 2024

Hey @towca, thank you for getting back to me on this, and no worries about the delay.
I completely agree with the benefits of just removing that one line instead. I was hesitant to remove it outright because of its longstanding presence in the code, but your discussion with the original author and the explanation you provided have addressed that concern; thanks for doing that. I've reverted my previous changes, removed the RemoveBackoff line from updateScaleRequests, and updated the test case to reflect the change in logic.

@towca (Collaborator) commented Feb 13, 2024

Thank you!
/lgtm
/approve

@k8s-ci-robot added the lgtm label on Feb 13, 2024
@k8s-ci-robot (Contributor):

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: towca, wllbo


@k8s-ci-robot added the approved label on Feb 13, 2024
@k8s-ci-robot merged commit 7031519 into kubernetes:master on Feb 13, 2024 (6 checks passed)
Successfully merging this pull request may close these issues.

Cluster autoscaler: Backoff is not persisted after partial scale up failure