[core][autoscaler] Autoscaler v2 does not honor minReplicas/replicas count of the worker nodes and constantly terminates after idletimeout #48623

kevin85421 · 2024-11-07T09:45:40Z

Why are these changes needed?

Currently, Autoscaler V2 deletes idle nodes without considering min_worker_nodes. This PR skips terminating an idle node when the current number of nodes of its type is less than or equal to min_worker_nodes.
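
The guard described above can be sketched roughly as follows. This is a minimal, standalone illustration, not the actual code in python/ray/autoscaler/v2/scheduler.py: the `Node` record, the `select_idle_nodes_to_terminate` function, and its parameters are hypothetical names chosen for the example. Only the idea of comparing the remaining node count per type against the minimum comes from this PR.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Node:
    # Hypothetical, simplified node record; not the scheduler's real type.
    node_id: str
    node_type: str
    idle_seconds: float


def select_idle_nodes_to_terminate(
    nodes: List[Node],
    min_worker_nodes: Dict[str, int],
    idle_timeout_s: float,
) -> List[Node]:
    """Pick idle nodes to terminate without dropping any node type
    below its configured minimum (hypothetical helper)."""
    # Number of nodes currently running, per node type.
    count_by_type = Counter(n.node_type for n in nodes)
    # Number of nodes already selected for termination, per node type.
    terminate_by_type: Counter = Counter()
    to_terminate: List[Node] = []

    for node in nodes:
        if node.idle_seconds <= idle_timeout_s:
            continue  # Not idle for long enough.
        min_count = min_worker_nodes.get(node.node_type, 0)
        remaining = count_by_type[node.node_type] - terminate_by_type[node.node_type]
        if remaining <= min_count:
            continue  # Terminating this node would violate the minimum.
        terminate_by_type[node.node_type] += 1
        to_terminate.append(node)
    return to_terminate
```

With two idle workers and a minimum of 2, the sketch selects nothing for termination, which is the behavior the reproduction below expects after the fix.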

Reproduce

Install KubeRay v1.2.2.
Create a RayCluster with:
- Autoscaler V2 enabled
- minReplicas set to 2
- https://gist.github.com/kevin85421/1a167abb14b9ffa142359d825849567b
The 2 worker Pods are deleted and recreated every minute.

Related issue number

Change the image to one built with this PR.

Note that the screenshot shows the message at INFO level, but this PR logs it at DEBUG level.

Closes #47578

Checks

I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
I've run scripts/format.sh to lint the changes in this PR.
I've included any doc changes needed for https://docs.ray.io/en/master/.
- I've added any new APIs to the API Reference. For example, if I added a
  method in Tune, I've added it in doc/source/tune/api/ under the
  corresponding .rst file.
I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
Testing Strategy
- Unit tests
- Release tests
- This PR is not tested :(

Signed-off-by: kaihsun <kaihsun@anyscale.com>

rickyyx

Ah, good catch. Should have definitely tested this one.

rickyyx · 2024-11-07T23:31:04Z

python/ray/autoscaler/v2/scheduler.py

+                - terminate_nodes_by_type[node_type]
+                <= min_count
+            ):
+                continue


nit: let's add a debug log?

Updated. Btw, I have several questions:

1. I can't find any place in the scheduler that calls setLevel. How can I set the logger's level to DEBUG when I launch the autoscaler via the CLI?

2. How do I determine whether a message should be logged at INFO or DEBUG?

Thanks!

For 1, I think it uses Ray's configured logging, AFAIK, so the level should follow however the Ray logging level is configured.

For 2, it's fairly arbitrary and mostly a matter of style.

There seems to be no way to configure the log level when launching the autoscaler via the Ray CLI.

There seems to be no way to configure the log level when launching the autoscaler via the Ray CLI.

I will verify whether this is correct. If so, I will open a PR to make it configurable.
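
On the setLevel question in this thread: if the scheduler module gets its logger through Python's standard logging module, the level can in principle be raised per logger from Python code. The sketch below only illustrates that standard-library mechanism. The logger name `ray.autoscaler.v2.scheduler` is an assumption derived from the file path, the helper `enable_debug_logging` is hypothetical, and nothing here is a Ray CLI option.

```python
import logging

# Assumed logger name, derived from the module path
# python/ray/autoscaler/v2/scheduler.py; not confirmed in this thread.
SCHEDULER_LOGGER_NAME = "ray.autoscaler.v2.scheduler"


def enable_debug_logging(logger_name: str = SCHEDULER_LOGGER_NAME) -> logging.Logger:
    """Set the named logger's level to DEBUG (hypothetical helper)."""
    logger = logging.getLogger(logger_name)
    # Records at DEBUG and above now pass this logger's level check.
    # Whether they are actually emitted still depends on the handlers
    # attached to this logger or its ancestors.
    logger.setLevel(logging.DEBUG)
    return logger
```

This has to run in the same process as the autoscaler to have any effect, so it does not settle whether the level can be set from the CLI.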

Signed-off-by: kaihsun <kaihsun@anyscale.com>

…count of the worker nodes and constantly terminates after idletimeout (ray-project#48623) Signed-off-by: kaihsun <kaihsun@anyscale.com>

…count of the worker nodes and constantly terminates after idletimeout (ray-project#48623) Signed-off-by: kaihsun <kaihsun@anyscale.com> Signed-off-by: mohitjain2504 <mohit.jain@dream11.com>

fix

ef41792

Signed-off-by: kaihsun <kaihsun@anyscale.com>

improve test

952bc5b

Signed-off-by: kaihsun <kaihsun@anyscale.com>

add comments

dec5647

Signed-off-by: kaihsun <kaihsun@anyscale.com>

kevin85421 marked this pull request as ready for review November 7, 2024 23:01

kevin85421 requested review from hongchaodeng and a team as code owners November 7, 2024 23:01

kevin85421 assigned jjyao Nov 7, 2024

jjyao assigned rickyyx Nov 7, 2024

rickyyx approved these changes Nov 8, 2024


update comments

c96035c

Signed-off-by: kaihsun <kaihsun@anyscale.com>

kevin85421 added the "go" label (add ONLY when ready to merge, run all tests) Nov 8, 2024

jjyao merged commit 21308bc into ray-project:master Nov 11, 2024
6 checks passed

kevin85421 mentioned this pull request Dec 4, 2024

[Umbrella] Autoscaler improvements ray-project/kuberay#2600

Open

28 tasks

