[core][autoscaler] Autoscaler v2 does not honor minReplicas/replicas count of the worker nodes and constantly terminates after idletimeout #48623

Merged
merged 4 commits into from
Nov 11, 2024

kevin85421
Member

@kevin85421 kevin85421 commented Nov 7, 2024

Why are these changes needed?

Currently, Autoscaler V2 deletes idle nodes without considering min_worker_nodes. This PR skips termination if the current number of nodes of that type is less than or equal to min_worker_nodes.
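
The check described above can be sketched roughly as follows. This is a minimal, self-contained illustration, not the actual Ray scheduler code: the `Node` class, the `pick_idle_nodes_to_terminate` function, and its parameters are hypothetical names chosen for this example; only the `<= min_count` comparison mirrors the snippet reviewed in this PR.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Node:
    # Hypothetical stand-in for a scheduling node; not Ray's real class.
    node_id: str
    node_type: str
    idle_seconds: float


def pick_idle_nodes_to_terminate(
    nodes: List[Node],
    min_count_by_type: Dict[str, int],
    idle_timeout_s: float,
) -> List[Node]:
    """Return idle nodes that can be terminated without dropping any
    node type below its configured minimum count."""
    count_by_type: Dict[str, int] = defaultdict(int)
    for node in nodes:
        count_by_type[node.node_type] += 1

    terminate_by_type: Dict[str, int] = defaultdict(int)
    to_terminate: List[Node] = []
    for node in nodes:
        if node.idle_seconds <= idle_timeout_s:
            continue  # Not idle long enough.
        min_count = min_count_by_type.get(node.node_type, 0)
        remaining = count_by_type[node.node_type] - terminate_by_type[node.node_type]
        if remaining <= min_count:
            continue  # Terminating this node would violate the minimum.
        terminate_by_type[node.node_type] += 1
        to_terminate.append(node)
    return to_terminate


workers = [Node(f"n{i}", "worker", idle_seconds=100.0) for i in range(3)]
picked = pick_idle_nodes_to_terminate(workers, {"worker": 2}, idle_timeout_s=60.0)
print([n.node_id for n in picked])
```

With three idle workers and a minimum of two, only one node is selected; without the minimum check, all three would be.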

Reproduce

Related issue number

Change the image with this PR.

image

Note that the screenshot shows the message logged at INFO level, but this PR logs it at DEBUG.

image

Closes #47578

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: kaihsun <kaihsun@anyscale.com>
@kevin85421 kevin85421 changed the title [WIP][Cluster][Autoscaler-v2]-Autoscaler v2 does not honor minReplicas/replicas count of the worker nodes and constantly terminates after idletimeout [WIP][core][autoscaler-v2]-Autoscaler v2 does not honor minReplicas/replicas count of the worker nodes and constantly terminates after idletimeout Nov 7, 2024
Signed-off-by: kaihsun <kaihsun@anyscale.com>
@kevin85421 kevin85421 changed the title [WIP][core][autoscaler-v2]-Autoscaler v2 does not honor minReplicas/replicas count of the worker nodes and constantly terminates after idletimeout [core][autoscaler-v2] Autoscaler v2 does not honor minReplicas/replicas count of the worker nodes and constantly terminates after idletimeout Nov 7, 2024
Signed-off-by: kaihsun <kaihsun@anyscale.com>
@kevin85421 kevin85421 changed the title [core][autoscaler-v2] Autoscaler v2 does not honor minReplicas/replicas count of the worker nodes and constantly terminates after idletimeout [core][autoscaler] Autoscaler v2 does not honor minReplicas/replicas count of the worker nodes and constantly terminates after idletimeout Nov 7, 2024
@kevin85421 kevin85421 marked this pull request as ready for review November 7, 2024 23:01
@kevin85421 kevin85421 requested review from hongchaodeng and a team as code owners November 7, 2024 23:01
Contributor

@rickyyx rickyyx left a comment

Ah, good catch. Should have definitely tested this one.

- terminate_nodes_by_type[node_type]
<= min_count
):
continue
Contributor

nit: let's add a debug log?

Member Author

Updated. Btw, I have several questions:

  1. I can't find any place in the scheduler that calls setLevel. How can I set the logger's level to DEBUG when I launch the autoscaler via the CLI?

  2. How do I decide whether a message should be logged at INFO or DEBUG?

Thanks!

Contributor

For 1: AFAIK it uses Ray's configured logging, so the level should follow however the Ray logging level is configured.

For 2: it's fairly arbitrary and mostly a matter of style.
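
As a generic illustration of why the logger level matters for a DEBUG message like the one added here, the sketch below uses only Python's standard `logging` module. It says nothing about how Ray configures its own loggers; the logger name, the `capture` helper, and the messages are made up for this example.

```python
import io
import logging
from typing import List


def capture(level: int) -> List[str]:
    """Emit one DEBUG and one INFO record at the given logger level
    and return the lines that were actually written."""
    stream = io.StringIO()
    # Hypothetical logger name; one logger per level keeps runs independent.
    logger = logging.getLogger(f"example.autoscaler.{level}")
    logger.handlers.clear()
    logger.propagate = False  # Keep output out of the root logger.
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))
    logger.addHandler(handler)
    logger.setLevel(level)

    logger.debug("skipping termination: node type is at its minimum count")
    logger.info("terminating idle node")
    return stream.getvalue().splitlines()


print(capture(logging.INFO))
print(capture(logging.DEBUG))
```

At INFO level the DEBUG record is dropped, so the "skipping termination" message only appears once the logger (or whatever configures it) is set to DEBUG.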

Member Author

There seems to be no way to configure the log level when launching the autoscaler via the Ray CLI.

Member Author

> There seems to be no way to configure the log level when launching the autoscaler via the Ray CLI.

I will verify whether this is correct. If so, I will open a PR to make it configurable.

Signed-off-by: kaihsun <kaihsun@anyscale.com>
@kevin85421 kevin85421 added the go add ONLY when ready to merge, run all tests label Nov 8, 2024
@jjyao jjyao merged commit 21308bc into ray-project:master Nov 11, 2024
6 checks passed
JP-sDEV pushed a commit to JP-sDEV/ray that referenced this pull request Nov 14, 2024
…count of the worker nodes and constantly terminates after idletimeout (ray-project#48623)

Signed-off-by: kaihsun <kaihsun@anyscale.com>
mohitjain2504 pushed a commit to mohitjain2504/ray that referenced this pull request Nov 15, 2024
…count of the worker nodes and constantly terminates after idletimeout (ray-project#48623)

Signed-off-by: kaihsun <kaihsun@anyscale.com>
Signed-off-by: mohitjain2504 <mohit.jain@dream11.com>