-
Notifications
You must be signed in to change notification settings - Fork 5.9k
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
[core][autoscaler] Autoscaler doesn't scale up correctly when the Kub…
…eRay RayCluster is not in the goal state (#48909) <!-- Thank you for your contribution! Please review https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before opening a pull request. --> <!-- Please add a reviewer to the assignee section when you create a PR. If you don't have the access to it, we will shortly find a reviewer and assign them to your PR. --> ## Why are these changes needed? ### Issue * Create a Autoscaler V2 RayCluster CR. * head Pod: `num-cpus: 0` * worker Pod: Each worker Pod has 1 CPU, and the `maxReplicas` of the worker group is 10. * Run the following script in the head Pod: https://gist.github.com/kevin85421/6f09368ba48572e28f53654dca854b57 * There are 10 scale requests to add a new node. However, only some of them will be created (e.g., 5). ### Reason In the reproduction script above, the `cloud_instance_updater` will send a request to scale up one worker Pod 10 times because the `maxReplicas` of the worker group is set to 10. However, the construction of the scale_request depends on the Pods in the Kubernetes cluster. For example, * cluster state: RayCluster Replicas: 2, Ray Pods: 1 * 1st scale request: launch 1 node --> goal state: RayCluster Replicas: 2 (Ray Pods + 1) * 2nd scale request: launch 1 node --> goal state: RayCluster Replicas: 2 (Ray Pods + 1) --> **this should be 3!** The above example is expected to create 3 Pods. However, it will ultimately create only 2 Pods. ### Solution Use RayCluster CR instead of Ray Pods to build scale requests. ## Related issue number Closes #46473 ## Checks 10 worker Pods are created successfully. <img width="1373" alt="Screenshot 2024-11-24 at 2 11 39 AM" src="https://github.com/user-attachments/assets/c42c6cdd-3bf0-4aa9-a928-630c12ff5569"> - [ ] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [ ] I've run `scripts/format.sh` to lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [ ] Unit tests - [ ] Release tests - [ ] This PR is not tested :( --------- Signed-off-by: kaihsun <kaihsun@anyscale.com>
- Loading branch information
1 parent
0f2c62c
commit ed3d48c
Showing
4 changed files
with
152 additions
and
16 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters