Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[core][autoscaler] Fuse scaling requests together to avoid overloading the Kubernetes API server #49150

Merged
merged 3 commits into from
Dec 9, 2024

Conversation

kevin85421
Copy link
Member

@kevin85421 kevin85421 commented Dec 8, 2024

Why are these changes needed?

Without this PR, the Ray Autoscaler sends a patch request to the Kubernetes (K8s) API server to scale up or down a Ray Pod. That is, if the Ray Autoscaler plans to scale up 10 Pods, 10 patch requests will be sent to the Kubernetes (K8s) API server. This is highly likely to overload the K8s API server when there are multiple Ray clusters within a single K8s cluster.

This PR fuses the requests together to avoid overloading the K8s API server.

Related issue number

Checks

  • Create a Autoscaler V2 RayCluster CR.
    • head Pod: num-cpus: 0
    • worker Pod: Each worker Pod has 1 CPU, and the maxReplicas of the worker group is 10.
  • Run the following script in the head Pod: https://gist.github.com/kevin85421/6f09368ba48572e28f53654dca854b57
  • Results
    • Without this PR, Ray Autoscaler submits 9 patch requests to the K8s API server (from 1 worker Pod -> 10 worker Pods).
      Screenshot 2024-12-07 at 11 29 17 AM
    • With this PR, Ray Autoscaler submits 1 patch request to the K8s API server to scale up 9 worker Pods.
      Screenshot 2024-12-07 at 4 45 10 PM
  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: kaihsun <kaihsun@anyscale.com>
@kevin85421 kevin85421 changed the title [core][autoscaler] Fuse scaling requests together to avoid overloading the Kubernetes API server. [core][autoscaler] Fuse scaling requests together to avoid overloading the Kubernetes API server Dec 8, 2024
Signed-off-by: kaihsun <kaihsun@anyscale.com>
Signed-off-by: kaihsun <kaihsun@anyscale.com>
@@ -344,6 +344,9 @@ def create_node_with_resources_and_labels(
def _create_node_with_resources_and_labels(
self, node_config, tags, count, resources, labels
):
# This function calls `pop`. To avoid side effects, we make a
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without this, the first node in the scaling request will be created correctly. However, the following nodes will be created without CPU/GPU because their resources have already been popped from resources.

@kevin85421 kevin85421 marked this pull request as ready for review December 8, 2024 23:27
@kevin85421 kevin85421 requested review from hongchaodeng and a team as code owners December 8, 2024 23:27
@kevin85421 kevin85421 added the go add ONLY when ready to merge, run all tests label Dec 9, 2024
Copy link
Contributor

@rickyyx rickyyx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great catch!

@rickyyx rickyyx merged commit 7b476ea into ray-project:master Dec 9, 2024
6 checks passed
@kevin85421 kevin85421 assigned kevin85421 and unassigned jjyao and rickyyx Dec 9, 2024
ujjawal-khare pushed a commit to ujjawal-khare-27/ray that referenced this pull request Dec 17, 2024
…g the Kubernetes API server (ray-project#49150)

<!-- Thank you for your contribution! Please review
https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before
opening a pull request. -->

<!-- Please add a reviewer to the assignee section when you create a PR.
If you don't have the access to it, we will shortly find a reviewer and
assign them to your PR. -->

## Why are these changes needed?

Without this PR, the Ray Autoscaler sends a patch request to the
Kubernetes (K8s) API server to scale up or down a Ray Pod. That is, if
the Ray Autoscaler plans to scale up 10 Pods, 10 patch requests will be
sent to the Kubernetes (K8s) API server. This is highly likely to
overload the K8s API server when there are multiple Ray clusters within
a single K8s cluster.

This PR fuses the requests together to avoid overloading the K8s API
server.

## Related issue number

<!-- For example: "Closes ray-project#1234" -->

## Checks
* Create a Autoscaler V2 RayCluster CR.
  * head Pod: `num-cpus: 0`
* worker Pod: Each worker Pod has 1 CPU, and the `maxReplicas` of the
worker group is 10.
* Run the following script in the head Pod:
https://gist.github.com/kevin85421/6f09368ba48572e28f53654dca854b57
* Results
* Without this PR, Ray Autoscaler submits 9 patch requests to the K8s
API server (from 1 worker Pod -> 10 worker Pods).
<img width="1440" alt="Screenshot 2024-12-07 at 11 29 17 AM"
src="https://github.com/user-attachments/assets/b1757a8c-85df-4d76-a920-c8a81e5b92b2">
* With this PR, Ray Autoscaler submits 1 patch request to the K8s API
server to scale up 9 worker Pods.
<img width="1440" alt="Screenshot 2024-12-07 at 4 45 10 PM"
src="https://github.com/user-attachments/assets/7a42fa56-4671-4b39-bb83-03b0a9a25ec0">

- [ ] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: kaihsun <kaihsun@anyscale.com>
Signed-off-by: ujjawal-khare <ujjawal.khare@dream11.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
go add ONLY when ready to merge, run all tests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants