[core][autoscaler] Fuse scaling requests together to avoid overloading the Kubernetes API server (ray-project#49150)


## Why are these changes needed?

Without this PR, the Ray Autoscaler sends one patch request to the Kubernetes (K8s) API server for every Ray Pod it scales up or down. That is, if the Autoscaler plans to add 10 Pods, it sends 10 patch requests to the K8s API server. This is highly likely to overload the API server when multiple Ray clusters run within a single K8s cluster.

This PR fuses the scaling requests together so that a whole batch of Pods is covered by a single patch request, avoiding overload of the K8s API server.
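
To make the difference concrete, here is a minimal sketch (not the actual autoscaler code) of per-Pod versus fused patching of a RayCluster custom resource. The resource coordinates, the `scale_worker_group` helper, and the simplified patch body are assumptions, and CRD list-merge semantics are glossed over; the point is only the request count.

```python
from kubernetes import client, config

config.load_incluster_config()  # assumes this runs inside the cluster
api = client.CustomObjectsApi()

def scale_worker_group(replicas: int) -> None:
    # One patch request to the K8s API server setting the worker group's
    # replica count. (Simplified body; real CRD patches need care with lists.)
    api.patch_namespaced_custom_object(
        group="ray.io",
        version="v1",
        namespace="default",
        plural="rayclusters",
        name="raycluster-sample",
        body={"spec": {"workerGroupSpecs": [{"replicas": replicas}]}},
    )

# Without this PR (conceptually): one request per Pod, i.e. 9 requests
# to go from 1 to 10 workers.
#   for replicas in range(2, 11):
#       scale_worker_group(replicas)

# With this PR (conceptually): a single fused request.
scale_worker_group(10)
```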


## Checks
* Create an Autoscaler V2 RayCluster CR.
  * Head Pod: `num-cpus: 0`
  * Worker Pods: each worker Pod has 1 CPU, and the `maxReplicas` of the worker group is 10.
* Run the following script in the head Pod (a minimal equivalent is sketched after the results below):
  https://gist.github.com/kevin85421/6f09368ba48572e28f53654dca854b57
* Results
  * Without this PR, the Ray Autoscaler submits 9 patch requests to the K8s API server (1 worker Pod -> 10 worker Pods).
    <img width="1440" alt="Screenshot 2024-12-07 at 11 29 17 AM" src="https://github.com/user-attachments/assets/b1757a8c-85df-4d76-a920-c8a81e5b92b2">
  * With this PR, the Ray Autoscaler submits 1 patch request to the K8s API server to scale up 9 worker Pods.
    <img width="1440" alt="Screenshot 2024-12-07 at 4 45 10 PM" src="https://github.com/user-attachments/assets/7a42fa56-4671-4b39-bb83-03b0a9a25ec0">
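
The gist above is not reproduced here. A minimal script with the same effect might look like the following; this is an assumption about what the test does, using the public `ray.autoscaler.sdk.request_resources` API to ask for enough CPUs to force the worker group from 1 Pod to its `maxReplicas` of 10.

```python
import ray
from ray.autoscaler.sdk import request_resources

ray.init(address="auto")  # connect to the running cluster from the head Pod

# Ask the autoscaler for 10 CPUs at once. With 1-CPU worker Pods and a
# 0-CPU head Pod, satisfying this demand requires 10 worker Pods, so the
# autoscaler must scale the worker group from 1 to 10.
request_resources(num_cpus=10)
```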

- [ ] I've signed off every commit (by using the `-s` flag, i.e., `git commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: kaihsun <kaihsun@anyscale.com>
Signed-off-by: ujjawal-khare <ujjawal.khare@dream11.com>
kevin85421 authored and ujjawal-khare committed Dec 17, 2024
1 parent 7a8651f commit 910a074
Showing 2 changed files with 5 additions and 1 deletion.
First changed file (3 additions):

```diff
@@ -344,6 +344,9 @@ def create_node_with_resources_and_labels(
     def _create_node_with_resources_and_labels(
         self, node_config, tags, count, resources, labels
     ):
+        # This function calls `pop`. To avoid side effects, we make a
+        # copy of `resources`.
+        resources = copy.deepcopy(resources)
         with self.lock:
             node_type = tags[TAG_RAY_USER_NODE_TYPE]
             next_id = self._next_hex_node_id()
```
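
The deep copy matters because `dict.pop` mutates its receiver, and the same `resources` dict may be shared across the instances of a fused batch. A tiny standalone illustration (names hypothetical, not from the changed file):

```python
import copy

def launch_without_copy(resources):
    # `pop` mutates the caller's dict: later launches no longer see "GPU".
    resources.pop("GPU", None)
    return resources

def launch_with_copy(resources):
    resources = copy.deepcopy(resources)  # side effects stay local
    resources.pop("GPU", None)
    return resources

shared = {"CPU": 1, "GPU": 0}
launch_without_copy(shared)
assert "GPU" not in shared   # caller's dict was corrupted

shared = {"CPU": 1, "GPU": 0}
launch_with_copy(shared)
assert "GPU" in shared       # caller's dict is intact
```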
python/ray/autoscaler/v2/instance_manager/reconciler.py (2 additions & 1 deletion):

```diff
@@ -768,12 +768,13 @@ def _handle_instances_launch(
         # Transition the instances to REQUESTED for instance launcher to
         # launch them.
         updates = {}
+        new_launch_request_id = str(uuid.uuid4())
         for instance_type, instances in to_launch.items():
             for instance in instances:
                 # Reuse launch request id for any QUEUED instances that have been
                 # requested before due to retry.
                 launch_request_id = (
-                    str(uuid.uuid4())
+                    new_launch_request_id
                     if len(instance.launch_request_id) == 0
                     else instance.launch_request_id
                 )
```
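
The effect of this one-line change is that every instance queued in the same reconciler pass now shares a single launch request id, so the downstream launcher can group them into one request instead of one per instance. A minimal sketch of that grouping (the `Instance` shape and `launch_batch` helper are hypothetical, not the actual instance-manager API):

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Instance:
    instance_type: str
    launch_request_id: str

def launch_batch(request_id: str, instance_type: str, count: int) -> None:
    # Hypothetical stand-in for the node-provider call that would, e.g.,
    # patch the RayCluster CR once to add `count` Pods of `instance_type`.
    print(f"[{request_id}] launch {count} x {instance_type}")

def fuse_launches(instances: list[Instance]) -> None:
    # Group by (request id, type): instances sharing an id form one batch.
    batches: dict[tuple[str, str], int] = defaultdict(int)
    for inst in instances:
        batches[(inst.launch_request_id, inst.instance_type)] += 1
    for (request_id, instance_type), count in batches.items():
        launch_batch(request_id, instance_type, count)  # one request per batch

# With the PR, all 9 queued workers share one id -> a single fused request.
fuse_launches([Instance("worker-1cpu", "req-A") for _ in range(9)])
```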
