[core][autoscaler] Fuse scaling requests together to avoid overloading the Kubernetes API server (ray-project#49150)


## Why are these changes needed?

Without this PR, the Ray Autoscaler sends one patch request to the Kubernetes (K8s) API server for every Ray Pod it scales up or down. That is, if the Autoscaler plans to add 10 Pods, it sends 10 patch requests to the K8s API server. This is highly likely to overload the API server when multiple Ray clusters run within a single K8s cluster.

This PR fuses the scaling requests together so that a whole batch of Pods is covered by a single patch request, avoiding overload of the K8s API server.
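
To make the difference concrete, here is a minimal sketch (not the actual autoscaler code) of per-Pod versus fused patching of a RayCluster custom resource. The resource coordinates, the `scale_worker_group` helper, and the simplified patch body are assumptions, and CRD list-merge semantics are glossed over; the point is only the request count.

```python
from kubernetes import client, config

config.load_incluster_config()  # assumes this runs inside the cluster
api = client.CustomObjectsApi()

def scale_worker_group(replicas: int) -> None:
    # One patch request to the K8s API server setting the worker group's
    # replica count. (Simplified body; real CRD patches need care with lists.)
    api.patch_namespaced_custom_object(
        group="ray.io",
        version="v1",
        namespace="default",
        plural="rayclusters",
        name="raycluster-sample",
        body={"spec": {"workerGroupSpecs": [{"replicas": replicas}]}},
    )

# Without this PR (conceptually): one request per Pod, i.e. 9 requests
# to go from 1 to 10 workers.
#   for replicas in range(2, 11):
#       scale_worker_group(replicas)

# With this PR (conceptually): a single fused request.
scale_worker_group(10)
```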


## Checks
* Create an Autoscaler V2 RayCluster CR.
  * Head Pod: `num-cpus: 0`
  * Worker Pods: each worker Pod has 1 CPU, and the `maxReplicas` of the worker group is 10.
* Run the following script in the head Pod (a minimal equivalent is sketched after the results below):
  https://gist.github.com/kevin85421/6f09368ba48572e28f53654dca854b57
* Results
  * Without this PR, the Ray Autoscaler submits 9 patch requests to the K8s API server (1 worker Pod -> 10 worker Pods).
    <img width="1440" alt="Screenshot 2024-12-07 at 11 29 17 AM" src="https://github.com/user-attachments/assets/b1757a8c-85df-4d76-a920-c8a81e5b92b2">
  * With this PR, the Ray Autoscaler submits 1 patch request to the K8s API server to scale up 9 worker Pods.
    <img width="1440" alt="Screenshot 2024-12-07 at 4 45 10 PM" src="https://github.com/user-attachments/assets/7a42fa56-4671-4b39-bb83-03b0a9a25ec0">
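
The gist above is not reproduced here. A minimal script with the same effect might look like the following; this is an assumption about what the test does, using the public `ray.autoscaler.sdk.request_resources` API to ask for enough CPUs to force the worker group from 1 Pod to its `maxReplicas` of 10.

```python
import ray
from ray.autoscaler.sdk import request_resources

ray.init(address="auto")  # connect to the running cluster from the head Pod

# Ask the autoscaler for 10 CPUs at once. With 1-CPU worker Pods and a
# 0-CPU head Pod, satisfying this demand requires 10 worker Pods, so the
# autoscaler must scale the worker group from 1 to 10.
request_resources(num_cpus=10)
```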

- [ ] I've signed off every commit (by using the `-s` flag, i.e., `git commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: kaihsun <kaihsun@anyscale.com>
Signed-off-by: ujjawal-khare <ujjawal.khare@dream11.com>
kevin85421 authored and ujjawal-khare committed Dec 17, 2024
1 parent 7a8651f commit 910a074
Showing 2 changed files with 5 additions and 1 deletion.
First changed file (3 additions):

```diff
@@ -344,6 +344,9 @@ def create_node_with_resources_and_labels(
     def _create_node_with_resources_and_labels(
         self, node_config, tags, count, resources, labels
     ):
+        # This function calls `pop`. To avoid side effects, we make a
+        # copy of `resources`.
+        resources = copy.deepcopy(resources)
         with self.lock:
             node_type = tags[TAG_RAY_USER_NODE_TYPE]
             next_id = self._next_hex_node_id()
```
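
The deep copy matters because `dict.pop` mutates its receiver, and the same `resources` dict may be shared across the instances of a fused batch. A tiny standalone illustration (names hypothetical, not from the changed file):

```python
import copy

def launch_without_copy(resources):
    # `pop` mutates the caller's dict: later launches no longer see "GPU".
    resources.pop("GPU", None)
    return resources

def launch_with_copy(resources):
    resources = copy.deepcopy(resources)  # side effects stay local
    resources.pop("GPU", None)
    return resources

shared = {"CPU": 1, "GPU": 0}
launch_without_copy(shared)
assert "GPU" not in shared   # caller's dict was corrupted

shared = {"CPU": 1, "GPU": 0}
launch_with_copy(shared)
assert "GPU" in shared       # caller's dict is intact
```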
python/ray/autoscaler/v2/instance_manager/reconciler.py (2 additions & 1 deletion):

```diff
@@ -768,12 +768,13 @@ def _handle_instances_launch(
         # Transition the instances to REQUESTED for instance launcher to
         # launch them.
         updates = {}
+        new_launch_request_id = str(uuid.uuid4())
         for instance_type, instances in to_launch.items():
             for instance in instances:
                 # Reuse launch request id for any QUEUED instances that have been
                 # requested before due to retry.
                 launch_request_id = (
-                    str(uuid.uuid4())
+                    new_launch_request_id
                     if len(instance.launch_request_id) == 0
                     else instance.launch_request_id
                 )
```
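
The effect of this one-line change is that every instance queued in the same reconciler pass now shares a single launch request id, so the downstream launcher can group them into one request instead of one per instance. A minimal sketch of that grouping (the `Instance` shape and `launch_batch` helper are hypothetical, not the actual instance-manager API):

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Instance:
    instance_type: str
    launch_request_id: str

def launch_batch(request_id: str, instance_type: str, count: int) -> None:
    # Hypothetical stand-in for the node-provider call that would, e.g.,
    # patch the RayCluster CR once to add `count` Pods of `instance_type`.
    print(f"[{request_id}] launch {count} x {instance_type}")

def fuse_launches(instances: list[Instance]) -> None:
    # Group by (request id, type): instances sharing an id form one batch.
    batches: dict[tuple[str, str], int] = defaultdict(int)
    for inst in instances:
        batches[(inst.launch_request_id, inst.instance_type)] += 1
    for (request_id, instance_type), count in batches.items():
        launch_batch(request_id, instance_type, count)  # one request per batch

# With the PR, all 9 queued workers share one id -> a single fused request.
fuse_launches([Instance("worker-1cpu", "req-A") for _ in range(9)])
```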
