[Serve] Separate the serve scheduling logic into its own class #36588
Conversation
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
The PR is ready for a first round of review. In the meantime, I'll fix and add more tests.
# Check if the model_id has changed.
running_replicas_changed |= self._multiplexed_model_ids_updated
self._check_and_update_replicas()
I flipped the order between _check_and_update_replicas and _scale_deployment_replicas since I think we should first refresh the status of existing replicas and then scale based on the latest status. For example, it's possible that _check_and_update_replicas stopped a replica so that _scale_deployment_replicas needs to start a new one.
That should be fine; I don't think this ordering is important. The only downside is that it may take an extra iteration for a deployment to initially go from starting -> running.
Yea, that's true if starting/initialization is super fast.
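For context, a minimal sketch of the ordering being discussed; the class and method bodies below are illustrative stand-ins, not the actual DeploymentState implementation:

```python
class DeploymentStateSketch:
    """Illustrative stand-in for a deployment's per-update-cycle logic."""

    def update(self):
        # Refresh the status of existing replicas first, so that a replica
        # that just died or finished stopping is reflected before we decide
        # how to scale.
        self._check_and_update_replicas()
        # Then scale based on the freshest view. If the refresh above stopped
        # a replica, this step can request a replacement in the same iteration
        # instead of waiting for the next control-loop pass.
        return self._scale_deployment_replicas()

    def _check_and_update_replicas(self):
        ...

    def _scale_deployment_replicas(self):
        ...
```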
self._deployment_states[deployment_name].stop_replicas(replicas_to_stop)

for deployment_name, deployment_state in self._deployment_states.items():
    if set(running_replica_infos_before_update[deployment_name]) != set(
We can just decide whether to notify by comparing before and after.
Approach looks good for a first cut
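A rough sketch of that before/after comparison; the helper names (get_running_replica_infos, notify_changed) are used here illustratively, not as the exact controller code:

```python
def update_and_notify(deployment_states, long_poll_host):
    """Sketch: only notify listeners when the running replica set changed."""
    # Snapshot running replicas before the update cycle.
    before = {
        name: set(state.get_running_replica_infos())
        for name, state in deployment_states.items()
    }

    for state in deployment_states.values():
        state.update()

    for name, state in deployment_states.items():
        after = set(state.get_running_replica_infos())
        # Compare the snapshot taken before the update cycle with the state
        # afterwards; skip the long-poll broadcast when nothing changed.
        if before[name] != after:
            long_poll_host.notify_changed(("running_replicas", name), list(after))
```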
upscales: Dict[str, List[ReplicaSchedulingRequest]],
downscales: Dict[str, DeploymentDownscaleRequest],
):
    """This is called for each update cycle to do batch scheduling.
Could you help me understand why we need to have self._pending_replicas? For the spread policy, is it always guaranteed that a scheduling request will be consumed inside the schedule() call?
You are right that for the spread policy, all the scheduling requests can be fulfilled immediately inside the schedule() call. But for the driver deployment policy, that might not be the case: e.g., if there are recovering replicas, the scheduler won't schedule new replicas (to avoid multiple replicas on the same node) until a future schedule() call when all replicas are recovered.
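To illustrate why pending requests can outlive a single call, here is a hedged sketch; the policy helpers and attribute names are assumptions for the example, not the real scheduler API:

```python
from collections import defaultdict


class DeploymentSchedulerSketch:
    """Illustrative only: pending requests may survive a schedule() call."""

    def __init__(self):
        # deployment name -> replica name -> scheduling request not yet placed.
        self._pending_replicas = defaultdict(dict)
        # deployment name -> replicas still recovering (node id unknown).
        self._recovering_replicas = defaultdict(set)

    def schedule(self, upscales, downscales):
        # Downscales are handled separately and omitted from this sketch.
        for deployment_name, requests in upscales.items():
            for request in requests:
                self._pending_replicas[deployment_name][request.replica_name] = request

        for deployment_name in list(self._pending_replicas):
            if self._uses_spread_policy(deployment_name):
                # Spread policy: every pending request can be launched now,
                # so nothing is left pending after this call.
                self._launch_all_pending(deployment_name)
            elif not self._recovering_replicas[deployment_name]:
                # Driver-deployment-style policy: only place replicas when no
                # replica is recovering, so two replicas never end up on the
                # same node. Otherwise the requests stay in
                # self._pending_replicas until a later schedule() call.
                self._launch_one_per_node(deployment_name)

    # Placeholders so the sketch stays short.
    def _uses_spread_policy(self, name): ...
    def _launch_all_pending(self, name): ...
    def _launch_one_per_node(self, name): ...
```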
""" | ||
replicas_to_stop = set() | ||
|
||
pending_launching_recovering_replicas = set().union( |
This might be a behavior change. Why do we want to remove the non-running replicas first? I think it doesn't align with "Prioritize replicas that have fewest copies on a node." in the comments.
It's the same behavior: the code will try to prioritize replicas without a node id first. Comment from the original code:
# Replicas not in running state might have _node_id = None.
# We will prioritize those first.
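A hedged sketch of that prioritization; the data shapes are assumed for illustration:

```python
from collections import defaultdict


def choose_replicas_to_stop(replicas, num_to_stop):
    """Sketch: pick replicas to stop during a downscale.

    `replicas` is a list of (replica_id, node_id_or_None) pairs. Replicas
    whose node id is unknown (not running yet) are stopped first; the rest
    are ordered so replicas on nodes holding the fewest copies of this
    deployment go first, matching the comment quoted above.
    """
    without_node = [replica for replica, node in replicas if node is None]
    with_node = [(replica, node) for replica, node in replicas if node is not None]

    copies_per_node = defaultdict(int)
    for _, node in with_node:
        copies_per_node[node] += 1

    ranked = sorted(with_node, key=lambda pair: copies_per_node[pair[1]])
    ordered = without_node + [replica for replica, _ in ranked]
    return set(ordered[:num_to_stop])
```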
It makes scheduling decisions in a batch mode for each update cycle.
"""

def __init__(self, gcs_client: Optional[GcsClient] = None):
I am hesitant to let the scheduler manage the replica status, which duplicates DeploymentState's responsibility. I may be missing some context; can you help me understand why the scheduler needs to manage all the replica statuses? (I was thinking of letting DeploymentState call the scheduler to schedule replicas & remove replicas.)
The deployment scheduler has its own state machine, which is different from the deployment state one: the state is based on whether we know the node id of the replica, since node id information is what matters for the scheduling implementation. For example, the deployment scheduler doesn't have an UPDATING state, since from the scheduler's point of view it's the same as RUNNING (i.e., the replica's node id is known).
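An illustrative (not actual) enum for the scheduler-side view described here:

```python
import enum


class SchedulerReplicaStateSketch(enum.Enum):
    """Sketch of a scheduler-side replica state machine.

    The scheduler only needs to know whether a replica's node id is known,
    so several DeploymentState states collapse into one here.
    """
    PENDING = enum.auto()     # scheduling requested, actor not created yet
    LAUNCHING = enum.auto()   # actor created, node id not reported yet
    RECOVERING = enum.auto()  # recovered after controller restart, node id unknown
    RUNNING = enum.auto()     # node id known (covers RUNNING and UPDATING upstream)
```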
Nice work so far! I left some comments.
def _notify_running_replicas_changed(self):
def notify_running_replicas_changed(self) -> None:
    running_replica_infos = self.get_running_replica_infos()
    if (
To confirm, this isn't a behavior change, right? This is an optimization to reduce the number of times we send a notification with the LongPollHost?
There is no behavior change.
Sounds good, thanks.
Is this related to the scheduling change, or just an independent change? If unrelated, please separate it into its own PR.
Yea, it's related.
Nice work!
# so that we can make sure we don't schedule two replicas on the same node.
return

all_nodes = {node_id for node_id, _ in get_all_node_ids(self._gcs_client)}
why's the custom gcs_client necessary here?
Because get_all_node_ids needs a gcs_client as an argument.
for replica in self._replicas.pop(states=[ReplicaState.STARTING]):
    self._stop_replica(replica)

return upscale
Confused by this return -- why are we early-returning the upscale list in the downscale case?
is this just to early return with an empty list?
Yea, it's just an early return with an empty list. I changed it to return an empty list directly to make that obvious.
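For clarity, a small sketch of the shape being discussed; the names and placeholder request objects are illustrative:

```python
def scale_deployment_sketch(target_num_replicas, running_replicas):
    """Return a list of (placeholder) upscale requests, or [] when downscaling."""
    if len(running_replicas) >= target_num_replicas:
        # Downscale path: stop the extras and return an explicit empty list,
        # which reads more clearly than returning a never-populated `upscale`.
        for replica in running_replicas[target_num_replicas:]:
            replica.stop()
        return []

    # Upscale path: one placeholder request per missing replica.
    return [{"new_replica_index": i}
            for i in range(target_num_replicas - len(running_replicas))]
```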
num_existing_replicas = self._replicas.count()
if num_existing_replicas >= self._target_state.num_replicas:
num_running_replicas = self._replicas.count(states=[ReplicaState.RUNNING])
if num_running_replicas >= self._target_state.num_replicas:
    for replica in self._replicas.pop(states=[ReplicaState.STARTING]):
        self._stop_replica(replica)
shouldn't this logic live in the scheduler rather than here as part of the downscale request?
It's not the normal downscaling (the driver deployment doesn't downscale based on qps). It's about cancelling extra replicas.
I added some comments to make it clear:
# Cancel starting replicas when driver deployment state creates
# more replicas than alive nodes.
# For example, get_all_node_ids returns 4 nodes when
# the driver deployment state decides the target number of replicas,
# but later on when the deployment scheduler schedules these 4 replicas,
# there are only 3 alive nodes (1 node dies in between).
# In this case, 1 replica will be stuck in PENDING_ALLOCATION and we
# cancel it here.
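A hedged sketch of that cancellation check; the replica container API here is simplified rather than the real one used in the diff above:

```python
from enum import Enum


class ReplicaState(Enum):
    STARTING = "STARTING"
    RUNNING = "RUNNING"


def cancel_extra_starting_replicas(replicas, target_num_replicas, stop_replica):
    """Sketch of the check quoted above.

    For the driver deployment, the target replica count is derived from the
    node count at decision time. If a node dies before scheduling happens,
    one replica can be left STARTING with nowhere to go; once enough replicas
    are RUNNING, the leftover STARTING ones are cancelled.
    """
    num_running = sum(1 for _, state in replicas if state is ReplicaState.RUNNING)
    if num_running >= target_num_replicas:
        for replica, state in replicas:
            if state is ReplicaState.STARTING:
                stop_replica(replica)
```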
deleted, recovering, upscale, downscale = deployment_state.update()
if upscale:
    upscales[deployment_name] = upscale
if downscale:
    downscales[deployment_name] = downscale
-deleted, recovering, upscale, downscale = deployment_state.update()
-if upscale:
-    upscales[deployment_name] = upscale
-if downscale:
-    downscales[deployment_name] = downscale
+deleted, recovering, upscales[deployment_name], downscales[deployment_name] = deployment_state.update()
I don't want to add to the upscales/downscales dicts if the deployment has no upscale or downscale request.
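In other words (illustrative helper, not the controller's actual code), only deployments with work end up in the dicts the scheduler iterates over:

```python
def collect_scheduling_requests(deployment_states):
    """Sketch: gather per-deployment scale requests for one batch schedule() call."""
    upscales, downscales = {}, {}
    for name, state in deployment_states.items():
        deleted, recovering, upscale, downscale = state.update()
        # Only record deployments that actually have something to schedule;
        # the suggested one-liner would insert (possibly empty) entries for
        # every deployment on every update cycle.
        if upscale:
            upscales[name] = upscale
        if downscale:
            downscales[name] = downscale
    return upscales, downscales
```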
LGTM, let's merge after branch cut!
…oject#36588) Separate the serve scheduling logic into its own class and switch to batch scheduling for making better scheduling decisions. Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com> Signed-off-by: e428265 <arvind.chandramouli@lmco.com>
Why are these changes needed?
Separate the serve scheduling logic into its own class and switch to batch scheduling to make better scheduling decisions.
Related issue number
Checks
- I've signed off every commit (git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- If I added a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.