
[Serve] Separate the serve scheduling logic into its own class #36588

Merged
merged 25 commits into ray-project:master on Jul 1, 2023

Conversation

jjyao
Collaborator

@jjyao jjyao commented Jun 20, 2023

Why are these changes needed?

Separate the serve scheduling logic into its own class and switch to batch scheduling to make better scheduling decisions.
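
For context, a minimal sketch of what a batched scheduling interface could look like. The `schedule()` signature (per-deployment upscale and downscale requests) follows the code discussed later in this thread; the request fields and class internals are illustrative assumptions, not the exact implementation in this PR.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set


@dataclass
class ReplicaSchedulingRequest:
    # Illustrative fields only; the real request also carries actor options, etc.
    deployment_name: str
    replica_name: str
    on_scheduled: Callable  # called once the replica actor has been created


@dataclass
class DeploymentDownscaleRequest:
    deployment_name: str
    num_to_stop: int


class DeploymentScheduler:
    """Sketch of a scheduler that makes decisions once per update cycle."""

    def schedule(
        self,
        upscales: Dict[str, List[ReplicaSchedulingRequest]],
        downscales: Dict[str, DeploymentDownscaleRequest],
    ) -> Dict[str, Set[str]]:
        # Launch all requested replicas for this cycle in one batch ...
        for requests in upscales.values():
            for request in requests:
                pass  # create the replica actor, then call request.on_scheduled(...)
        # ... and decide which replicas to stop for each downscale request.
        deployment_to_replicas_to_stop: Dict[str, Set[str]] = {}
        for name, downscale in downscales.items():
            deployment_to_replicas_to_stop[name] = set()
        return deployment_to_replicas_to_stop
```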

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

jjyao added 7 commits June 20, 2023 10:58
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
@jjyao jjyao changed the title [Serve] Separate the serve scheduling logic into its own class [WIP] [Serve] Separate the serve scheduling logic into its own class Jun 21, 2023
@jjyao jjyao marked this pull request as ready for review June 21, 2023 21:09
@jjyao jjyao requested review from zcin, edoakes and sihanwang41 June 21, 2023 21:09
@jjyao
Collaborator Author

jjyao commented Jun 21, 2023

The PR is ready for a first round of review. In the meantime, I'll fix and add more tests.


# Check if the model_id has changed.
running_replicas_changed |= self._multiplexed_model_ids_updated
self._check_and_update_replicas()
Collaborator Author

I flipped the order of _check_and_update_replicas and _scale_deployment_replicas since I think we should first refresh the status of existing replicas and then scale based on the latest status. For example, it's possible that _check_and_update_replicas stopped a replica so that _scale_deployment_replicas needs to start a new one.
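
Roughly, the ordering being described is the following sketch (method names are taken from this thread; the body and return values are illustrative, not the actual implementation):

```python
def update(self):
    # Refresh the status of existing replicas first; this may stop
    # unhealthy replicas.
    self._check_and_update_replicas()
    # Then scale against the refreshed state, so a replica stopped above
    # can be replaced in this same update cycle instead of the next one.
    upscale, downscale = self._scale_deployment_replicas()
    return upscale, downscale
```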

Contributor

That should be fine; I don't think this ordering is important. The only downside is that it may take an extra iteration for a deployment to initially go from starting -> running.

Collaborator Author

Yea, that's true if starting/initialization is super fast.

self._deployment_states[deployment_name].stop_replicas(replicas_to_stop)

for deployment_name, deployment_state in self._deployment_states.items():
if set(running_replica_infos_before_update[deployment_name]) != set(
Collaborator Author

We can just decide whether to notify by comparing before and after.
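
Concretely, that before/after comparison could look like the following sketch (the names follow the quoted snippet; the wrapping method name and surrounding structure are assumptions):

```python
def update_all_deployments(self) -> None:
    # Snapshot the running replicas of every deployment before this update cycle.
    running_replica_infos_before_update = {
        name: deployment_state.get_running_replica_infos()
        for name, deployment_state in self._deployment_states.items()
    }

    ...  # run the update cycle (start/stop replicas, etc.)

    for deployment_name, deployment_state in self._deployment_states.items():
        before = set(running_replica_infos_before_update[deployment_name])
        after = set(deployment_state.get_running_replica_infos())
        if before != after:
            # Only push a long-poll notification when the set of running
            # replicas actually changed.
            deployment_state.notify_running_replicas_changed()
```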

Contributor

@edoakes edoakes left a comment

Approach looks good for a first cut

python/ray/serve/_private/deployment_state.py (outdated; resolved)

# Check if the model_id has changed.
running_replicas_changed |= self._multiplexed_model_ids_updated
self._check_and_update_replicas()
Contributor

That should be fine; I don't think this ordering is important. The only downside is that it may take an extra iteration for a deployment to initially go from starting -> running.

python/ray/serve/_private/deployment_scheduler.py (outdated; resolved)
python/ray/serve/_private/deployment_scheduler.py (outdated; resolved)
jjyao added 5 commits June 21, 2023 17:31
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
@jjyao jjyao requested a review from edoakes June 22, 2023 06:08
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
@scottjlee scottjlee assigned scottjlee and unassigned sihanwang41, edoakes and zcin Jun 22, 2023
jjyao added 3 commits June 22, 2023 14:03
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
python/ray/serve/_private/deployment_scheduler.py (outdated; resolved)
python/ray/serve/_private/deployment_state.py (outdated; resolved)
upscales: Dict[str, List[ReplicaSchedulingRequest]],
downscales: Dict[str, DeploymentDownscaleRequest],
):
"""This is called for each update cycle to do batch scheduling.
Contributor

Could you help me understand why we need to have self._pending_replicas?
For the spread policy, is it always guaranteed that scheduling requests will be consumed inside the schedule() call?

Collaborator Author

You are right that for the spread policy, all the scheduling requests can be fulfilled immediately inside the schedule() call. But for the driver deployment policy, that might not be the case: e.g., if there are recovering replicas, the scheduler won't schedule new replicas (to avoid placing multiple replicas on the same node) until a future schedule() call when all replicas have recovered.
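
A simplified sketch of that behavior for the driver-deployment (one-replica-per-node) policy; apart from self._pending_replicas, the attribute and helper names below are assumptions for illustration:

```python
def _schedule_driver_deployment(self, deployment_name: str) -> None:
    if self._recovering_replicas[deployment_name]:
        # Node ids of recovering replicas are unknown, so scheduling now
        # could place two replicas on the same node. Keep the requests in
        # self._pending_replicas and retry on a later schedule() call.
        return

    occupied_nodes = set(self._launching_or_running_replica_nodes[deployment_name])
    for node_id in self._all_alive_node_ids():
        if node_id in occupied_nodes:
            continue
        if not self._pending_replicas[deployment_name]:
            break
        replica_name, request = self._pending_replicas[deployment_name].popitem()
        self._launch_replica_on_node(request, node_id)  # hypothetical helper
```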

"""
replicas_to_stop = set()

pending_launching_recovering_replicas = set().union(
Contributor

This might be a behavior change in this case.
Why do we want to remove the non-running replicas first? I don't think it aligns with "Prioritize replicas that have the fewest copies on a node." in the comments.

Collaborator Author

It's the same behavior: the code will try to prioritize replicas without a node id first.

Comment from the original code:

# Replicas not in running state might have _node_id = None.
# We will prioritize those first.
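
Putting the two rules together, the selection order is: replicas whose node id is not yet known (pending / launching / recovering) first, then replicas on nodes with the fewest copies of the deployment. A hedged sketch of that ordering as a standalone helper (not the actual code):

```python
import collections
from typing import Iterable, Set


def choose_replicas_to_stop(replicas: Iterable, num_to_stop: int) -> Set[str]:
    """Each replica has `.replica_name` and `.node_id` (None if unknown)."""
    replicas = list(replicas)
    # How many replicas of this deployment sit on each known node.
    copies_per_node = collections.Counter(
        r.node_id for r in replicas if r.node_id is not None
    )

    def priority(replica):
        # Replicas without a node id sort first, then replicas on nodes
        # with the fewest copies of this deployment.
        return -1 if replica.node_id is None else copies_per_node[replica.node_id]

    return {r.replica_name for r in sorted(replicas, key=priority)[:num_to_stop]}
```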

python/ray/serve/_private/deployment_scheduler.py (outdated; resolved)
It makes scheduling decisions in a batch mode for each update cycle.
"""

def __init__(self, gcs_client: Optional[GcsClient] = None):
Contributor

I am hesitant to let the scheduler manage the replica status, which duplicates DeploymentState's responsibility. I may be missing some context; can you help me understand why the scheduler needs to manage all the replica statuses? (I was thinking of letting DeploymentState call the scheduler to schedule replicas & remove replicas.)

Collaborator Author

The deployment scheduler has its own state machine, which is different from the deployment state one: the state is based on whether we know the node id of the replica, since node id information is important for the scheduling implementation. For example, the deployment scheduler doesn't have an UPDATING state since, from the scheduler's point of view, it's the same as RUNNING (i.e., the replica node id is known).
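
A minimal sketch of that scheduler-side view (the enum and function below are illustrative, showing only that the distinction is "node id known vs. not known"):

```python
import enum
from typing import Optional


class SchedulerReplicaState(enum.Enum):
    """Scheduler-side replica states; deliberately coarser than DeploymentState's."""
    PENDING = "PENDING"        # scheduling requested, actor not created yet
    LAUNCHING = "LAUNCHING"    # actor created, node id not known yet
    RECOVERING = "RECOVERING"  # controller restarted, node id not known yet
    RUNNING = "RUNNING"        # node id known (covers RUNNING and UPDATING)


def scheduler_view(node_id: Optional[str], recovering: bool) -> SchedulerReplicaState:
    # Once the node id is known the scheduler no longer cares whether the
    # replica is UPDATING or RUNNING; both look the same for placement.
    if node_id is not None:
        return SchedulerReplicaState.RUNNING
    return SchedulerReplicaState.RECOVERING if recovering else SchedulerReplicaState.LAUNCHING
```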

@jjyao
Collaborator Author

jjyao commented Jun 25, 2023

long_running_serve and long_running_serve_failure passed: https://buildkite.com/ray-project/release-tests-pr/builds/43414#_

jjyao added 2 commits June 25, 2023 21:49
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
@jjyao jjyao changed the title [WIP] [Serve] Separate the serve scheduling logic into its own class [Serve] Separate the serve scheduling logic into its own class Jun 26, 2023
@jjyao jjyao requested a review from sihanwang41 June 26, 2023 16:52
Contributor

@shrekris-anyscale shrekris-anyscale left a comment

Nice work so far! I left some comments.

python/ray/serve/_private/deployment_scheduler.py (outdated; resolved)
python/ray/serve/_private/deployment_scheduler.py (outdated; resolved)
python/ray/serve/_private/deployment_scheduler.py (outdated; resolved)
python/ray/serve/_private/deployment_scheduler.py (outdated; resolved)
python/ray/serve/_private/deployment_scheduler.py (outdated; resolved)
python/ray/serve/_private/deployment_state.py (resolved)
def _notify_running_replicas_changed(self):
def notify_running_replicas_changed(self) -> None:
running_replica_infos = self.get_running_replica_infos()
if (
Contributor

To confirm, this isn't a behavior change right? This is an optimization to reduce the number of times we send a notification with the LongPollHost?

Collaborator Author

There is no behavior change.

Contributor

Sounds good, thanks.

Contributor

is this related to the scheduling change? or just an independent change?

if unrelated, please separate into its own PR.

Collaborator Author

Yea, it's related.

python/ray/serve/_private/deployment_state.py (outdated; resolved)
jjyao added 2 commits June 26, 2023 14:53
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
@jjyao jjyao requested a review from shrekris-anyscale June 27, 2023 04:35
Contributor

@shrekris-anyscale shrekris-anyscale left a comment

Nice work!

# so that we can make sure we don't schedule two replicas on the same node.
return

all_nodes = {node_id for node_id, _ in get_all_node_ids(self._gcs_client)}
Contributor

why's the custom gcs_client necessary here?

Collaborator Author

Because get_all_node_ids needs the gcs_client as an argument.

def _notify_running_replicas_changed(self):
def notify_running_replicas_changed(self) -> None:
running_replica_infos = self.get_running_replica_infos()
if (
Contributor

is this related to the scheduling change? or just an independent change?

if unrelated, please separate into its own PR.

python/ray/serve/_private/deployment_state.py (outdated; resolved)
for replica in self._replicas.pop(states=[ReplicaState.STARTING]):
self._stop_replica(replica)

return upscale
Contributor

confused by this return -- why are we early returning upscale list in the downscale case?

Contributor

is this just to early return with an empty list?

Collaborator Author

Yeah, it's just an early return with an empty list. I changed it to return an empty list directly to make it obvious.
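
In other words, the branch ends up looking roughly like this (illustrative sketch; the wrapping method name is an assumption):

```python
def _scale_driver_deployment_replicas(self):
    # ... (upscale path omitted) ...
    num_running_replicas = self._replicas.count(states=[ReplicaState.RUNNING])
    if num_running_replicas >= self._target_state.num_replicas:
        for replica in self._replicas.pop(states=[ReplicaState.STARTING]):
            self._stop_replica(replica)
        # Nothing to upscale on this path; return an empty list explicitly
        # rather than the (always-empty) `upscale` variable.
        return []
```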

Comment on lines 2010 to 2015
num_existing_replicas = self._replicas.count()
if num_existing_replicas >= self._target_state.num_replicas:
num_running_replicas = self._replicas.count(states=[ReplicaState.RUNNING])
if num_running_replicas >= self._target_state.num_replicas:
for replica in self._replicas.pop(states=[ReplicaState.STARTING]):
self._stop_replica(replica)
Contributor

shouldn't this logic live in the scheduler rather than here as part of the downscale request?

Collaborator Author

It's not the normal downscaling (driver deployment doesn't downscale based on qps). It's about cancelling extra replicas.

I added some comments to make it clear:

# Cancel starting replicas when the driver deployment state creates
# more replicas than there are alive nodes.
# For example, get_all_node_ids returns 4 nodes when
# the driver deployment state decides the target number of replicas,
# but later on, when the deployment scheduler schedules these 4 replicas,
# there are only 3 alive nodes (1 node died in between).
# In this case, 1 replica will be stuck in the PENDING_ALLOCATION state and we
# cancel it here.
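
A hedged sketch of the cancellation path those comments describe, wrapped in a standalone helper here for readability (the helper name is an assumption; the body follows the quoted snippet):

```python
def _stop_extra_starting_replicas(self) -> None:
    """Illustrative only: cancel STARTING replicas that can never be placed.

    The driver deployment targets one replica per alive node. If a node dies
    after the target was computed (e.g. target 4, but only 3 alive nodes by
    the time the scheduler runs), one replica stays unplaced
    (PENDING_ALLOCATION). Once enough replicas are RUNNING, the extra
    STARTING replicas are cancelled here rather than by the scheduler.
    """
    num_running = self._replicas.count(states=[ReplicaState.RUNNING])
    if num_running >= self._target_state.num_replicas:
        for replica in self._replicas.pop(states=[ReplicaState.STARTING]):
            self._stop_replica(replica)
```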

python/ray/serve/_private/deployment_state.py (outdated; resolved)
Comment on lines 2477 to 2481
deleted, recovering, upscale, downscale = deployment_state.update()
if upscale:
upscales[deployment_name] = upscale
if downscale:
downscales[deployment_name] = downscale
Contributor

Suggested change
deleted, recovering, upscale, downscale = deployment_state.update()
if upscale:
upscales[deployment_name] = upscale
if downscale:
downscales[deployment_name] = downscale
deleted, recovering, upscales[deployment_name], downscales[deployment_name] = deployment_state.update()

Collaborator Author

I don't want to add to upscales/downscales dict if the deployment has no upscale or downscale request.
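
Sketched out, the distinction is between unconditional assignment and only recording deployments that have scheduling work this cycle (the wrapping method and the controller-side attribute names here are assumptions):

```python
def _run_update_cycle(self) -> None:
    upscales = {}
    downscales = {}
    for deployment_name, deployment_state in self._deployment_states.items():
        deleted, recovering, upscale, downscale = deployment_state.update()
        # Only record deployments that actually have scheduling work this
        # cycle; unconditional assignment would insert empty/None entries
        # into the dicts handed to the scheduler.
        if upscale:
            upscales[deployment_name] = upscale
        if downscale:
            downscales[deployment_name] = downscale

    self._deployment_scheduler.schedule(upscales, downscales)
```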

jjyao added 2 commits June 27, 2023 22:03
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
@jjyao jjyao requested a review from edoakes June 28, 2023 16:46
Contributor

@edoakes edoakes left a comment

LGTM, let's merge after branch cut!

@jjyao jjyao merged commit 01599ad into ray-project:master Jul 1, 2023
@jjyao jjyao deleted the jjyao/separate branch July 1, 2023 14:40
arvind-chandra pushed a commit to lmco/ray that referenced this pull request Aug 31, 2023
…oject#36588)

Separate the serve scheduling logic into its own class and switch to batch scheduling to make better scheduling decisions.

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
Signed-off-by: e428265 <arvind.chandramouli@lmco.com>