Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Serve] Separate the serve scheduing logic into its own class #36588
[Serve] Separate the serve scheduing logic into its own class #36588
Changes from 15 commits
18cb50c
0ddad3e
06e1ada
ff4c616
2b8047a
f790da7
9f6bca2
104df7b
e9f9333
42511c3
01e0cef
8fca862
58e372c
4c4cbcf
b0e9a46
327b21e
6d0c1d2
c9d197c
fc98129
3059be7
cf9d195
5bd9b4b
013c79c
fb7ff09
a28552d
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am hesitant to let the scheduler managing the replica status which is duplicated with DeploymentState responsibility. I may miss some context, can you help me to understand why scheduler need to manage all the replica status. (I was thinking to let DeploymentState to call the scheduler to schedule replicas & remove replicas).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Deployment scheduler has it's own state machine which is different from the deployment state one: the state is based on whether we know the node id of the replica since node id information is important for scheduling implementation. For example, deployment scheduler doesn't have a UPDATING state since from scheduler's point of view it's the same as RUNNING (i.e the replica node id is known).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you help me understand why we need to have self._pending_replicas?
For spread policy, is it always guaranteed that scheduled request will be consumed inside the schedule() call?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right that for spread policy, all the schedule requests can be fulfilled immediately inside the
schedule()
call. But for driver deployment policy, it might not be the case: e.g., if there are recovering replicas, scheduler won't schedule new replicas to avoid multiple replicas on the same node until future schedule() call when all replicas are recovered.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why's the custom gcs_client necessary here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because
get_all_node_ids
needs gcs_client as argument.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this might be behavior change in this case.
Why do we want to remove the non-running replicas first? I think It is not align
Prioritize replicas that have fewest copies on a node.
in the comments.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Its the same behavior: the code will try to prioritize replicas without node id first.
Comment from the original code: