support jobQueueSize and job ticket per model in model config YAML #2350
Conversation
Codecov Report
```
@@           Coverage Diff           @@
##           master   #2350   +/-  ##
=======================================
  Coverage   71.92%   71.92%
=======================================
  Files          78       78
  Lines        3648     3648
  Branches       58       58
=======================================
  Hits         2624     2624
  Misses       1020     1020
  Partials        4        4
```
```java
 * the job queue size of a model. By default, job_queue_size is set as 100 in config.property
 * for all models. Here, jobQueueSize: -1 means no customized setting for the model.
 */
private int jobQueueSize;
```
For my information, where do we default it to 100?
The default queueSize of 100 is passed into Model. In this PR, the customized config value is used to override that queueSize.
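The override described above can be sketched as follows; the class and constant names here are assumptions for illustration, not the actual TorchServe code:

```java
// Hypothetical sketch of resolving a model's job queue size.
// The server-wide default of 100 comes from config.property; a positive
// per-model jobQueueSize in the model config YAML replaces it, while
// -1 (no customized setting) keeps the default.
class QueueSizeResolver {
    static final int DEFAULT_JOB_QUEUE_SIZE = 100;

    static int resolve(int modelJobQueueSize) {
        return modelJobQueueSize > 0 ? modelJobQueueSize : DEFAULT_JOB_QUEUE_SIZE;
    }
}
```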
```java
@@ -34,6 +34,11 @@ public BaseModelRequest getRequest(String threadName, WorkerState state)
        model.pollBatch(
                threadName, (state == WorkerState.WORKER_MODEL_LOADED) ? 0 : Long.MAX_VALUE, jobs);

        if (model.isUseJobTicket() && jobs.isEmpty()) {
            model.decNumJobTickets();
```
Will NumJobTickets go negative if we simply keep reducing the number?
A ticket is incremented when an idle worker comes to pollBatch.
The underlying logic is: increment a ticket at pollBatch, decrement a ticket at addJob.
The corner case is when there are no incoming requests: idle workers keep trying to pollBatch and get an empty result. This decrement makes sure the number of tickets stays balanced (i.e. +1/-1).
```java
            return false;
        }
```
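The +1/-1 accounting described in this thread can be sketched as follows; the class and method names are assumptions for illustration, not the actual TorchServe code:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of the ticket accounting: a worker earns a ticket
// when it polls for work, a ticket is spent when a request is accepted,
// and a ticket earned by an empty poll is immediately taken back so the
// count stays balanced.
class TicketedModel {
    private final AtomicInteger numJobTickets = new AtomicInteger(0);

    // Called when an idle worker polls for a batch.
    void onPollBatch(boolean gotJobs) {
        numJobTickets.incrementAndGet();     // worker is available: +1
        if (!gotJobs) {
            numJobTickets.decrementAndGet(); // empty poll: take the ticket back
        }
    }

    // Called when a client request arrives; returns true if accepted.
    synchronized boolean takeJobTicket() {
        if (numJobTickets.get() > 0) {
            numJobTickets.decrementAndGet(); // spend the ticket for this request
            return true;
        }
        return false;                        // no active worker: caller returns 503
    }

    int tickets() {
        return numJobTickets.get();
    }
}
```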
```java
        this.numJobTickets.decrementAndGet();
```
Why are we decreasing here when we are just checking hasJobTickets?
Multiple threads can call hasJobTickets. Decrementing in this function guarantees the ticket is reserved for this request.
Could there potentially be a race condition here, with two threads calling hasJobTickets simultaneously, which could cause numJobTickets to become negative? At the time of checking, both threads may not have completed the update yet.
This is a synchronized function.
I see. Then a function name like getJobTicket or takeJobTicket would make more sense, as it's actually trying to reduce the ticket count instead of just checking.
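A take-style operation can also be made race-safe without relying on the caller holding a lock, by folding the check and the decrement into one atomic step with a compare-and-set loop. This is only an illustrative sketch, not the PR's code:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical lock-free "take a ticket" operation. compareAndSet makes
// the check and the decrement a single atomic step, so two racing threads
// can never both take the last ticket and the count cannot go negative.
class JobTickets {
    private final AtomicInteger numJobTickets = new AtomicInteger(0);

    void inc() {
        numJobTickets.incrementAndGet();
    }

    boolean takeJobTicket() {
        for (;;) {
            int current = numJobTickets.get();
            if (current <= 0) {
                return false;                // no ticket available
            }
            if (numJobTickets.compareAndSet(current, current - 1)) {
                return true;                 // this thread won the ticket
            }
            // another thread raced us; retry with the fresh value
        }
    }
}
```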
```java
@@ -225,6 +234,10 @@ public void removeJobQueue(String threadId) {
    }

    public boolean addJob(Job job) {
        if (isUseJobTicket() && !hasJobTickets()) {
```
Will a similar check for available job tickets be required in the public void addJob(String threadId, Job job) method as well?
Again, hasJobTickets is a synchronized function.
```java
@@ -253,6 +266,9 @@ public void pollBatch(String threadId, long waitTime, Map<String, Job> jobsRepo)
    }

        try {
            if (isUseJobTicket()) {
                incNumJobTickets();
```
Will a similar update to the job ticket count be required in the code segment above?

```java
LinkedBlockingDeque<Job> jobsQueue = jobsDb.get(threadId);
if (jobsQueue != null && !jobsQueue.isEmpty()) {
    Job j = jobsQueue.poll(waitTime, TimeUnit.MILLISECONDS);
    if (j != null) {
        jobsRepo.put(j.getJobId(), j);
        return;
    }
}
```
why?
Looks like we are only tracking requests on the DEFAULT_DATA_QUEUE, so it won't be required for the above case. Makes sense.
Thanks @lxning, LGTM. I think it's a great feature. It would also be great if you could add an example, or extend the LLM example, for custom job queue size as well.
docs/large_model_inference.md (Outdated)
@@ -194,3 +194,12 @@ torchrun:
# gpus you wish to split your model
OMP_NUMBER_THREADS: 2
```
#### Feature Job ticket is recommended for the use case of inference latency Sensitive
Suggested change:
- #### Feature Job ticket is recommended for the use case of inference latency Sensitive
+ #### Feature Job ticket is recommended for the use case of inference latency sensitive
docs/large_model_inference.md (Outdated)
@@ -194,3 +194,12 @@ torchrun:
# gpus you wish to split your model
OMP_NUMBER_THREADS: 2
```
#### Feature Job ticket is recommended for the use case of inference latency Sensitive
When the job ticket feature is enabled, it verifies the availability of a model's active worker for processing a client's request. If an active worker is available, the request is accepted; otherwise, a 503 response is sent back to client. Here is an example of enabling job ticket.
@lxning shall we elaborate a bit more on why someone may want to use it? Describing a bit of the use case/motivation will help raise awareness of when and how users can use this feature.
Added more information in this doc.
docs/large_model_inference.md (Outdated)
@@ -194,3 +194,12 @@ torchrun:
# gpus you wish to split your model
OMP_NUMBER_THREADS: 2
```
#### Feature Job ticket is recommended for the use case of inference latency sensitive
When the job ticket feature is enabled, TorchServe verifies the availability of a model's active worker for processing a client's request. If an active worker is available, the request is accepted and processed immediately without waiting time incurred from job queue or dynamic batching; otherwise, a 503 response is sent back to client. This feature can help the use case which is sensitive to inference latency, such as interactive code generator, chatGPT. These applications can take effective actions,for example, routing the rejected request to a different server, or scaling up model server capacity, based on the business requirements. Here is an example of enabling job ticket.
Suggested change:
When the job ticket feature is enabled, TorchServe verifies the availability of a model's active worker for processing a client's request. If an active worker is available, the request is accepted and processed immediately without waiting time incurred from job queue or dynamic batching; otherwise, a 503 response is sent back to the client.
This feature helps with use cases where inference latency can be high, such as generative models and auto-regressive decoder models like chatGPT. It enables such applications to take effective actions, for example routing the rejected request to a different server or scaling up model server capacity, based on business requirements. Here is an example of enabling job ticket.
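The YAML example that "Here is an example of enabling job ticket" refers to is not included in this excerpt. A hypothetical sketch of such a per-model config, with key names assumed rather than taken from this excerpt, might look like:

```yaml
# Hypothetical model-config YAML (key names are assumptions, not from this excerpt)
minWorkers: 2
maxWorkers: 2
useJobTicket: true   # reject a request with 503 when no active worker is available
jobQueueSize: 200    # per-model override; -1 keeps the server-wide default of 100
```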
Description
Fixes #2322, #2333
Feature/Issue validation/testing
Test jobQueueSize
Add model_config.yaml in frontend/archive/src/test/resources/models/mnist_scripted.mar for the unit test:

```yaml
jobQueueSize: 1
```
Test jobTicket
```shell
seq 3 | xargs -n1 -P 3 curl http://localhost:8080/predictions/mnist_scripted_v2 -T test_data/0.png
```

```json
{
  "code": 503,
  "type": "ServiceUnavailableException",
  "message": "Model \"mnist_scripted_v2\" has no worker to serve inference request. Please use scale workers API to add workers."
}
```
Checklist: