
support jobQueueSize and job ticket per model in model config YAML #2350

Merged: 12 commits merged on Jun 13, 2023

Conversation

@lxning (Collaborator) commented May 16, 2023

Description

This PR adds per-model support for jobQueueSize and useJobTicket in the model config YAML.

Fixes #2322, #2333

Type of change

  • New feature (non-breaking change which adds functionality)
  • This change requires a documentation update

Feature/Issue validation/testing


  • Test jobQueueSize

    Added model_config.yaml in frontend/archive/src/test/resources/models/mnist_scripted.mar for the unit test:

    ```yaml
    jobQueueSize: 1
    ```

  • Test jobTicket

    ```bash
    cat model_store/mnist/model-config.yaml
    ```
    ```yaml
    minWorkers: 2
    maxWorkers: 2
    jobQueueSize: 2
    useJobTicket: true
    ```

    ```bash
    seq 3 | xargs -n1 -P 3 curl http://localhost:8080/predictions/mnist_scripted_v2 -T test_data/0.png
    ```
    ```json
    {
      "code": 503,
      "type": "ServiceUnavailableException",
      "message": "Model \"mnist_scripted_v2\" has no worker to serve inference request. Please use scale workers API to add workers."
    }
    ```

Checklist:

  • Did you have fun?
  • Have you added tests that prove your fix is effective or that this feature works?
  • Has code been commented, particularly in hard-to-understand areas?
  • Have you made corresponding changes to the documentation?

@lxning lxning added the enhancement New feature or request label May 16, 2023
@lxning lxning self-assigned this May 16, 2023
@lxning lxning added this to the v0.8.1 milestone May 16, 2023
codecov bot commented May 16, 2023

Codecov Report

Merging #2350 (4f291f4) into master (7f9967e) will not change coverage.
The diff coverage is n/a.

❗ Current head 4f291f4 differs from the pull request's most recent head 4793c33. Consider uploading reports for commit 4793c33 to get more accurate results.

```
@@           Coverage Diff           @@
##           master    #2350   +/-   ##
=======================================
  Coverage   71.92%   71.92%
=======================================
  Files          78       78
  Lines        3648     3648
  Branches       58       58
=======================================
  Hits         2624     2624
  Misses       1020     1020
  Partials        4        4
```


@lxning changed the title from "support jobQueueSize per model in model config YAML" to "support jobQueueSize and job ticket per model in model config YAML" on May 19, 2023
```java
/**
 * the job queue size of a model. By default, job_queue_size is set to 100 in
 * config.properties for all models. Here, jobQueueSize: -1 means no customized
 * setting for the model.
 */
private int jobQueueSize;
```


For my information, where do we default it to 100?

@lxning (Collaborator Author):

The default queueSize of 100 is passed into Model. In this PR, the customized config value is used to override that queueSize.
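To make the override concrete, here is a minimal sketch under stated assumptions; the class and method names are illustrative, not the PR's exact code:

```java
// Hypothetical sketch of the per-model override described above. The global
// default of 100 comes from job_queue_size in config.properties; a positive
// per-model jobQueueSize replaces it, while -1 means "no per-model setting".
public final class JobQueueSizeSketch {
    private static final int GLOBAL_DEFAULT = 100;

    static int resolveJobQueueSize(int modelJobQueueSize) {
        // Any non-positive value falls back to the global default.
        return modelJobQueueSize > 0 ? modelJobQueueSize : GLOBAL_DEFAULT;
    }

    public static void main(String[] args) {
        System.out.println(resolveJobQueueSize(-1)); // 100 -> global default
        System.out.println(resolveJobQueueSize(2));  // 2   -> per-model override
    }
}
```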

```java
@@ -34,6 +34,11 @@ public BaseModelRequest getRequest(String threadName, WorkerState state)
        model.pollBatch(
                threadName, (state == WorkerState.WORKER_MODEL_LOADED) ? 0 : Long.MAX_VALUE, jobs);

        if (model.isUseJobTicket() && jobs.isEmpty()) {
            model.decNumJobTickets();
```


Will numJobTickets become negative if we simply keep decrementing it?

@lxning (Collaborator Author):

A ticket is incremented when an idle worker comes to pollBatch.

The underlying logic is: increment a ticket at pollBatch; decrement a ticket at addJob.

The corner case is when there are no incoming requests: idle workers keep trying to pollBatch and get an empty batch. The decrement here ensures the number of tickets always stays balanced (i.e., +1/-1).
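To make the +1/-1 accounting concrete, a minimal sketch of the ticket lifecycle described above, with illustrative names (this is not the PR's exact code):

```java
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative sketch of the ticket lifecycle: a ticket is added when an idle
// worker polls, removed when a job is accepted, and taken back when the poll
// returns empty, so the count stays balanced.
public class TicketBalanceSketch {
    private final AtomicInteger numJobTickets = new AtomicInteger(0);

    // Worker side: polling for a batch announces one available ticket (+1).
    public void onPollBatch() {
        numJobTickets.incrementAndGet();
    }

    // Frontend side: accepting a job consumes one ticket (-1); if none is
    // available, the caller rejects the request with a 503.
    public synchronized boolean tryTakeTicket() {
        if (numJobTickets.get() <= 0) {
            return false;
        }
        numJobTickets.decrementAndGet();
        return true;
    }

    // Corner case: the worker polled but no job arrived; take back the unused
    // ticket (-1) so the count never drifts upward.
    public void onEmptyPoll() {
        numJobTickets.decrementAndGet();
    }
}
```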

```java
            return false;
        }

        this.numJobTickets.decrementAndGet();
```


Why are we decrementing here when hasJobTickets is only supposed to check?

@lxning (Collaborator Author):

Multiple threads can call hasJobTickets. Decrementing in this function guarantees the ticket is reserved for this request.

@namannandan (Collaborator) commented May 24, 2023:

Could there potentially be a race condition here with two threads calling hasJobTickets simultaneously which could cause numJobTickets to become negative since at the time of checking, both threads may not have completed the update just yet?

@lxning (Collaborator Author):

This is a synchronized function, so the check and the decrement cannot interleave across threads.
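For illustration, a small runnable demonstration (hypothetical names, not the PR's code) of why a synchronized check-and-decrement cannot go negative, even with concurrent callers:

```java
// Because hasJobTickets is synchronized, the check and the decrement execute
// as one atomic step; two threads cannot both pass the check before either
// decrements.
public class SynchronizedTicketDemo {
    private int numJobTickets = 1;

    public synchronized boolean hasJobTickets() {
        if (numJobTickets <= 0) {
            return false; // no ticket -> caller rejects with 503
        }
        numJobTickets--;   // reserve the ticket for this request
        return true;
    }

    public static void main(String[] args) throws InterruptedException {
        SynchronizedTicketDemo demo = new SynchronizedTicketDemo();
        Runnable caller = () -> System.out.println(demo.hasJobTickets());
        Thread t1 = new Thread(caller);
        Thread t2 = new Thread(caller);
        t1.start();
        t2.start();
        t1.join();
        t2.join();
        // With one ticket, exactly one thread prints true; the count never
        // drops below zero.
    }
}
```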

@andrewjguo commented May 24, 2023:

I see. Then a function name like getJobTicket or takeJobTicket would make more sense, as it actually reduces the ticket count instead of just checking.

```java
@@ -225,6 +234,10 @@ public void removeJobQueue(String threadId) {
    }

    public boolean addJob(Job job) {
        if (isUseJobTicket() && !hasJobTickets()) {
```
Collaborator:

Will a similar check for available job tickets be required in the `public void addJob(String threadId, Job job)` method as well?

@lxning (Collaborator Author):

Again, hasJobTickets is a synchronized function.

```java
@@ -253,6 +266,9 @@ public void pollBatch(String threadId, long waitTime, Map<String, Job> jobsRepo)
    }

    try {
        if (isUseJobTicket()) {
            incNumJobTickets();
```
Collaborator:

Will a similar update to the job ticket count be required in the code segment above?

```java
LinkedBlockingDeque<Job> jobsQueue = jobsDb.get(threadId);
if (jobsQueue != null && !jobsQueue.isEmpty()) {
    Job j = jobsQueue.poll(waitTime, TimeUnit.MILLISECONDS);
    if (j != null) {
        jobsRepo.put(j.getJobId(), j);
        return;
    }
}
```

@lxning (Collaborator Author):

why?

Collaborator:

Looks like we are only tracking requests on the DEFAULT_DATA_QUEUE, so it won't be required for the above case. Makes sense.

@HamidShojanazeri (Collaborator) left a comment:

Thanks @lxning, LGTM. I think it's a great feature. It would also be great if you could add an example, e.g., extending the LLM example, for a custom job queue size as well.

````markdown
@@ -194,3 +194,12 @@ torchrun:
# gpus you wish to split your model
OMP_NUMBER_THREADS: 2
```
#### Feature Job ticket is recommended for the use case of inference latency Sensitive
````
Collaborator:

Suggested change:

```diff
-#### Feature Job ticket is recommended for the use case of inference latency Sensitive
+#### Feature Job ticket is recommended for the use case of inference latency sensitive
```

````markdown
@@ -194,3 +194,12 @@ torchrun:
# gpus you wish to split your model
OMP_NUMBER_THREADS: 2
```
#### Feature Job ticket is recommended for the use case of inference latency Sensitive
When the job ticket feature is enabled, it verifies the availability of a model's active worker for processing a client's request. If an active worker is available, the request is accepted; otherwise, a 503 response is sent back to client. Here is an example of enabling job ticket.
````
Collaborator:

@lxning shall we elaborate a bit more on why someone may want to use this? Describing the use case and motivation a bit will help raise awareness of when and how users can use this feature.

@lxning (Collaborator Author):

Added more information to this doc.

````markdown
@@ -194,3 +194,12 @@ torchrun:
# gpus you wish to split your model
OMP_NUMBER_THREADS: 2
```
#### Feature Job ticket is recommended for the use case of inference latency sensitive
When the job ticket feature is enabled, TorchServe verifies the availability of a model's active worker for processing a client's request. If an active worker is available, the request is accepted and processed immediately without waiting time incurred from job queue or dynamic batching; otherwise, a 503 response is sent back to client. This feature can help the use case which is sensitive to inference latency, such as interactive code generator, chatGPT. These applications can take effective actions, for example, routing the rejected request to a different server, or scaling up model server capacity, based on the business requirements. Here is an example of enabling job ticket.
````
Collaborator:

Suggested change:

```diff
-When the job ticket feature is enabled, TorchServe verifies the availability of a model's active worker for processing a client's request. If an active worker is available, the request is accepted and processed immediately without waiting time incurred from job queue or dynamic batching; otherwise, a 503 response is sent back to client. This feature can help the use case which is sensitive to inference latency, such as interactive code generator, chatGPT. These applications can take effective actions, for example, routing the rejected request to a different server, or scaling up model server capacity, based on the business requirements. Here is an example of enabling job ticket.
+When the job ticket feature is enabled, TorchServe verifies the availability of a model's active worker for processing a client's request. If an active worker is available, the request is accepted and processed immediately without waiting time incurred from the job queue or dynamic batching; otherwise, a 503 response is sent back to the client.
+This feature helps with use cases where inference latency can be high, such as generative models and auto-regressive decoder models like ChatGPT. It helps such applications take effective actions, for example routing the rejected request to a different server or scaling up model server capacity, based on business requirements. Here is an example of enabling the job ticket.
```
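As a usage illustration of the "route the rejected request to a different server" action described in this doc change, here is a hedged client-side sketch; the endpoint hosts and file path are hypothetical, and only the documented 503 behavior is taken from the PR:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Path;

// Sketch: send an inference request to the primary TorchServe endpoint; if it
// is rejected with 503 (no job ticket / no active worker), retry the same
// request against a fallback server.
public class FailoverClientSketch {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        String[] endpoints = {
            "http://primary:8080/predictions/mnist_scripted_v2",   // hypothetical host
            "http://fallback:8080/predictions/mnist_scripted_v2"   // hypothetical host
        };
        for (String endpoint : endpoints) {
            HttpRequest request = HttpRequest.newBuilder(URI.create(endpoint))
                    .POST(HttpRequest.BodyPublishers.ofFile(Path.of("test_data/0.png")))
                    .build();
            HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());
            if (response.statusCode() != 503) {
                // Accepted (or failed for a reason other than "no ticket").
                System.out.println(response.body());
                return;
            }
            // 503: no active worker ticket on this server; try the next one.
        }
        System.err.println("All endpoints rejected the request with 503.");
    }
}
```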
