Introduce queue timeout for prediction requests #1720
Conversation
Thank you for your contribution @nateagr, I think the feature makes a lot of sense. I just kicked off our GitHub Actions CI (historically our CodeBuild Windows and GPU CI is flaky). The two most important tests are the GPU and Linux ones, so let's make sure those pass. We have a long weekend ending on Tuesday morning, so I'll review your code more thoroughly then. Also, if you have any feedback or feature requests for TorchServe, please email me; I would love to chat more. Are you using or planning to use TorchServe in production?
Hello @msaroufim! Thanks for the new CI. I opened a PR a few months ago for the same feature request but did not manage to pass the CI. We are currently using TorchServe in production :) Feel free to ask for more details; I can probably provide more details/feedback.
It is not clear to me why testModelWithCustomPythonDependency fails.
CI does indeed pass; the Windows failure is unrelated to your change. Thank you for making this change. Could you please also update the docs with this new configuration parameter? https://github.com/pytorch/serve/blob/master/docs/configuration.md#other-properties
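A minimal sketch of how such a per-instance property could be read from config.properties, assuming a property named queueTimeout (the name used in this PR's discussion); the parsing below is illustrative and does not use TorchServe's actual ConfigManager:

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.util.Properties;

// Illustrative only: reads a hypothetical "queueTimeout" entry (in milliseconds)
// from config.properties.
public class TimeoutConfigExample {
    public static void main(String[] args) throws IOException {
        Properties props = new Properties();
        try (FileInputStream in = new FileInputStream("config.properties")) {
            props.load(in);
        }
        // Fall back to a permissive default when the property is absent.
        long queueTimeoutMs = Long.parseLong(props.getProperty("queueTimeout", "120000"));
        System.out.println("queue timeout (ms): " + queueTimeoutMs);
    }
}
```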
@nateagr Thank you for the PR. From a customer's perspective, users only care about the overall inference latency. Internally, TS can have multiple phases/queues (e.g., in a workflow). I think your request is to terminate the inference request if the overall timeout is reached. Adding a queue timeout will confuse workflow customers, since they don't know which queue it applies to.
Hello @lxning! Thank you very much for your feedback. From what I understand, you would like to avoid introducing a parameter dedicated to request queuing, as other steps could be involved in processing a request in a workflow. My intent is to avoid expensive computation on stale requests (i.e., requests that have been waiting to be processed for too long). I initially thought of using the existing parameter responseTimeout, but then realized it would not be a good idea, as it would change the behavior of existing TorchServe instances. What do you suggest?
@nateagr I suggest using "inferenceTimeout" to measure the overall end-to-end timeout. For example, if the user specifies "inferenceTimeout=30sec", an inference request will be terminated as soon as it exceeds 30 seconds after TorchServe receives the request.
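A minimal sketch of the end-to-end semantics described above, assuming the clock starts when TorchServe receives the request; the class and field names are illustrative, not part of TorchServe's codebase:

```java
import java.time.Duration;
import java.time.Instant;

class InferenceDeadline {
    private final Instant receivedAt = Instant.now(); // stamped when the request arrives
    private final Duration budget;

    InferenceDeadline(Duration budget) {
        this.budget = budget; // e.g. Duration.ofSeconds(30) for "inferenceTimeout=30sec"
    }

    boolean isExpired() {
        // True once the request has been inside TorchServe longer than the budget,
        // regardless of which phase (queue, batching, backend inference) it is in.
        return Duration.between(receivedAt, Instant.now()).compareTo(budget) > 0;
    }
}
```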
Hi @lxning! Thanks for your reply. By "inferenceTimeout", do you mean "responseTimeout", which is the parameter actually used to set a timeout for inference and inference only (see https://github.com/pytorch/serve/blob/master/frontend/server/src/main/java/org/pytorch/serve/wlm/WorkerThread.java#L189)? As explained in my first message on this PR, we would like to be able to dismiss prediction requests after a certain amount of time because they are no longer relevant after that time. There is currently the parameter "responseTimeout", which controls the maximum duration spent on inference, but it does not fit our needs because TorchServe will still run inference for prediction requests that have been sitting in the queue for a long time, which is exactly what we want to avoid/control. As an example, let's say prediction requests are considered stale after 300ms and we set "responseTimeout" to 300ms. TorchServe will run inference for all prediction requests, even those that have been in the queue for more than 300ms. In my opinion, it is no secret that there is a waiting queue used for server-side batching, and it is a good/important feature to let users control the maximum amount of time spent in the queue in order to invalidate stale prediction requests. Processing stale prediction requests uses compute needlessly and increases the latency of relevant prediction requests. I'm totally open to other suggestions and will gladly implement any alternative, so please feel free to suggest any other solution that comes to mind.
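To illustrate the gap described above, here is a hedged sketch (hypothetical names, not TorchServe's actual implementation) contrasting what the backend-only timer measures with the end-to-end latency a request actually accumulates:

```java
// responseTimeout only bounds time measured from when a batch is dispatched to
// the backend, so queue wait time is invisible to it.
class ResponseTimeoutScope {
    static final long RESPONSE_TIMEOUT_MS = 300;

    // What responseTimeout effectively measures: backend inference time only.
    static boolean backendTimedOut(long dispatchedAtMs, long nowMs) {
        return nowMs - dispatchedAtMs > RESPONSE_TIMEOUT_MS;
    }

    // What this PR wants to bound as well: queue wait plus inference time.
    static long totalLatencyMs(long receivedAtMs, long nowMs) {
        return nowMs - receivedAtMs;
    }
}
```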
@nateagr Ideally, "responseTimeout" would be a good parameter name to measure the end-to-end inference latency timeout. However, it is already used to measure model inference latency at the backend in existing TorchServe, so I suggested introducing a new parameter, "inferenceTimeout", to measure the end-to-end inference latency timeout. Here, end-to-end means door (i.e., TorchServe receives the request) to door (i.e., TorchServe returns the response). As I mentioned before, I believe a TorchServe client's requirements are:
1. an end-to-end inference timeout, measured from the moment TorchServe receives the request;
2. a model inference timeout at the backend.
Current TorchServe already meets requirement 2. For requirement 1, you can contribute if you are interested.
Hi @lxning, I'm going to rephrase what you are suggesting just to make sure we are on the same page: we introduce a new parameter, "inferenceTimeout", that is used to terminate/abort prediction requests that are too old. In contrast to the existing parameter "responseTimeout", which only applies to Python inference, "inferenceTimeout" will also be used to terminate/abort prediction requests in earlier/later stages, such as while they sit in the queue. So I will rename "queueTimeout" (which I introduce in this PR) to "inferenceTimeout", and I will also use this new parameter to abort other stages.
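A rough sketch of what enforcing a single end-to-end inferenceTimeout at more than one stage could look like, e.g. when requests are pulled off the queue for batching; the classes below are hypothetical and only mirror the idea discussed above, not TorchServe's actual Job/BatchAggregator implementation:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Hypothetical request wrapper that remembers when it entered the server.
class QueuedRequest {
    final long receivedAtMs = System.currentTimeMillis();

    boolean isExpired(long inferenceTimeoutMs) {
        return System.currentTimeMillis() - receivedAtMs > inferenceTimeoutMs;
    }
}

class StaleAwareBatcher {
    private final Queue<QueuedRequest> queue = new ArrayDeque<>();
    private final long inferenceTimeoutMs;

    StaleAwareBatcher(long inferenceTimeoutMs) {
        this.inferenceTimeoutMs = inferenceTimeoutMs;
    }

    void submit(QueuedRequest request) {
        queue.add(request);
    }

    // Drop requests that have already exceeded the end-to-end budget instead of
    // spending backend compute on them; the same check could run again right
    // before the batch is sent to the Python worker.
    List<QueuedRequest> pollBatch(int maxBatchSize) {
        List<QueuedRequest> batch = new ArrayList<>();
        QueuedRequest request;
        while (batch.size() < maxBatchSize && (request = queue.poll()) != null) {
            if (request.isExpired(inferenceTimeoutMs)) {
                continue;
            }
            batch.add(request);
        }
        return batch;
    }
}
```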
Create a ticket to handle expired inference requests. |
Description
This PR follows feature request #1322.
The feature request is not related to a bug, but at my company we have the following problem: we would like to be able to dismiss prediction requests after a certain amount of time, because they are no longer relevant after that time. As of today, it is possible to define a timeout, responseTimeout (https://pytorch.org/serve/configuration.html), but it only applies to model inference (the handler); it does not include the time spent in the queue. This PR adds the possibility to define a maximum amount of time spent in the queue. Combined with responseTimeout (time spent in Python for inference), the total duration corresponds to the maximum amount of time before a prediction request is stale. Instead of creating one parameter that would cover both the time spent in the queue and the inference time, I've decided to add a maximum duration in the queue for the following reasons:
The new parameter does not interfere with the existing parameter responseTimeout, which I don't want to modify in order to avoid regressions for users already relying on it.
The latency of a prediction request includes the time spent in the queue plus the inference time. For inference, prediction requests are grouped into batches. It is not clear what duration to use to time out inference for a batch of prediction requests that have potentially spent different amounts of time in the queue, and thus have different remaining time budgets.
In my opinion, splitting the maximum latency of a prediction request in two (maximum duration in the queue and maximum duration for inference) gives more control to users and is more flexible. Users normally know the inference time (or at least have a rough idea). Let's say I've put a model in production that takes ~300ms for inference, and prediction requests are stale after 500ms. I don't want TorchServe to send prediction requests that have spent more than 200ms in the queue to Python for inference (because the predictions will be stale by the time they are available). So instead, I prefer telling TorchServe to dismiss any prediction request that has been in the queue for more than 200ms (see the sketch after this list).
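As referenced above, a small sketch of the budget arithmetic under the assumed numbers (500ms staleness, ~300ms inference, hence a 200ms queue budget); purely illustrative:

```java
// Illustrative arithmetic only: requests are stale after 500 ms end to end,
// inference takes roughly 300 ms, so anything that has already waited more
// than 200 ms in the queue should be dismissed rather than sent to Python.
class QueueBudgetExample {
    static final long STALE_AFTER_MS = 500;       // request no longer useful after this
    static final long TYPICAL_INFERENCE_MS = 300; // rough per-batch inference time
    static final long MAX_QUEUE_MS = STALE_AFTER_MS - TYPICAL_INFERENCE_MS; // 200 ms

    static boolean shouldDismiss(long enqueuedAtMs, long nowMs) {
        return nowMs - enqueuedAtMs > MAX_QUEUE_MS;
    }
}
```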
Type of change
Feature/Issue validation/testing
I've added a simple unit test. I did not find existing tests that cover this kind of small feature, so I'm open to any suggestions/advice.
Checklist: