Introduce queue timeout for prediction requests #1720
Conversation
Thank you for your contribution @nateagr, I think the feature makes a lot of sense. I just kicked off our GitHub Actions CI (historically our CodeBuild Windows and GPU CI is flaky). The two most important tests are the GPU and Linux ones, so let's make sure those pass. We have a long weekend ending on Tuesday morning, so I'll review your code more thoroughly then. Also, if you have any feedback or feature requests for TorchServe, please email me; I would love to chat more. Are you using or planning to use TorchServe in production?
Hello @msaroufim! Thanks for the new CI. I opened a PR a few months ago for the same feature request but did not manage to pass the CI. We are currently using TorchServe in production :) Feel free to ask for more details; I can probably provide more details/feedback.
It is not clear to me why testModelWithCustomPythonDependency fails.
CI does indeed pass; the Windows failure is unrelated to your change. Thank you for making this change. Could you please also update the docs with this new configuration parameter? https://github.com/pytorch/serve/blob/master/docs/configuration.md#other-properties
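A minimal sketch of how such a per-instance property could be read from config.properties, assuming a property named queueTimeout (the name used in this PR's discussion); the parsing below is illustrative and does not use TorchServe's actual ConfigManager:

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.util.Properties;

// Illustrative only: reads a hypothetical "queueTimeout" entry (in milliseconds)
// from config.properties.
public class TimeoutConfigExample {
    public static void main(String[] args) throws IOException {
        Properties props = new Properties();
        try (FileInputStream in = new FileInputStream("config.properties")) {
            props.load(in);
        }
        // Fall back to a permissive default when the property is absent.
        long queueTimeoutMs = Long.parseLong(props.getProperty("queueTimeout", "120000"));
        System.out.println("queue timeout (ms): " + queueTimeoutMs);
    }
}
```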
@nateagr Thank you for the PR. From a customer's perspective, users only care about the overall inference latency. Internally, TS can have multiple phases/queues (e.g., in a workflow). I think your request is to terminate the inference request if the overall timeout is reached. Adding a queue timeout will confuse workflow customers, since they don't know which queue it applies to.
Hello @lxning! Thank you very much for your feedback. From what I understand, you would like to avoid introducing a parameter dedicated to request queuing, as other steps could be involved in processing a request in a workflow. My intent is to avoid expensive computation on stale requests (i.e., requests that have been waiting to be processed for too long). I initially thought of using the existing parameter responseTimeout, but then realized it would not be a good idea, as it would change the behavior of existing TorchServe instances. What do you suggest?
@nateagr I suggest using "inferenceTimeout" to measure the overall end-to-end timeout. For example, if the user specifies "inferenceTimeout=30sec", an inference request will be terminated as soon as it exceeds 30 seconds after TorchServe receives the request.
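A minimal sketch of the end-to-end semantics described above, assuming the clock starts when TorchServe receives the request; the class and field names are illustrative, not part of TorchServe's codebase:

```java
import java.time.Duration;
import java.time.Instant;

class InferenceDeadline {
    private final Instant receivedAt = Instant.now(); // stamped when the request arrives
    private final Duration budget;

    InferenceDeadline(Duration budget) {
        this.budget = budget; // e.g. Duration.ofSeconds(30) for "inferenceTimeout=30sec"
    }

    boolean isExpired() {
        // True once the request has been inside TorchServe longer than the budget,
        // regardless of which phase (queue, batching, backend inference) it is in.
        return Duration.between(receivedAt, Instant.now()).compareTo(budget) > 0;
    }
}
```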
Hi @lxning! Thanks for your reply. By "inferenceTimeout", do you mean "responseTimeout", which is the parameter actually used to set a timeout for inference and inference only (see https://github.com/pytorch/serve/blob/master/frontend/server/src/main/java/org/pytorch/serve/wlm/WorkerThread.java#L189)? As explained in my first message on this PR, we would like to be able to dismiss prediction requests after a certain amount of time because they are no longer relevant after that time. There is currently the parameter "responseTimeout", which controls the maximum duration spent on inference, but it does not fit our needs because TorchServe will still run inference for prediction requests that have been sitting in the queue for a long time, which is exactly what we want to avoid/control. As an example, let's say prediction requests are considered stale after 300ms and we set "responseTimeout" to 300ms. TorchServe will run inference for all prediction requests, even those that have been in the queue for more than 300ms. In my opinion, it is no secret that there is a waiting queue used for server-side batching, and it is a good/important feature to let users control the maximum amount of time spent in the queue in order to invalidate stale prediction requests. Processing stale prediction requests uses compute needlessly and increases the latency of relevant prediction requests. I'm totally open to other suggestions and will gladly implement any alternative, so please feel free to suggest any other solution that comes to mind.
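To illustrate the gap described above, here is a hedged sketch (hypothetical names, not TorchServe's actual implementation) contrasting what the backend-only timer measures with the end-to-end latency a request actually accumulates:

```java
// responseTimeout only bounds time measured from when a batch is dispatched to
// the backend, so queue wait time is invisible to it.
class ResponseTimeoutScope {
    static final long RESPONSE_TIMEOUT_MS = 300;

    // What responseTimeout effectively measures: backend inference time only.
    static boolean backendTimedOut(long dispatchedAtMs, long nowMs) {
        return nowMs - dispatchedAtMs > RESPONSE_TIMEOUT_MS;
    }

    // What this PR wants to bound as well: queue wait plus inference time.
    static long totalLatencyMs(long receivedAtMs, long nowMs) {
        return nowMs - receivedAtMs;
    }
}
```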
@nateagr Ideally, "responseTimeout" would be a good parameter name to measure the end-to-end inference latency timeout. However, it is already used to measure model inference latency at the backend in existing TorchServe, so I suggested introducing a new parameter, "inferenceTimeout", to measure the end-to-end inference latency timeout. Here, end-to-end means door (i.e., TorchServe receives the request) to door (i.e., TorchServe returns the response). As I mentioned before, I believe a TorchServe client's requirements are:
1. an end-to-end inference timeout, measured from the moment TorchServe receives the request;
2. a model inference timeout at the backend.
Current TorchServe already meets requirement 2. For requirement 1, you can contribute if you are interested.
Hi @lxning, I'm going to rephrase what you are suggesting just to make sure we are on the same page: we introduce a new parameter, "inferenceTimeout", that is used to terminate/abort prediction requests that are too old. In contrast to the existing parameter "responseTimeout", which only applies to Python inference, "inferenceTimeout" will also be used to terminate/abort prediction requests in earlier/later stages, such as while they sit in the queue. So I will rename "queueTimeout" (which I introduce in this PR) to "inferenceTimeout", and I will also use this new parameter to abort other stages.
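A rough sketch of what enforcing a single end-to-end inferenceTimeout at more than one stage could look like, e.g. when requests are pulled off the queue for batching; the classes below are hypothetical and only mirror the idea discussed above, not TorchServe's actual Job/BatchAggregator implementation:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Hypothetical request wrapper that remembers when it entered the server.
class QueuedRequest {
    final long receivedAtMs = System.currentTimeMillis();

    boolean isExpired(long inferenceTimeoutMs) {
        return System.currentTimeMillis() - receivedAtMs > inferenceTimeoutMs;
    }
}

class StaleAwareBatcher {
    private final Queue<QueuedRequest> queue = new ArrayDeque<>();
    private final long inferenceTimeoutMs;

    StaleAwareBatcher(long inferenceTimeoutMs) {
        this.inferenceTimeoutMs = inferenceTimeoutMs;
    }

    void submit(QueuedRequest request) {
        queue.add(request);
    }

    // Drop requests that have already exceeded the end-to-end budget instead of
    // spending backend compute on them; the same check could run again right
    // before the batch is sent to the Python worker.
    List<QueuedRequest> pollBatch(int maxBatchSize) {
        List<QueuedRequest> batch = new ArrayList<>();
        QueuedRequest request;
        while (batch.size() < maxBatchSize && (request = queue.poll()) != null) {
            if (request.isExpired(inferenceTimeoutMs)) {
                continue;
            }
            batch.add(request);
        }
        return batch;
    }
}
```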
Create a ticket to handle expired inference requests. |
Description
This PR follows feature request #1322.
The feature request is not related to a bug, but at my company we have the following problem: we would like to be able to dismiss prediction requests after a certain amount of time, because they are no longer relevant after that time. As of today, it is possible to define a timeout, responseTimeout (https://pytorch.org/serve/configuration.html), but it only applies to model inference (the handler); it does not include the time spent in the queue. This PR adds the possibility to define a maximum amount of time spent in the queue. Combined with responseTimeout (time spent in Python for inference), the total duration corresponds to the maximum amount of time before a prediction request is stale. Instead of creating one parameter that would cover both the time spent in the queue and the inference time, I've decided to add a maximum duration in the queue for the following reasons:
The new parameter does not interfere with the existing parameter responseTimeout, which I don't want to modify in order to avoid regressions for users already relying on it.
The latency of a prediction request includes the time spent in the queue plus the inference time. For inference, prediction requests are grouped into batches. It is not clear what duration to use to time out inference for a batch of prediction requests that have potentially spent different amounts of time in the queue, and thus have different remaining time budgets.
In my opinion, splitting the maximum latency of a prediction request in two (maximum duration in the queue and maximum duration for inference) gives more control to users and is more flexible. Users normally know the inference time (or at least have a rough idea). Let's say I've put a model in production that takes ~300ms for inference, and prediction requests are stale after 500ms. I don't want TorchServe to send prediction requests that have spent more than 200ms in the queue to Python for inference (because the predictions will be stale by the time they are available). So instead, I prefer telling TorchServe to dismiss any prediction request that has been in the queue for more than 200ms (see the sketch after this list).
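As referenced above, a small sketch of the budget arithmetic under the assumed numbers (500ms staleness, ~300ms inference, hence a 200ms queue budget); purely illustrative:

```java
// Illustrative arithmetic only: requests are stale after 500 ms end to end,
// inference takes roughly 300 ms, so anything that has already waited more
// than 200 ms in the queue should be dismissed rather than sent to Python.
class QueueBudgetExample {
    static final long STALE_AFTER_MS = 500;       // request no longer useful after this
    static final long TYPICAL_INFERENCE_MS = 300; // rough per-batch inference time
    static final long MAX_QUEUE_MS = STALE_AFTER_MS - TYPICAL_INFERENCE_MS; // 200 ms

    static boolean shouldDismiss(long enqueuedAtMs, long nowMs) {
        return nowMs - enqueuedAtMs > MAX_QUEUE_MS;
    }
}
```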
Type of change
Feature/Issue validation/testing
I've added a simple unit test. I did not find existing tests that cover this kind of small feature, so I'm open to any suggestions/advice.
Checklist: