Leave response and sendError when request is canceled #3267

slashvar · 2024-07-24T17:05:22Z

Description

Leave response and sendError method when responseObserver is canceled to avoid continuing on canceled requests.

This was the previous behavior introduced by #2420 and lost during a later refactoring. A similar early return on cancel exists for case OIPPREDICT in response method.

Fixes #3087

Type of change

Please delete options that are not relevant.

Bug fix (non-breaking change which fixes an issue)

Feature/Issue validation/testing

To my knowledge, there is no dedicated tests for this issue (it was fixed before and then reintroduced).

Currently trying to figure out how to properly test it.

Checklist:

Did you have fun?
Have you added tests that prove your fix is effective or that this feature works?
Has code been commented, particularly in hard-to-understand areas?
Have you made corresponding changes to the documentation?

agunapal · 2024-07-24T23:34:40Z

Hi @slashvar Thanks for sending this PR!

Do you have the steps for how to repro this problem

slashvar · 2024-07-25T10:10:33Z

Hi @slashvar Thanks for sending this PR!

Do you have the steps for how to repro this problem

Hi @agunapal !

For now, I don't have an easy repro step. On our side, we see it happening under heavily load or when the number of workers/instances is a bit too small. We have a tight time budget (about 10 to 50ms max) and usually a very large volume of requests (something like 100 to 1k request per seconds, on several instances with multiple workers).

At least the scenario is always the same:

request timeout on the client side
We have a first log:

2024-07-24T09:40:27,155 [WARN ] W-9001-minilm_1.0 org.pytorch.serve.job.GRPCJob - grpc client call already cancelled, not able to send this response for requestId: d72a637a-8771-4a33-9483-15a30a52a430

Then an exception:

2024-07-24T09:40:27,155 [ERROR] W-9001-minilm_1.0 org.pytorch.serve.wlm.WorkerThread - IllegalStateException error
java.lang.IllegalStateException: Stream was terminated by error, no further calls are allowed
	at com.google.common.base.Preconditions.checkState(Preconditions.java:502) ~[model-server.jar:?]
	at io.grpc.stub.ServerCalls$ServerCallStreamObserverImpl.onNext(ServerCalls.java:374) ~[model-server.jar:?]
	at org.pytorch.serve.job.GRPCJob.response(GRPCJob.java:130) ~[model-server.jar:?]
	at org.pytorch.serve.wlm.BatchAggregator.sendResponse(BatchAggregator.java:99) ~[model-server.jar:?]
	at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:238) ~[model-server.jar:?]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:572) ~[?:?]
	at java.util.concurrent.FutureTask.run(FutureTask.java:317) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) ~[?:?]
	at java.lang.Thread.run(Thread.java:1570) [?:?]

And after that, the worker stops answering requests.

This is the same problem described in the linked issue #3087 but also in #3168 and #3091

I'm working on simple reproduction step that I can share.

agunapal · 2024-07-25T16:35:52Z

@slashvar Thank you for the details. Have you verified this fix works for you?

slashvar · 2024-07-30T18:41:58Z

@slashvar Thank you for the details. Have you verified this fix works for you?

Hello, sorry for the delay, settings an environment where I can properly test the issue with build from source was longer than expected.

TL;DR the fix works.

I've managed to reproduce the problem using the mnist example, using a a gRPC client in go (adapting my code that reproduces the issue within our environment) and an explicit timeout on the request.

Without the fix:

HTTP calls and gRPC with reasonable timeouts succeeds until a gRPC call trigger the issue
The issue is a bit tricky to reproduce as the timeout should long enough otherwise we don't even establish the connection, but short enough to kill the request before getting an answer. With the proper timeout, we can see the exception.
After the issue: no calls, neither grpc nor http, succeeds.

With the fix:

I no longer get the exception in case of timeout
All calls (with reasonable timeouts) always succeeds

One note: there is still another exception from time to time but this one happen when closing the request (when calling responseObserver.onCompleted()). This one does not block the worker but I saw it restarting which is not good for high availability. I'm testing an extra fix for this one too.

Here is my test code: https://gist.github.com/slashvar/8874c52d88895a922398289f81cd7a08
(in the import I had to remove the path to my generated proto as it was pointing to a private repo).

agunapal · 2024-07-31T16:33:44Z

Awesome. Thanks! I'll try your fix with a testcase I tried. Load an LLM with max_new_tokens = large value and send an inference request

agunapal · 2024-07-31T22:33:25Z

@slashvar I'm hoping my repro is reliable. I basically cancel one request and then send another request. and I see it crash. I tried with your fix and I don't see the crash any more. My colleague @namannandan has identified other cases to be fixed.

For now, I'll merge your PR. Please try with the nightlies to see if your issue is resolved. We can merge the other PR if the issue is not fully resolved.

agunapal

LGTM. Tried the fix. and it seems to address the crash

Leave response and sendError when request is canceled

ad99cc7

slashvar mentioned this pull request Jul 24, 2024

TorchServe crashes in production with `WorkerThread - IllegalStateException error' #3087

Closed

Add comment referencing the issue.

6b5a4a9

slashvar marked this pull request as ready for review July 24, 2024 17:24

agunapal added the grpc label Jul 24, 2024

fix: catch canceled job before closing too

7b4978a

namannandan mentioned this pull request Jul 31, 2024

Add exception handling in server for when gRPC client closes stream #3272

Draft

3 tasks

agunapal approved these changes Jul 31, 2024

View reviewed changes

agunapal added this pull request to the merge queue Jul 31, 2024

Merged via the queue into pytorch:master with commit e5db414 Jul 31, 2024
9 of 12 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Leave response and sendError when request is canceled #3267

Leave response and sendError when request is canceled #3267

slashvar commented Jul 24, 2024 •

edited

Loading

agunapal commented Jul 24, 2024

slashvar commented Jul 25, 2024

agunapal commented Jul 25, 2024

slashvar commented Jul 30, 2024

agunapal commented Jul 31, 2024

agunapal commented Jul 31, 2024

agunapal left a comment

Leave response and sendError when request is canceled #3267

Leave response and sendError when request is canceled #3267

Conversation

slashvar commented Jul 24, 2024 • edited Loading

Description

Type of change

Feature/Issue validation/testing

Checklist:

agunapal commented Jul 24, 2024

slashvar commented Jul 25, 2024

agunapal commented Jul 25, 2024

slashvar commented Jul 30, 2024

agunapal commented Jul 31, 2024

agunapal commented Jul 31, 2024

agunapal left a comment

Choose a reason for hiding this comment

slashvar commented Jul 24, 2024 •

edited

Loading