Leave response and sendError when request is canceled #3267

Conversation

@slashvar (Contributor) commented Jul 24, 2024:

Description

Return early from the response and sendError methods when the responseObserver has been canceled, to avoid continuing work on canceled requests.

This restores the behavior introduced in #2420, which was lost during a later refactoring. A similar early return on cancel already exists for the OIPPREDICT case in the response method.

Fixes #3087
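
For illustration only (a sketch, not the actual diff in this PR), this is the guard pattern the description refers to, assuming gRPC Java's ServerCallStreamObserver; the class and helper names below are placeholders, not TorchServe code:

```java
// Sketch of the early-return-on-cancel guard; names are illustrative,
// not the actual GRPCJob implementation.
import io.grpc.Status;
import io.grpc.stub.ServerCallStreamObserver;
import io.grpc.stub.StreamObserver;

final class CancelGuardSketch {

    // Response path: skip onNext()/onCompleted() if the client already cancelled;
    // otherwise the late write fails with the IllegalStateException seen in the linked issue.
    static <T> void respond(StreamObserver<T> responseObserver, T prediction) {
        ServerCallStreamObserver<T> observer =
                (ServerCallStreamObserver<T>) responseObserver;
        if (observer.isCancelled()) {
            return; // leave early: the stream is dead, nothing can be written to it
        }
        responseObserver.onNext(prediction);
        responseObserver.onCompleted();
    }

    // sendError path: same guard before onError()
    static <T> void sendError(StreamObserver<T> responseObserver, String message) {
        ServerCallStreamObserver<T> observer =
                (ServerCallStreamObserver<T>) responseObserver;
        if (observer.isCancelled()) {
            return;
        }
        responseObserver.onError(
                Status.INTERNAL.withDescription(message).asRuntimeException());
    }
}
```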

Type of change


  • Bug fix (non-breaking change which fixes an issue)

Feature/Issue validation/testing

To my knowledge, there are no dedicated tests for this issue (it was fixed before and then reintroduced).

I'm currently trying to figure out how to test it properly.

Checklist:

  • Did you have fun?
  • Have you added tests that prove your fix is effective or that this feature works?
  • Has code been commented, particularly in hard-to-understand areas?
  • Have you made corresponding changes to the documentation?

@slashvar marked this pull request as ready for review July 24, 2024 17:24
@agunapal (Collaborator) commented:

Hi @slashvar, thanks for sending this PR!

Do you have the steps to reproduce this problem?

@agunapal added the grpc label Jul 24, 2024
@slashvar (Contributor, Author) commented:

> Hi @slashvar, thanks for sending this PR!
>
> Do you have the steps to reproduce this problem?

Hi @agunapal!

For now, I don't have an easy repro. On our side, we see it happening under heavy load, or when the number of workers/instances is a bit too small. We have a tight time budget (about 10 to 50 ms max) and usually a very large volume of requests (roughly 100 to 1k requests per second, spread across several instances with multiple workers).

The scenario, at least, is always the same:

  • The request times out on the client side
  • We first see this warning:
2024-07-24T09:40:27,155 [WARN ] W-9001-minilm_1.0 org.pytorch.serve.job.GRPCJob - grpc client call already cancelled, not able to send this response for requestId: d72a637a-8771-4a33-9483-15a30a52a430
  • Then an exception:
2024-07-24T09:40:27,155 [ERROR] W-9001-minilm_1.0 org.pytorch.serve.wlm.WorkerThread - IllegalStateException error
java.lang.IllegalStateException: Stream was terminated by error, no further calls are allowed
	at com.google.common.base.Preconditions.checkState(Preconditions.java:502) ~[model-server.jar:?]
	at io.grpc.stub.ServerCalls$ServerCallStreamObserverImpl.onNext(ServerCalls.java:374) ~[model-server.jar:?]
	at org.pytorch.serve.job.GRPCJob.response(GRPCJob.java:130) ~[model-server.jar:?]
	at org.pytorch.serve.wlm.BatchAggregator.sendResponse(BatchAggregator.java:99) ~[model-server.jar:?]
	at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:238) ~[model-server.jar:?]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:572) ~[?:?]
	at java.util.concurrent.FutureTask.run(FutureTask.java:317) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) ~[?:?]
	at java.lang.Thread.run(Thread.java:1570) [?:?]

And after that, the worker stops answering requests.

This is the same problem described in the linked issue #3087, and also in #3168 and #3091.

I'm working on a simple reproduction step that I can share.
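
To make the shape of the repro concrete, here is a rough sketch (not the commenter's Go client) of a deadline-bounded gRPC call in Java that produces this kind of client-side cancellation; the stub type is a placeholder for whatever classes your proto generates, and the port is an assumption:

```java
// Sketch of a client call with a deadline shorter than the inference time.
// When the deadline fires, the client cancels the call; the server worker then
// hits the cancelled stream when it finally tries to write the answer.
import java.util.concurrent.TimeUnit;
import java.util.function.Function;
import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;
import io.grpc.Status;
import io.grpc.StatusRuntimeException;
import io.grpc.stub.AbstractBlockingStub;

final class DeadlineReproSketch {

    // Run an RPC through any generated blocking stub with a tight deadline.
    static <S extends AbstractBlockingStub<S>, R> R callWithTightDeadline(
            S stub, Function<S, R> rpc, long millis) {
        try {
            return rpc.apply(stub.withDeadlineAfter(millis, TimeUnit.MILLISECONDS));
        } catch (StatusRuntimeException e) {
            if (e.getStatus().getCode() == Status.Code.DEADLINE_EXCEEDED
                    || e.getStatus().getCode() == Status.Code.CANCELLED) {
                // Expected on the client side; the bug is what the server does next.
                return null;
            }
            throw e;
        }
    }

    public static void main(String[] args) {
        // Assumed gRPC inference endpoint; adjust host/port to your setup.
        ManagedChannel channel = ManagedChannelBuilder
                .forAddress("localhost", 7070)
                .usePlaintext()
                .build();
        // With a generated stub you would call something like:
        //   callWithTightDeadline(MyServiceGrpc.newBlockingStub(channel),
        //                         s -> s.predict(request), 20);
        channel.shutdownNow();
    }
}
```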

@agunapal (Collaborator) commented:

@slashvar Thank you for the details. Have you verified this fix works for you?

@slashvar (Contributor, Author) commented:

> @slashvar Thank you for the details. Have you verified this fix works for you?

Hello, sorry for the delay; setting up an environment where I could properly test the issue with a build from source took longer than expected.

TL;DR: the fix works.

I've managed to reproduce the problem with the mnist example, using a gRPC client in Go (adapted from the code that reproduces the issue in our environment) and an explicit timeout on the request.

Without the fix:

  • HTTP calls and gRPC calls with reasonable timeouts succeed until a gRPC call triggers the issue.
  • The issue is a bit tricky to reproduce: the timeout must be long enough that the connection is established at all, but short enough to cancel the request before an answer arrives. With the right timeout, we can see the exception.
  • After the issue occurs, no calls succeed, neither gRPC nor HTTP.

With the fix:

  • I no longer get the exception in case of timeout.
  • All calls (with reasonable timeouts) succeed.

One note: there is still another exception from time to time, but this one happens when closing the request (when calling responseObserver.onCompleted()). It does not block the worker, but I saw the worker restarting, which is not good for high availability. I'm testing an extra fix for this one too.
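
As an illustration of that kind of extra guard (a sketch, not the commenter's actual follow-up patch), the same isCancelled() check plus a defensive catch can be applied around the final onCompleted():

```java
// Sketch only: guard the final onCompleted() the same way, so a call cancelled
// between the response write and the stream close does not crash the worker.
import io.grpc.stub.ServerCallStreamObserver;
import io.grpc.stub.StreamObserver;

final class CompleteGuardSketch {
    static <T> void completeIfAlive(StreamObserver<T> responseObserver) {
        ServerCallStreamObserver<T> observer =
                (ServerCallStreamObserver<T>) responseObserver;
        if (observer.isCancelled()) {
            return; // the client is gone; closing the stream would fail anyway
        }
        try {
            responseObserver.onCompleted();
        } catch (RuntimeException e) {
            // A race between the check and the close is still possible; the exact
            // exception type depends on gRPC's internal state, so log it here
            // rather than letting it bubble up and restart the worker.
        }
    }
}
```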

Here is my test code: https://gist.github.com/slashvar/8874c52d88895a922398289f81cd7a08
(in the imports, I had to remove the path to my generated proto, as it was pointing to a private repo).

@agunapal (Collaborator) commented:

Awesome, thanks! I'll try your fix with a test case I tried: load an LLM with max_new_tokens set to a large value and send an inference request.

@agunapal (Collaborator) commented:

@slashvar I'm hoping my repro is reliable. I basically cancel one request and then send another request, and I see it crash. With your fix, I no longer see the crash. My colleague @namannandan has identified other cases to be fixed.

For now, I'll merge your PR. Please try with the nightlies to see if your issue is resolved. We can merge the other PR if the issue is not fully resolved.

@agunapal (Collaborator) left a comment:

LGTM. Tried the fix, and it seems to address the crash.

@agunapal added this pull request to the merge queue Jul 31, 2024
Merged via the queue into pytorch:master with commit e5db414 Jul 31, 2024
9 of 12 checks passed
Successfully merging this pull request may close these issues:

  • TorchServe crashes in production with `WorkerThread - IllegalStateException error`