
GPU deadlock for pytorch models using the python wrapper #1662

Closed
parthshah86 opened this issue Apr 4, 2020 · 6 comments
Labels
lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale.

Comments

@parthshah86

parthshah86 commented Apr 4, 2020

We are using seldon-core to serve a PyTorch BERT model based on https://github.com/huggingface/transformers, using the python wrapper serving option. We have the Java version of the seldon operator enabled.
When there are multiple concurrent requests to the model server, we observe that all the requests time out, and on the seldon serving pod GPU and CPU utilization go to 100%. Even after we kill the requests, the utilization does not go down.

Logs from seldon-container-engine:

org.springframework.web.client.ResourceAccessException: I/O error on POST request for "http://localhost:9000/predict": Read timed out; nested exception is java.net.SocketTimeoutException: Read timed out

This issue seems to be related to pytorch/pytorch#22259, where PyTorch can deadlock on the GPU under multiple concurrent requests.

With the simple python wrapper, it would be useful to expose a threading parameter to control the number of concurrent threads.

Seldon version - 1.0.2
Python version - 3.6
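
A possible workaround while no such parameter is exposed is to serialise GPU inference inside the wrapped model class, so concurrent HTTP requests never issue overlapping CUDA work. The sketch below is illustrative only: the model name, max length, and class name are placeholders, and it assumes a transformers release where the tokenizer is callable and model outputs expose `.logits`.

```python
import threading

import torch
from transformers import BertForSequenceClassification, BertTokenizer


class BertClassifier:
    """Seldon python wrapper model that serialises GPU inference with a lock."""

    def __init__(self):
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        self.tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
        self.model = BertForSequenceClassification.from_pretrained("bert-base-uncased")
        self.model.to(self.device).eval()
        # Only one request at a time may touch the GPU.
        self._gpu_lock = threading.Lock()

    def predict(self, X, features_names=None):
        texts = [str(x) for x in X]
        inputs = self.tokenizer(
            texts, padding=True, truncation=True, max_length=128, return_tensors="pt"
        ).to(self.device)
        with self._gpu_lock, torch.no_grad():
            logits = self.model(**inputs).logits
        return logits.cpu().numpy()
```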

@parthshah86 parthshah86 added the triage Needs to be triaged and prioritised accordingly label Apr 4, 2020
@axsaucedo
Contributor

@parthshah86 this is interesting. It would not be too complex to expose the ability to control the threading parameter at the Python Wrapper level, but the Service Orchestrator (both the Java and Go options) would still continue running requests in parallel, so it would be good to make sure this can be achieved end-to-end. If you are using an ingress provider like Ambassador, you may also be able to leverage the "circuit breaker" functionality that is being added through #1661, which would ensure a cap on parallel requests end-to-end. Having said that, even with the circuit breaker in place, it does sound like it will be key to also make sure the component coordinating the requests has the correct logic to limit the number of concurrent requests being sent.
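
To sketch the model-level half of that idea (purely illustrative, not an existing Seldon feature; the cap value, timeout, and class are hypothetical), a bounded semaphore can cap in-flight predictions and fail fast once the cap is reached, mirroring what an upstream circuit breaker would do:

```python
import threading


class CappedModel:
    """Illustrative only: cap concurrent predictions inside the python wrapper."""

    MAX_CONCURRENT = 2        # hypothetical cap on in-flight GPU requests
    ACQUIRE_TIMEOUT_S = 5.0   # fail fast rather than queueing indefinitely

    def __init__(self):
        self._slots = threading.BoundedSemaphore(self.MAX_CONCURRENT)

    def predict(self, X, features_names=None):
        if not self._slots.acquire(timeout=self.ACQUIRE_TIMEOUT_S):
            # Surface an error to the orchestrator instead of piling up work on the GPU.
            raise RuntimeError("too many concurrent requests, try again later")
        try:
            return self._infer(X)
        finally:
            self._slots.release()

    def _infer(self, X):
        # Placeholder for the actual GPU inference (see the BERT sketch above).
        return X
```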

@axsaucedo
Contributor

One more thing to mention is that we have been thinking of potentially creating a pre-packaged model server specifically for Hugging Face transformers and/or models, so it would be interesting to see what your python wrapper looks like, as we could perhaps generalise it into a pre-packaged model server.

@ukclivecox ukclivecox removed the triage Needs to be triaged and prioritised accordingly label Apr 9, 2020
@seldondev
Collaborator

Issues go stale after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
/lifecycle stale

@seldondev seldondev added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 10, 2020
@axsaucedo
Contributor

axsaucedo commented Jul 15, 2020

This is now resolved, as the Seldon model server now allows running in single-threaded mode.

@parthshah86
Author

Sorry, somehow missed this message. We did make a patch in seldon-core to add single-threaded model server support.

Interesting to know that you are thinking about supporting a native Hugging Face transformer server.
On our side, our model server takes an S3 path and loads the model from it; there is also logic to batch and tokenize the examples as they are passed into the model, plus some post-processing to clean up the results.

Thanks for the response @axsaucedo !!
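
For readers following along, a minimal sketch of a server along these lines is shown below. It is not the actual implementation described above: the MODEL_S3_PATH environment variable, bucket layout, Auto* model classes, and softmax post-processing are all assumptions.

```python
import os
import tempfile

import boto3
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer


class TransformerModel:
    """Loads a fine-tuned transformer from S3 and serves it via the Seldon python wrapper."""

    def __init__(self, model_s3_path=None):
        # MODEL_S3_PATH is a hypothetical setting, e.g. "s3://my-bucket/models/bert".
        s3_path = model_s3_path or os.environ["MODEL_S3_PATH"]
        local_dir = self._download(s3_path)
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        self.tokenizer = AutoTokenizer.from_pretrained(local_dir)
        self.model = AutoModelForSequenceClassification.from_pretrained(local_dir)
        self.model.to(self.device).eval()

    def _download(self, s3_path):
        # Copy every object under the S3 prefix into a local temp directory.
        bucket, _, prefix = s3_path.replace("s3://", "").partition("/")
        local_dir = tempfile.mkdtemp()
        s3 = boto3.resource("s3")
        for obj in s3.Bucket(bucket).objects.filter(Prefix=prefix):
            if obj.key.endswith("/"):
                continue  # skip "directory" placeholder keys
            target = os.path.join(local_dir, os.path.basename(obj.key))
            s3.Bucket(bucket).download_file(obj.key, target)
        return local_dir

    def predict(self, X, features_names=None):
        # Tokenize the incoming batch of raw texts.
        texts = [str(x) for x in X]
        inputs = self.tokenizer(
            texts, padding=True, truncation=True, return_tensors="pt"
        ).to(self.device)
        with torch.no_grad():
            logits = self.model(**inputs).logits
        # Post-processing: return class probabilities.
        return torch.softmax(logits, dim=-1).cpu().tolist()
```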

@axsaucedo
Contributor

Great to hear it has been resolved @parthshah86 - we have also released batch capabilities (which may be disjoint from what you mentioned above if you were referring to multibatching): https://docs.seldon.io/projects/seldon-core/en/latest/servers/batch.html

This issue was closed.