GPU deadlock for pytorch models using the python wrapper #1662
@parthshah86 this is interesting. It would not be too complex to expose the ability to control the threading parameter at the Python wrapper level, but the service orchestrator (both the Java and Go options) will still run requests in parallel, so it would be good to make sure this can be achieved end-to-end. If you are using an ingress provider like Ambassador, you may also be able to leverage the "circuit breaker" functionality being added through #1661, which would ensure a cap on parallel requests end-to-end. Having said that, even with the circuit breaker in place, it will be key to make sure the component that coordinates the requests also has the correct logic to limit the number of concurrent requests being sent.
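As a sketch of the client-side coordination logic mentioned above, a caller can cap in-flight requests with a semaphore. This is a hypothetical illustration using only the standard library; `MAX_IN_FLIGHT` and `send_request` are made-up names, not part of Seldon's API:

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Hypothetical sketch: cap the number of in-flight requests on the
# caller side with a bounded semaphore. Names are illustrative only.
MAX_IN_FLIGHT = 2
_slots = threading.BoundedSemaphore(MAX_IN_FLIGHT)

def send_request(payload):
    # Placeholder for a real HTTP call to the model server.
    return {"echo": payload}

def guarded_request(payload):
    with _slots:  # blocks until one of the MAX_IN_FLIGHT slots frees up
        return send_request(payload)

# Even with 8 worker threads, at most 2 requests are in flight at once.
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(guarded_request, range(10)))
```

The same pattern applies whether the coordinator is a batch driver or an upstream service: the semaphore, not the thread pool size, is what bounds concurrency against the model server.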
One more thing to mention is that we have been thinking of potentially creating a pre-packaged model server specifically for Hugging Face transformers and/or models, so it would be interesting to see what your python wrapper looks like, as we could perhaps generalise it into a pre-packaged model server.
Issues go stale after 30d of inactivity.
This is now resolved, as the Seldon model server now allows running as a single thread.
Sorry, somehow missed this message. We did make a patch in seldon-core to support a single-threaded model server. Interesting to know that you are thinking about supporting a native Hugging Face transformer server. Thanks for the response @axsaucedo!
Great to hear it has been resolved @parthshah86 - we have also released batch capabilities (which may be disjoint from what you mentioned above if you were referring to multibatching): https://docs.seldon.io/projects/seldon-core/en/latest/servers/batch.html
We are using seldon-core to serve the PyTorch BERT model based on https://github.com/huggingface/transformers using the python wrapper serving option. We have the Java version of the seldon operator enabled.
When there are multiple concurrent requests to the model server, we observe that all the requests time out, and on the Seldon serving pod, GPU and CPU utilization goes to 100%. Even after we kill the requests, the utilization does not go down.
Logs from seldon-container-engine:
This issue seems to be related to pytorch/pytorch#22259, where PyTorch can have GPU deadlock with multiple requests.
With the simple python wrapper it would be useful to expose a threading param to control the number of concurrent threads.
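Until such a parameter is exposed, one workaround is to serialize inference inside the model class itself. The sketch below follows the python wrapper's convention of a class with a `predict` method, but the lock (and the `_infer` stand-in for the real PyTorch forward pass) are our additions, not a Seldon feature:

```python
import threading

class Model:
    """Seldon python-wrapper style model class. The lock serializing
    predict() calls is a workaround sketch, not part of Seldon itself."""

    def __init__(self):
        self._lock = threading.Lock()
        # self.model = load_pytorch_model()  # real model loading goes here

    def predict(self, X, features_names=None):
        # Allow only one inference at a time, avoiding the GPU deadlock
        # described in pytorch/pytorch#22259 when requests run in parallel.
        with self._lock:
            return self._infer(X)

    def _infer(self, X):
        # Stand-in for the real PyTorch forward pass.
        return [x * 2 for x in X]

m = Model()
print(m.predict([1, 2, 3]))  # → [2, 4, 6]
```

Note that this only serializes work inside a single pod; requests still queue behind the lock, so timeouts need to be sized accordingly.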
Seldon Version - 1.0.2
Python version - 3.6