
What is Triton's strategy for running models in parallel: multi-threading or multi-processing? #6253

Answered by dyastremsky
heivens asked this question in Q&A

This differs by backend and model configuration. For example, the Python backend runs each model in its own process, while the TensorRT backend uses CUDA streams. Model configuration matters too: if you specify multiple instances of a model on the same device, many backends use multi-threading to run inference on those instances in parallel.
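
As a concrete illustration of the multiple-instance case, here is a minimal `config.pbtxt` sketch (the model name and backend are assumptions for the example; `instance_group` is Triton's standard field for requesting parallel instances):

```
# config.pbtxt (illustrative; model name and backend chosen for the example)
name: "my_model"
backend: "onnxruntime"
max_batch_size: 8
instance_group [
  {
    # Two instances of this model on GPU 0. For most backends these
    # instances share the server process and run inference on
    # separate threads, enabling parallel execution.
    count: 2
    kind: KIND_GPU
    gpus: [ 0 ]
  }
]
```

With the Python backend, the same `instance_group` setting instead launches a separate stub process per instance, which is why that backend scales by processes rather than threads.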
