Replies: 4 comments
-
I would like to know this too. Does llama-cpp-python have a native, built-in way to do this?
-
I also tried these as described here; they don't work:
-
Any solution?
-
I'm looking for this also... Any updates on this?
-
Hi,
I am running on GPU using the following command.
python3 -m llama_cpp.server --host xx.xx.xxx.xx --port 4444 --model /home/user1/llama.cpp-old/models/codellama-7b-instruct.Q8_0.gguf --n_gpu_layers -1 --n_threads 5 --n_threads_batch 5 --interrupt_requests false
but I still can't get concurrent inference requests to work. I just need multiple inference requests to return answers at the same time. Currently, all requests get queued and handled one at a time.
Any idea how to do that?
Thank you
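-
For anyone debugging this, here is a minimal sketch for checking whether the server actually handles requests in parallel or serializes them. It assumes the OpenAI-compatible /v1/completions route exposed by llama_cpp.server; the host, port, prompts, and worker count below are placeholders you would adjust to your own setup.
```python
# Hypothetical concurrency check: fire several completion requests in
# parallel and report how long each one takes. If the server serializes
# requests, the durations will grow roughly linearly instead of overlapping.
import time
from concurrent.futures import ThreadPoolExecutor

import requests

BASE_URL = "http://xx.xx.xxx.xx:4444/v1/completions"  # placeholder host/port

def one_request(i: int) -> str:
    start = time.time()
    resp = requests.post(
        BASE_URL,
        json={
            "prompt": f"Request {i}: write a short haiku about GPUs.",
            "max_tokens": 64,
        },
        timeout=600,
    )
    resp.raise_for_status()
    return f"request {i} finished in {time.time() - start:.1f}s"

if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=4) as pool:
        for line in pool.map(one_request, range(4)):
            print(line)
```
This only measures the behavior; it doesn't change it. If the timings show strictly sequential handling, the bottleneck is the server-side queueing described above, not the client.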