Delay between prompt submission and first token generation with longer prompts #1046
Unanswered · jhthompson12 asked this question in Q&A
First off, llama-cpp-python has been very fun to use, and I'm very grateful for all the work done here to make this so accessible!
I've been building a RAG pipeline using the llama-cpp-python OpenAI-compatible server functionality and have been working my way up from running on just a laptop to running on a dedicated workstation VM with access to an Nvidia A100. After the most recent transition to the machine with the A100, I was expecting (naively?) this RAG pipeline to be blazing fast, but I've been surprised to find that this is not currently the case.
What I'm experiencing is a seemingly linear relationship between the length of my prompt and the time it takes to get back the first response tokens (with streaming enabled). But once the tokens start streaming, the response speed is very acceptable.
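In case it helps, this is roughly how I'm measuring the delay before the first token — a minimal sketch, not my exact pipeline; the `base_url`, API key, model alias, and prompt contents are placeholders for my actual setup:

```python
import time
from openai import OpenAI

# Pointed at the local llama-cpp-python OpenAI-compatible server; the base_url,
# api_key, and model alias are placeholders for whatever the server was started with.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

def time_to_first_token(prompt: str) -> float:
    """Return seconds from request submission until the first streamed content chunk."""
    start = time.perf_counter()
    stream = client.chat.completions.create(
        model="local-model",
        messages=[{"role": "user", "content": prompt}],
        stream=True,
        max_tokens=32,
    )
    for chunk in stream:
        # The first chunk carrying actual content marks the end of prompt processing.
        if chunk.choices and chunk.choices[0].delta.content:
            return time.perf_counter() - start
    return time.perf_counter() - start

# Grow the prompt and watch the delay before the first token grow with it.
context = "Some retrieved context paragraph for the RAG prompt. "
for n in (1, 10, 50, 100, 200):
    prompt = context * n + "\nSummarize the context above."
    print(f"{n:4d} paragraphs -> first token after {time_to_first_token(prompt):.2f} s")
```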
The culprit for the initial delay seems to be the first run of the `self.eval(tokens)` method. I'm very new to LLMs and GPUs, so I'm trying to understand:

1. Why `self.eval(tokens)` takes so long for longer prompts.
2. Is it possible that the `eval` step is running on the CPU instead of the GPU (see the sketch below)? Or is this just the way it is, and there's no way to improve it with my current setup?
3. If there is nothing to improve in my current setup, is there any reason to believe that other tools for running Llama 2, like HuggingFace's Text Generation Inference or vLLM, would somehow be faster?
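To make the CPU-vs-GPU question concrete, this is the kind of minimal check I have in mind — a rough sketch only; the model path, context size, and prompt are placeholders, not my actual configuration:

```python
import time
from llama_cpp import Llama

llm = Llama(
    model_path="./models/llama-2-13b-chat.Q4_K_M.gguf",  # placeholder path
    n_ctx=4096,
    n_gpu_layers=-1,  # request that all layers be offloaded to the GPU
    n_batch=512,      # prompt tokens are evaluated in batches of this size
    verbose=True,     # startup log should include an "offloaded X/Y layers to GPU" line
)

# Time the prompt-processing step in isolation.
long_prompt = ("Some retrieved context paragraph for the RAG prompt. " * 300
               + "\nSummarize the context above.")
tokens = llm.tokenize(long_prompt.encode("utf-8"))

start = time.perf_counter()
llm.eval(tokens)  # the call that seems to dominate my time to first token
print(f"eval of {len(tokens)} prompt tokens took {time.perf_counter() - start:.2f} s")
```

The idea is just to separate the offload question (checked via the verbose startup log) from the timing question (the `eval` call on a long tokenized prompt).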
Other useful details:
Thanks in advance for your time!
-
Replies: 1 comment, 1 reply

@jhthompson12 Hi there! I am trying to solve a similar problem and have encountered the same issue.