Delay between prompt submission and first token generation with longer prompts #1046
Unanswered · jhthompson12 asked this question in Q&A
First off, llama-cpp-python has been very fun to use, and I'm very grateful for all the work done here to make this so accessible!
I've been building a RAG pipeline using the llama-cpp-python OpenAI-compatible server functionality and have been working my way up from running on just a laptop to running on a dedicated workstation VM with access to an Nvidia A100. After the most recent transition to the machine with the A100, I was expecting (naively?) this RAG pipeline to be blazing fast, but I've been surprised to find that this is not currently the case.
What I'm experiencing is a seemingly linear relationship between the length of my prompt and the time it takes to get back the first response tokens (with streaming enabled). But once the tokens start streaming, the response speed is very acceptable.
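In case it helps, this is roughly how I'm measuring the delay before the first token — a minimal sketch, not my exact pipeline; the `base_url`, API key, model alias, and prompt contents are placeholders for my actual setup:

```python
import time
from openai import OpenAI

# Pointed at the local llama-cpp-python OpenAI-compatible server; the base_url,
# api_key, and model alias are placeholders for whatever the server was started with.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

def time_to_first_token(prompt: str) -> float:
    """Return seconds from request submission until the first streamed content chunk."""
    start = time.perf_counter()
    stream = client.chat.completions.create(
        model="local-model",
        messages=[{"role": "user", "content": prompt}],
        stream=True,
        max_tokens=32,
    )
    for chunk in stream:
        # The first chunk carrying actual content marks the end of prompt processing.
        if chunk.choices and chunk.choices[0].delta.content:
            return time.perf_counter() - start
    return time.perf_counter() - start

# Grow the prompt and watch the delay before the first token grow with it.
context = "Some retrieved context paragraph for the RAG prompt. "
for n in (1, 10, 50, 100, 200):
    prompt = context * n + "\nSummarize the context above."
    print(f"{n:4d} paragraphs -> first token after {time_to_first_token(prompt):.2f} s")
```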
The culprit for the initial delay seems to be the first run of the `self.eval(tokens)` method. I'm very new to LLMs and GPUs, so I'm trying to understand:

1. Why `self.eval(tokens)` takes so long for longer prompts.
2. Is it possible that the `eval` step is running on the CPU instead of the GPU (see the sketch below)? Or is this just the way it is, and there's no way to improve it with my current setup?
3. If there is nothing to improve in my current setup, is there any reason to believe that other tools for running Llama 2, like HuggingFace's Text Generation Inference or vLLM, would somehow be faster?
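To make the CPU-vs-GPU question concrete, this is the kind of minimal check I have in mind — a rough sketch only; the model path, context size, and prompt are placeholders, not my actual configuration:

```python
import time
from llama_cpp import Llama

llm = Llama(
    model_path="./models/llama-2-13b-chat.Q4_K_M.gguf",  # placeholder path
    n_ctx=4096,
    n_gpu_layers=-1,  # request that all layers be offloaded to the GPU
    n_batch=512,      # prompt tokens are evaluated in batches of this size
    verbose=True,     # startup log should include an "offloaded X/Y layers to GPU" line
)

# Time the prompt-processing step in isolation.
long_prompt = ("Some retrieved context paragraph for the RAG prompt. " * 300
               + "\nSummarize the context above.")
tokens = llm.tokenize(long_prompt.encode("utf-8"))

start = time.perf_counter()
llm.eval(tokens)  # the call that seems to dominate my time to first token
print(f"eval of {len(tokens)} prompt tokens took {time.perf_counter() - start:.2f} s")
```

The idea is just to separate the offload question (checked via the verbose startup log) from the timing question (the `eval` call on a long tokenized prompt).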
Other useful details:
Thanks in advance for your time!
-
Replies: 1 comment, 1 reply

@jhthompson12 Hi there! I am trying to solve a similar problem and have encountered the same issue.