Unexpected performance issue with longer prompts? #938
Currently, I am not convinced it is a bug in llama.cpp, but probably there is some room for improvement. Here are a few things to try to improve the performance of
I hope at some point we will have better GPU support in ggml, but this will probably take some time. Otherwise, this is a very cool idea and I will be very happy if you succeed in implementing it and making it run efficiently!
Thanks for taking a look! OpenBLAS helped, but I agree the issue appears to just be a compute bottleneck and this will require GPUs to run.
With the recently added cuBLAS support, people are reporting significant speed improvements when running large-prompt inference: #1065 (comment) (multiple times faster than before, depending on the GPU you have). The current approach makes it possible to take advantage of GPUs with low VRAM even on the largest models, since the model weights are passed to the GPU "on-demand" instead of loading everything into VRAM. Maybe you will be interested in giving your idea another try with the latest version of this repo and cuBLAS enabled.
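To make the "on-demand" idea concrete, here is a rough sketch of how a cuBLAS-backed matrix multiply can stream weights from host RAM to the GPU one layer at a time, so only a single layer's matrix ever has to fit in VRAM. This is not llama.cpp's actual code path; the function name and dimensions are illustrative.

```cpp
// Illustrative sketch only (not llama.cpp's actual code): "on-demand" offload
// keeps the weights in host RAM and copies one layer's matrix to the GPU just
// for the duration of its GEMM, so VRAM only has to hold a single layer.
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <vector>

// C (m x n) = A (m x k) * B (k x n), column-major; A (the weights) starts on the host.
static void offloaded_gemm(cublasHandle_t handle,
                           const float * A_host, const float * B_dev, float * C_dev,
                           int m, int n, int k) {
    float * A_dev = nullptr;
    cudaMalloc((void **) &A_dev, sizeof(float) * m * k);

    // Stream this layer's weights to the GPU right before they are needed ...
    cudaMemcpy(A_dev, A_host, sizeof(float) * m * k, cudaMemcpyHostToDevice);

    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                m, n, k, &alpha, A_dev, m, B_dev, k, &beta, C_dev, m);

    // ... and release them immediately so the next layer can reuse the VRAM.
    cudaFree(A_dev);
}

int main() {
    const int m = 4096, k = 4096, n = 512;      // one weight matrix times a 512-token prompt batch
    std::vector<float> A_host(m * k, 0.01f);    // "layer weights" kept in host RAM

    float * B_dev = nullptr, * C_dev = nullptr; // activations stay on the GPU for the GEMM
    cudaMalloc((void **) &B_dev, sizeof(float) * k * n);
    cudaMalloc((void **) &C_dev, sizeof(float) * m * n);

    cublasHandle_t handle;
    cublasCreate(&handle);
    offloaded_gemm(handle, A_host.data(), B_dev, C_dev, m, n, k);
    cublasDestroy(handle);

    cudaFree(B_dev);
    cudaFree(C_dev);
    return 0;
}
```

Because the copy is done per layer and the buffer is freed right after the GEMM, peak VRAM use is bounded by the largest single weight matrix rather than the whole model, which is why low-VRAM GPUs still benefit on large-prompt workloads.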
Got pretty far through implementing a llama.cpp-based tool that uses the 65B model to do static code analysis, but ran into a wall. The ggml inference engine gets incredibly slow when the past context is long, which is very different from GPU behavior.
The GPU version of my code only gets about 2x slower with a long prompt, but the ggml CPU version is something like 100x slower. This makes my idea impractical on CPU, which makes me sad.
I was expecting it to take about 1 second per token, so maybe 4 seconds to generate a score between 0 and 1 for each function in the C++ code being analyzed, which would have been fine.
Maybe this is a performance bug in llama_eval()? The main reason I've come to this conclusion is that, when using the ./main chat app, I observe that it takes time per input token as well as per output token, while the Hugging Face LLaMA library hardly cares how long the input is: performance is at most about 2x worse.
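For reference, here is a minimal sketch of feeding the whole prompt to llama_eval() in one batched call and timing it, assuming the C API as it existed around the time of this issue (llama_init_from_file, llama_tokenize, llama_eval); newer versions of the library use a different API, so the names and signatures may not match current code.

```cpp
// Minimal sketch: time a batched prompt evaluation with the old llama.cpp C API.
// Assumes llama_init_from_file / llama_tokenize / llama_eval as they existed
// around the time of this issue; treat this as a sketch, not current API usage.
#include "llama.h"

#include <chrono>
#include <cstdio>
#include <vector>

int main(int argc, char ** argv) {
    if (argc < 3) {
        fprintf(stderr, "usage: %s <model.bin> <prompt>\n", argv[0]);
        return 1;
    }

    llama_context_params params = llama_context_default_params();
    params.n_ctx = 2048;

    llama_context * ctx = llama_init_from_file(argv[1], params);

    // Tokenize the whole prompt up front.
    std::vector<llama_token> tokens(params.n_ctx);
    const int n_tokens = llama_tokenize(ctx, argv[2], tokens.data(),
                                        (int) tokens.size(), /*add_bos=*/true);
    tokens.resize(n_tokens);

    // Evaluate all prompt tokens in a single llama_eval call; with a BLAS build
    // this turns the prompt pass into large matrix multiplies rather than
    // strictly per-token work.
    const auto t0 = std::chrono::steady_clock::now();
    llama_eval(ctx, tokens.data(), n_tokens, /*n_past=*/0, /*n_threads=*/8);
    const auto t1 = std::chrono::steady_clock::now();

    printf("prompt eval: %d tokens in %.2f s\n",
           n_tokens, std::chrono::duration<double>(t1 - t0).count());

    llama_free(ctx);
    return 0;
}
```

Timing the batched call like this makes it easy to compare CPU builds (with and without BLAS) against the GPU numbers discussed above as the prompt length grows.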
Here's my branch: https://github.com/catid/llamanal.cpp/tree/main/examples/analysis
Test code is here: https://github.com/catid/llamanal.cpp/blob/d9f666a39c1a2e82a34e1508ba4c6121cae7a932/examples/analysis/oracle.cpp#L52