Unexpected performance issue with longer prompts? #938
Currently, I am not convinced it is a bug in llama.cpp, but probably there is some room for improvement. Here are a few things to try to improve the performance of
I hope at some point we will have better GPU support in ggml, but this will probably take some time. Otherwise, this is a very cool idea and I will be very happy if you succeed in implementing it and making it run efficiently!
Thanks for taking a look! OpenBLAS helped, but I agree the issue appears to just be a compute bottleneck and this will require GPUs to run.
With the recently added cuBLAS support, people are reporting significant speed improvements when running large-prompt inference: #1065 (comment) (multiple times faster than before, depending on the GPU you have). The current approach makes it possible to take advantage of GPUs with low VRAM even on the largest models, since the model weights are passed to the GPU "on-demand" instead of loading everything into VRAM. Maybe you will be interested in giving your idea another try with the latest version of this repo and cuBLAS enabled.
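To make the "on-demand" idea concrete, here is a rough sketch of how a cuBLAS-backed matrix multiply can stream weights from host RAM to the GPU one layer at a time, so only a single layer's matrix ever has to fit in VRAM. This is not llama.cpp's actual code path; the function name and dimensions are illustrative.

```cpp
// Illustrative sketch only (not llama.cpp's actual code): "on-demand" offload
// keeps the weights in host RAM and copies one layer's matrix to the GPU just
// for the duration of its GEMM, so VRAM only has to hold a single layer.
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <vector>

// C (m x n) = A (m x k) * B (k x n), column-major; A (the weights) starts on the host.
static void offloaded_gemm(cublasHandle_t handle,
                           const float * A_host, const float * B_dev, float * C_dev,
                           int m, int n, int k) {
    float * A_dev = nullptr;
    cudaMalloc((void **) &A_dev, sizeof(float) * m * k);

    // Stream this layer's weights to the GPU right before they are needed ...
    cudaMemcpy(A_dev, A_host, sizeof(float) * m * k, cudaMemcpyHostToDevice);

    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                m, n, k, &alpha, A_dev, m, B_dev, k, &beta, C_dev, m);

    // ... and release them immediately so the next layer can reuse the VRAM.
    cudaFree(A_dev);
}

int main() {
    const int m = 4096, k = 4096, n = 512;      // one weight matrix times a 512-token prompt batch
    std::vector<float> A_host(m * k, 0.01f);    // "layer weights" kept in host RAM

    float * B_dev = nullptr, * C_dev = nullptr; // activations stay on the GPU for the GEMM
    cudaMalloc((void **) &B_dev, sizeof(float) * k * n);
    cudaMalloc((void **) &C_dev, sizeof(float) * m * n);

    cublasHandle_t handle;
    cublasCreate(&handle);
    offloaded_gemm(handle, A_host.data(), B_dev, C_dev, m, n, k);
    cublasDestroy(handle);

    cudaFree(B_dev);
    cudaFree(C_dev);
    return 0;
}
```

Because the copy is done per layer and the buffer is freed right after the GEMM, peak VRAM use is bounded by the largest single weight matrix rather than the whole model, which is why low-VRAM GPUs still benefit on large-prompt workloads.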
Got pretty far through implementing a llama.cpp-based tool that uses the 65B model to do static code analysis, but ran into a wall. The ggml inference engine gets incredibly slow when the past context is long, which is very different from GPU behavior.
The GPU version of my code only gets about 2x slower with a long prompt, but the ggml CPU version is something like 100x slower. This makes my idea impractical on CPU, which makes me sad.
I was expecting it to take about 1 second per token, so maybe 4 seconds to generate a score between 0 and 1 for each function in the C++ code being analyzed, which would have been fine.
Maybe this is a performance bug in llama_eval()? The main reason I've come to this conclusion is that, when using the ./main chat app, I observe that it takes time per input token as well as per output token, while the Hugging Face LLaMA library hardly cares how long the input is: performance is at most about 2x worse.
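For reference, here is a minimal sketch of feeding the whole prompt to llama_eval() in one batched call and timing it, assuming the C API as it existed around the time of this issue (llama_init_from_file, llama_tokenize, llama_eval); newer versions of the library use a different API, so the names and signatures may not match current code.

```cpp
// Minimal sketch: time a batched prompt evaluation with the old llama.cpp C API.
// Assumes llama_init_from_file / llama_tokenize / llama_eval as they existed
// around the time of this issue; treat this as a sketch, not current API usage.
#include "llama.h"

#include <chrono>
#include <cstdio>
#include <vector>

int main(int argc, char ** argv) {
    if (argc < 3) {
        fprintf(stderr, "usage: %s <model.bin> <prompt>\n", argv[0]);
        return 1;
    }

    llama_context_params params = llama_context_default_params();
    params.n_ctx = 2048;

    llama_context * ctx = llama_init_from_file(argv[1], params);

    // Tokenize the whole prompt up front.
    std::vector<llama_token> tokens(params.n_ctx);
    const int n_tokens = llama_tokenize(ctx, argv[2], tokens.data(),
                                        (int) tokens.size(), /*add_bos=*/true);
    tokens.resize(n_tokens);

    // Evaluate all prompt tokens in a single llama_eval call; with a BLAS build
    // this turns the prompt pass into large matrix multiplies rather than
    // strictly per-token work.
    const auto t0 = std::chrono::steady_clock::now();
    llama_eval(ctx, tokens.data(), n_tokens, /*n_past=*/0, /*n_threads=*/8);
    const auto t1 = std::chrono::steady_clock::now();

    printf("prompt eval: %d tokens in %.2f s\n",
           n_tokens, std::chrono::duration<double>(t1 - t0).count());

    llama_free(ctx);
    return 0;
}
```

Timing the batched call like this makes it easy to compare CPU builds (with and without BLAS) against the GPU numbers discussed above as the prompt length grows.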
Here's my branch: https://github.com/catid/llamanal.cpp/tree/main/examples/analysis
Test code is here: https://github.com/catid/llamanal.cpp/blob/d9f666a39c1a2e82a34e1508ba4c6121cae7a932/examples/analysis/oracle.cpp#L52