Replies: 9 comments 27 replies
-
In general, with the transformer architecture, the computation needed for inference scales linearly with the provided context. Other implementations can use parallel computing on the GPU to forward-pass all tokens at the same time, but a CPU-optimized implementation computes the forward pass for each token one by one.
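To make that concrete, here is a toy counter (not llama.cpp code, just a stand-in that tallies "attention work") showing that the total work is the same whether tokens are evaluated one by one or all at once, and that it grows with the context:

```cpp
// Illustrative only: no real llama.cpp API, just counting attention work.
// Each new token attends to all tokens already in the context plus itself.
#include <cstdio>

static long forward_pass(int n_tokens, int n_past) {
    long work = 0;
    for (int i = 0; i < n_tokens; ++i) {
        work += n_past + i + 1; // attend to everything before this token + itself
    }
    return work;
}

int main() {
    const int n_prompt = 512;

    // CPU-style path: one forward pass per token, sequentially
    long sequential = 0;
    for (int i = 0; i < n_prompt; ++i) {
        sequential += forward_pass(1, i);
    }

    // GPU-style path: all prompt tokens in a single forward pass
    long batched = forward_pass(n_prompt, 0);

    printf("attention work: sequential=%ld batched=%ld\n", sequential, batched);
    // Same total either way; batching changes the scheduling, not the amount
    // of work, and the total grows with the prompt length.
    return 0;
}
```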
-
What I can see in the code of main.cpp is that the program iterates through the prompt (or subsequent user input), and every time it accumulates a batch-size (params.n_batch) number of tokens it breaks out of the loop. It then runs inference on the model but discards the result if there is still user input to read.
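Roughly, the loop looks like this (a simplified sketch, not the actual main.cpp code; llama_eval and the parameter names follow the old C API as I understand them, so treat them as illustrative):

```cpp
#include <vector>
#include "llama.h" // old-style llama.cpp C API, as used by main.cpp at the time

// Simplified sketch of the prompt-ingestion loop described above: tokens are
// queued until n_batch of them have been collected, then evaluated in one
// llama_eval call. The logits are ignored while input remains to be read.
static void ingest_prompt(llama_context * ctx,
                          const std::vector<llama_token> & prompt_tokens,
                          int n_batch, int n_threads) {
    std::vector<llama_token> embd; // tokens queued for the next eval
    int n_past = 0;                // tokens already in the KV cache

    for (size_t i = 0; i < prompt_tokens.size(); ++i) {
        embd.push_back(prompt_tokens[i]);

        const bool batch_full = (int) embd.size() >= n_batch;
        const bool last_token = i + 1 == prompt_tokens.size();
        if (batch_full || last_token) {
            llama_eval(ctx, embd.data(), (int) embd.size(), n_past, n_threads);
            n_past += (int) embd.size();
            embd.clear();
        }
    }
}
```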
-
I'm not sure I follow 🤔 Batching lets you execute more operations in a single shot, which is generally faster than doing them one by one sequentially. But at the end of the day, if the batch size is 8, each call is doing 8 tokens' worth of work. The batch size is configurable, and while it can affect performance, setting it higher does not make everything proportionally faster. If it did, a batch size of 4 would make things twice as slow and a batch size of 16 twice as fast, and it's easy to verify that this is not the case. So I would say the following statement is not accurate:
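Some toy arithmetic to illustrate (not llama.cpp code): changing n_batch changes how many eval calls the prompt is split into, not how many tokens have to be pushed through the model:

```cpp
// Toy arithmetic only: a prompt of n_prompt tokens split into batches of
// n_batch takes ceil(n_prompt / n_batch) eval calls, but the token-level
// work is the same regardless of the batch size.
#include <cstdio>

int main() {
    const int n_prompt  = 512;
    const int batches[] = {4, 8, 16, 64};

    for (int n_batch : batches) {
        const int n_calls = (n_prompt + n_batch - 1) / n_batch; // ceil division
        printf("n_batch=%2d -> %3d eval calls, %d tokens of work\n",
               n_batch, n_calls, n_prompt);
    }
    return 0;
}
```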
-
Setting the batch size to a very high number still produces the same result. The whole prompt is ingested in one go, yes, but processing it takes about as long as generating the same number of tokens. Try this little experiment:
-
This is not a llama.cpp problem; it's a 4-bit problem. 8-bit doesn't have this: sure, it's slow, but it starts generating right away. 4-bit, however, has a delay before anything starts. GPU or CPU doesn't matter, there's a delay with 4-bit.
-
What I'm really puzzled about is why none of the other implementations seem to have this issue. Even when running LLaMA, they process the initial prompt in O(1) rather than O(n).
-
Just for comparison, I tried using the original ggml-model-f16.bin files instead, which should not require a dequantization step. Prompt ingestion still seems slower than you would expect, although it does appear a bit quicker.
-
@ggerganov some new findings: 4-bit GPTQ was having similar issues with ingesting large prompts (qwopqwop200/GPTQ-for-LLaMa#82), and they've apparently found a number of solutions; see qwopqwop200/GPTQ-for-LLaMa#87 and https://github.com/fpgaminer/GPTQ-triton
-
Is it possible to use llama.cpp with 8-bit quantization? How can that be accomplished?
-
Apologies if this is an obvious question.
I've used other text inference frameworks before, such as Hugging Face's transformers generate(), and in those cases the generation time was always independent of the initial prompt length. Only the number of generated tokens mattered, regardless of context length; that is to say, generating 5 tokens from a 300-word prompt takes about the same time as generating 5 tokens from a 3-word prompt.
With llama.cpp, however, this does not seem to be the case: the generation time scales almost linearly with the initial prompt length. What's the difference?
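For reference, this is roughly how I separate the two timings (a rough sketch, not a complete program; llama_eval follows the old llama.cpp C API as I understand it, and sampling is skipped since only the per-token eval cost matters here):

```cpp
#include <chrono>
#include <cstdio>
#include <vector>
#include "llama.h" // old-style llama.cpp C API; treat the exact calls as assumptions

static double seconds_since(std::chrono::steady_clock::time_point t0) {
    return std::chrono::duration<double>(std::chrono::steady_clock::now() - t0).count();
}

// Time prompt ingestion and token-by-token generation separately.
static void measure(llama_context * ctx, const std::vector<llama_token> & prompt, int n_threads) {
    int n_past = 0;

    // Phase 1: ingest the whole prompt in one eval call
    auto t0 = std::chrono::steady_clock::now();
    llama_eval(ctx, prompt.data(), (int) prompt.size(), n_past, n_threads);
    n_past += (int) prompt.size();
    printf("prompt ingestion (%zu tokens): %.2f s\n", prompt.size(), seconds_since(t0));

    // Phase 2: evaluate 5 more tokens one at a time (sampling omitted; feeding
    // a dummy token is enough to see the per-token eval cost)
    t0 = std::chrono::steady_clock::now();
    llama_token tok = prompt.back();
    for (int i = 0; i < 5; ++i) {
        llama_eval(ctx, &tok, 1, n_past, n_threads);
        n_past += 1;
    }
    printf("generating 5 tokens: %.2f s\n", seconds_since(t0));
}
```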