Trying to understand why starting with a long prompt is so much slower. #229

LostRuins · 2023-03-17T08:00:37Z

LostRuins
Mar 17, 2023
Collaborator

Apologize if this is an obvious question.

I've used other text inference frameworks before such as huggingface's transformer generate(), and in those cases, the generation time was always independent of the initial prompt length. Only the quantity of generated tokens mattered, regardless of context length. That is to say, generating 5 tokens using a 300 word prompt takes about the same time as generating 5 tokens with 3 word prompt.

However when using Llama.cpp, this does not seem to be the case - the generation time scales almost linearly with initial prompt length. What's the difference?

DKormann · 2023-03-17T16:30:57Z

DKormann
Mar 17, 2023

in general when using the transformer architecture the needed computation for inference scales linearily with the provided context. Other implementations can use parallel computing on gpu to forward pass all tokens at the same time, but cpu optimized implementation computes forward pass for each token one by one.

0 replies

DKormann · 2023-03-17T16:47:20Z

DKormann
Mar 17, 2023

what I can see in the code of main.cpp is that the programm iterates through the prompt (or subsequent user input) and every time it hits batch size (params.n_batch) number of tokens it has to break. Then it makes an inference on the model but discards the result if there is still user input to read.
edit: while increasing batch size might improve performance it does not lead to linear improvement. Increasing batch size reduces iterations but each iteration is computationally bigger. in my experience (7b model 8gb ram) highe batch size lead to modest imprvements reading in long prompts

0 replies

setzer22 · 2023-03-17T17:08:41Z

setzer22
Mar 17, 2023

I'm not sure I follow 🤔

Batching enables executing more operations in a single shot, which is generally faster than doing them one by one sequentially. But at the end of the day, if the batch size is 8, you're doing 8 times more work.

The batch size is configurable, and while it may affect performance, setting it to higher values does not make everything proportionally faster. If that were true, setting batch size to 4 would make things twice as slow, and using 16 would make them twice as fast. It's easy to verify that this is not the case. That would not make any sense at all.

So I would say, the following statement is not accurate:

TLDR: with a default batch size of 8 it takes as much compute to read in 8 input tokens as to produce one output token.

0 replies

LostRuins · 2023-03-18T02:44:41Z

LostRuins
Mar 18, 2023
Collaborator Author

Setting the batch size to a very high number still produces the same result. The whole prompt is ingested in one go, yes, but the processing time takes as long as generating the same number of tokens.

Try this little experiment:

Open ChatGPT and send a one word prompt "Hello". Observe time before reply starts.
Then open a new session and send a long prompt "Hello hello hello hello..."x100
They both seem to take about the same time to begin responding.
Now repeat this in llama.cpp, setting batch size to cover the full prompt with -b 500

3 replies

DKormann Mar 18, 2023

i think you are right. I did set batchsize to 16 and it increased the speed for one prompt but it doesnt seem to generalize well to other batchsizes.

DKormann Mar 18, 2023

on further experimentation setting -b 16 increases speed fo me but not nearly 2 times. As far as I understand batch size in inference right now higher batch size means fewer inference iterations but at higher computational cost each.
So it doesnt seem like there is a simple solution to make ingestion of long prompts faster?
if someone has information on how batch size works exactly in inference please let me know

LostRuins Mar 18, 2023
Collaborator Author

I'm wondering is a full forward pass necessary for each token embedding in the original prompt? I tried skipping it for the initial prompt, but it produced broken output. I don't understand how LLMs work enough to fix this myself.

USBhost · 2023-03-19T04:48:36Z

USBhost
Mar 19, 2023

This is not a llama.cpp problem this is a 4bit problem. 8bit does not have this, sure it's slow but it starts generating right at the start. But 4bit has a delay before anything starts. GPU/CPU does not matter there's a delay with 4bit.

5 replies

LostRuins Mar 19, 2023
Collaborator Author

Any idea why this is the case?

DKormann Mar 19, 2023

Are you sure this is the case even for long prompts?

LostRuins Mar 19, 2023
Collaborator Author

Yes. I've timed it. Starting generation on a 500 word prompt takes over a minute.

USBhost Mar 20, 2023

oobabooga/text-generation-webui#391
qwopqwop200/GPTQ-for-LLaMa#34

LostRuins Mar 20, 2023
Collaborator Author

That is good information. So it seems that GPTQ has a similar latency problem. However, llama.cpp is using RTN for 4 bit quantization rather than GPTQ, so I'm not sure if it's directly related. Maybe one of the devs can clarify?

LostRuins · 2023-03-22T03:09:30Z

LostRuins
Mar 22, 2023
Collaborator Author

What I'm really puzzled about is why all other implementations don't seem to have this issue. Even when running llama they'll process the initial prompt in O(1) rather than O(n)

6 replies

jarcen Mar 23, 2023

Yet, as noted above other implementations have no such issue. That's also my experience with running pytorch transformers on CPU. It's absolutely not O(1) as amount of work to do is linear, but definitely in a range of ten times faster when reading prompts.

LostRuins Mar 23, 2023
Collaborator Author

When running CPU-only pytorch, the generation throughput speed is super slow (<1 token a second) but the initial prompt still gets processed super fast (<5 seconds latency to start generating on 1024 context). On llama.cpp this is the opposite.

ggerganov Mar 23, 2023
Maintainer

I have to look a bit more in details, but I believe that when we add SIMD dequantization and you enable OpenBLAS / Accelerate, it will be much faster to process large number of tokens in the initial prompt. Currently, enabling OpenBLAS / Accelerate is slow because there is a dequantization step needed to bring the data back to F32 that I believe takes a lot of time (not 100% sure).

Current implementation invokes BLAS (if enabled) for inputs with >= 32 tokens:

llama.cpp/ggml.c

Lines 5710 to 5732 in a18c192

    
           // helper function to determine if it is better to use BLAS or not 
        
           // for large matrices, BLAS is faster 
        
           static bool ggml_compute_forward_mul_mat_use_blas( 
        
                   const struct ggml_tensor * src0, 
        
                   const struct ggml_tensor * src1, 
        
                         struct ggml_tensor * dst) { 
        
               UNUSED(src0); 
        
               const int ne10 = src1->ne[0]; 
        
               const int ne0 = dst->ne[0]; 
        
               const int ne1 = dst->ne[1]; 
        
               // TODO: find the optimal values for these 
        
               if (ggml_is_contiguous(src0) && 
        
                   ggml_is_contiguous(src1) && ((ne0 >= 32 && ne1 >= 32 && ne10 >= 32))) { 
        
                   //printf("BLAS: %d %d %d\n", ne0, ne1, ne10); 
        
                   return true; 
        
               } 
        
               return false; 
        
           } 
        
           #endif

My suspicion is that this is currently slow:

llama.cpp/ggml.c

Lines 6373 to 6380 in a18c192

    
           int id = 0; 
        
           for (int i01 = 0; i01 < ne01; ++i01) { 
        
               //for (int i00 = 0; i00 < ne00; ++i00) { 
        
               //    wdata[id++] = GGML_FP16_TO_FP32(*(ggml_fp16_t *) ((char *) src0->data + i03*nb03 + i02*nb02 + i01*nb01 + i00*nb00)); 
        
               //} 
        
               dequantize_row_q4_0((char *) src0->data + i03*nb03 + i02*nb02 + i01*nb01, wdata + id, ne00); 
        
               id += ne00; 
        
           }

Alternatively, we can try to implement an efficient block-based matrix multiplication, instead of the existing row-by-row one and not rely on BLAS. But I am not sure I know how to implement it. And I guess the code will become super complex.

LostRuins Mar 24, 2023
Collaborator Author

Thanks for the reply, I'm glad to know you're looking into it!

For anyone else facing the same issue - I did end up using a hacky band-aid fix of caching the array of token IDs for each submitted prompt - and when I encountered a subsequent prompt that shared part of a previous prompt tokens, I could "fast-forward" the n_past and shorten the embd_inp array to avoid reprocessing parts of the prompt with common ancestry.

So if prompt1 = "The quick brown fox jumps over the lazy dog" and prompt2 = "The quick brown fox eats some food", I could set n_past=3 and embd_inp truncated to "eats some food", and somewhat reduce the initial prompt processing time. I dunno if this causes bad side effects.

This works for minor edits towards the end of the prompt, as well as for continuation of a previous prompt. Sadly if the start of the prompt is modified then the whole thing has to be recomputed which takes multiple minutes.

example.mp4

slaren Mar 25, 2023
Collaborator

This is probably stupid and maybe ggml already works this way, but I am wondering, since the main bottleneck seems to be memory bandwidth, could the batches be processed in a way such that the weights only need to be read once for all the items in the batch?

LostRuins · 2023-03-25T03:02:03Z

LostRuins
Mar 25, 2023
Collaborator Author

Just for comparison I tried using the original ggml-model-f16.bin files instead, which should not require a dequantization step. However, prompt ingestion still seems to be slower than you would expect, although it does appear a bit quicker.

13 replies

linouxis9 Mar 28, 2023

By default llama.cpp will limit ggml to 1 thread when using BLAS only if the batch size is >255:

llama.cpp/llama.cpp

Line 859 in 4b8efff

gf.n_threads = N > 255 && ggml_cpu_has_blas() ? 1 : n_threads;

The problem is that there is a mismatch in ggml which will use BLAS as long as the batch size is >= 32:

llama.cpp/ggml.c

Line 5784 in 4b8efff

ggml_is_contiguous(src1) && ((ne0 >= 32 && ne1 >= 32 && ne10 >= 32))) {

This leads to issues when the batch size >=32 and <=255. We need to determine what is the optimal batch size to start using BLAS, and use that value consistently.

You are right @slaren!! I was using a batch size <= 255, that explains the weird behavior. Thanks :-)

Piezoid Mar 28, 2023

@linouxis9 When a batch size > 255 (-b 256) is used, ggml will use a single thread for prompt processing and let OpenBLAS do the multiprocessing. See this commit 4640eff2
(woops, I'm too slow)

Using more than 6 ggml threads is very slow, I believe that the efficiency cores are bottlenecking.
I opened a PR to OpenBLAS to improve the issue I had with it on Intel 13th gen: OpenMathLib/OpenBLAS#3970

Also, both OpenBLAS and ggml try to use the hyperthreaded cores which is counterproductive.

linouxis9 Mar 28, 2023

@Piezoid @slaren I tried using -b 256 and -b 512, and ggml's 6 threads (from -t 6) are still spawned by ggml (alongside BLAS threads) when doing initial prompt ingestion:

llama -m /opt/models/llama-30B/ggml-model-q4_0.bin -n -1 --color -i -r "User:" -f /opt/prompts/chat-with-bob.txt -t 6 -b 256 -c 2048

Using -t 1 yields the expected behavior (only 1 thread for ggml, and the threads I set in env variable for BLAS)

slaren Mar 28, 2023
Collaborator

Even if you use -b 512, the last batch of the prompt may have less than 256 tokens which will still cause llama.cpp to instruct ggml to use more threads for that last batch, even if BLAS will be used. As I said, the mismatch needs to be fixed.

linouxis9 Mar 28, 2023

I understand, thank you :-) Opened issue: #578

LostRuins · 2023-03-28T08:13:28Z

LostRuins
Mar 28, 2023
Collaborator Author

@ggerganov some new findings: 4-bit GPTQ was having similar issues with ingesting large prompts qwopqwop200/GPTQ-for-LLaMa#82 and they've apparently found a bunch of solutions, see qwopqwop200/GPTQ-for-LLaMa#87 and https://github.com/fpgaminer/GPTQ-triton

0 replies

conorwsullivan · 2023-03-28T10:50:10Z

conorwsullivan
Mar 28, 2023

Is it possible to use llama.cpp with 8bit quantizations? How can that be accomplished?

0 replies

Trying to understand why starting with a long prompt is so much slower. #229

LostRuins Mar 17, 2023 Collaborator

Replies: 9 comments · 27 replies

LostRuins Mar 18, 2023 Collaborator Author

LostRuins Mar 18, 2023 Collaborator Author

LostRuins Mar 19, 2023 Collaborator Author

LostRuins Mar 19, 2023 Collaborator Author

LostRuins Mar 20, 2023 Collaborator Author

LostRuins Mar 22, 2023 Collaborator Author

LostRuins Mar 23, 2023 Collaborator Author

ggerganov Mar 23, 2023 Maintainer

LostRuins Mar 24, 2023 Collaborator Author

slaren Mar 25, 2023 Collaborator

LostRuins Mar 25, 2023 Collaborator Author

slaren Mar 28, 2023 Collaborator

LostRuins Mar 28, 2023 Collaborator Author

LostRuins
Mar 17, 2023
Collaborator

Replies: 9 comments 27 replies

LostRuins
Mar 18, 2023
Collaborator Author

LostRuins Mar 18, 2023
Collaborator Author

LostRuins Mar 19, 2023
Collaborator Author

LostRuins Mar 19, 2023
Collaborator Author

LostRuins Mar 20, 2023
Collaborator Author

LostRuins
Mar 22, 2023
Collaborator Author

LostRuins Mar 23, 2023
Collaborator Author

ggerganov Mar 23, 2023
Maintainer

LostRuins Mar 24, 2023
Collaborator Author

slaren Mar 25, 2023
Collaborator

LostRuins
Mar 25, 2023
Collaborator Author

slaren Mar 28, 2023
Collaborator

LostRuins
Mar 28, 2023
Collaborator Author