Have n_batch default to 512 when BLAS is enabled #1091
Conversation
It may be better to check for BLAS availability with `ggml_cpu_has_blas()`.
Yeah, that would also work. It's really a preference between hardcoding this default in the struct itself or bumping it up manually after instantiation.
The initializer doesn't have to be constant; you can do something like this: `int32_t n_batch = ggml_cpu_has_blas() ? 512 : 8; // batch size for prompt processing`
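For context, here is a minimal sketch of what that struct-default approach could look like; the surrounding `gpt_params` fields and header layout are assumed, but `ggml_cpu_has_blas()` is the real query exposed by ggml.h:

```cpp
// Sketch of the struct-default approach (surrounding fields elided).
// ggml_cpu_has_blas() is declared in ggml.h and returns nonzero when the
// binary was compiled against a BLAS backend.
#include <cstdint>
#include "ggml.h"

struct gpt_params {
    int32_t n_ctx   = 512;                            // context size
    int32_t n_batch = ggml_cpu_has_blas() ? 512 : 8;  // batch size for prompt processing
    // ...
};
```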
Aargh, I'm getting rusty 😉 94cb00a is the alternate implementation with `ggml_cpu_has_blas()`.
Here are some prompt eval time results with q4_0 13B LLaMA, OpenBLAS, and my standard 320-token prompt. I ran this on Linux on an i5-6500 with 16GB of RAM. An n_batch of 512 is a clear winner in my environment.
512 is also good for cuBLAS.
Wonder if we should use 512 even without BLAS.
From a user-experience point of view, the smaller batch sizes let you know where you are in your initial prompt evaluation, as each batch is written out right before evaluation. Personally I'd happily trade that away for quicker processing, but I do see how it might be nice to watch it churn through your prompt on those slow larger models. I doubt anyone would make anything so intricate, but it might be nice if there were some sort of callback that gives a percentage update, for anyone who wants to build a GUI and use larger batch sizes to take advantage of BLAS. (They could probably just evaluate a token or two and then estimate the time to do a block of 512.)
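Nothing like this exists in the API as of this PR; a hypothetical sketch of such a hook, with every name here (`llama_progress_cb`, `eval_prompt`) invented purely for illustration:

```cpp
#include <cstdio>

// Hypothetical progress hook -- these names do not exist in llama.cpp and
// are invented only to illustrate the idea.
typedef void (*llama_progress_cb)(int n_done, int n_total, void * user_data);

// Walk the prompt in n_batch-sized chunks, reporting after each chunk so a
// GUI can show a percentage even with large, BLAS-friendly batch sizes.
static void eval_prompt(const int * tokens, int n_tokens, int n_batch,
                        llama_progress_cb cb, void * user_data) {
    for (int i = 0; i < n_tokens; i += n_batch) {
        const int n = (n_tokens - i < n_batch) ? (n_tokens - i) : n_batch;
        // ... evaluate tokens[i .. i + n) with the real eval call here ...
        if (cb) {
            cb(i + n, n_tokens, user_data);
        }
    }
}

// Example callback: print a simple percentage to stderr.
static void print_progress(int n_done, int n_total, void * /*user_data*/) {
    fprintf(stderr, "\rprompt eval: %3d%%", 100 * n_done / n_total);
}
```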
b2e8a32 has an n_batch of 512 set by default, regardless of whether BLAS is enabled or not. In my tests without BLAS I saw a prompt eval time of around 260-270 ms/token regardless of batch size, using the same setup as my previous tests. I tried batch sizes between 8 and 512; in my case the batch size does not matter. As an aside, I'm curious why n_batch is hard-limited to 512 - is there a technical reason why we can't use larger values?
I think going above 512 will require increasing some of the buffer sizes in Lines 45 to 97 in 872c365.
But not 100% sure.
Yeah, it segfaults alright with the 512 limit removed, an n_batch of 2048, and a 2k+ length prompt. The same prompt has no issues with an n_batch of 512.
In valgrind:
This is easy to reproduce (just remove the limit, set n_batch to 2048, and use a big prompt). With an n_batch of 1024 and no limit, llama.cpp works fine with 2k+ length prompts, though with OpenBLAS I don't see a performance improvement in prompt ingestion (still around 150 ms/token). So I don't see the need to support even larger n_batch sizes, though for GPU users it may be a different story.
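For reference, the "limit" being removed here is just a clamp applied to the parsed argument; a sketch from memory (the exact spot in the argument parsing may differ):

```cpp
#include <algorithm>
#include <cstdint>

// Sketch from memory: the hard cap under discussion is a one-line clamp on
// the parsed value; deleting it is what exposes the overflow above.
void clamp_n_batch(int32_t & n_batch) {
    n_batch = std::min(n_batch, (int32_t) 512);
}
```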
Regarding the crash, see #1152 (comment) |
Upvote here for an n_batch of 1024 and higher, as I have 128GB of RAM and was seeing a clear performance trend, as reported in #1129 (comment)
@gjmulder From your comment it looks like you have a GPU available to test with. Could you run with a 1024 or higher batch size and see if it improves your results? Again, on CPU it does nothing for me.
@eiery
For perplexity the batch size cannot be greater than the ctx size, so in your case it shows 512 as well. If you increase the ctx size to 1024 then it should work.
This is of course assuming that you have already patched out the hardcoded 512 limit.
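A sketch of why, under the assumption that the perplexity run evaluates one full context window per chunk (names here are illustrative):

```cpp
#include <algorithm>

// Each perplexity chunk spans one context window, so no more than n_ctx
// tokens can ever be submitted per eval -- the effective batch is capped.
int effective_n_batch(int n_batch, int n_ctx) {
    return std::min(n_batch, n_ctx);
}
```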
Here are some CLBlast results on my HD 530 iGPU, with the n_batch limit removed and the patch in #1152 (comment) used to get around the segfault. This is on 13B with a 2000-token prompt.
In this case performance plateaus at the 1024 n_batch mark.
As GGML only uses BLAS when n_batch is 32 or larger, it is not used by default even if llama.cpp is compiled with a BLAS lib, because the default n_batch is 8. My patch sets n_batch to the maximum of 512 when BLAS is enabled at compile time and keeps it at 8 if there is no BLAS.
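The 32 threshold comes from ggml's mul_mat dispatch; paraphrased from memory (not a verbatim quote of ggml.c), the gate looks roughly like this, with the batch size being one of the checked dimensions:

```cpp
#include <cstdint>

// Paraphrased from ggml's mul_mat dispatch of that era (not verbatim):
// BLAS is only taken when every matrix dimension is at least 32. ne1 is
// the number of rows in the result, i.e. the number of tokens in the
// batch, which is why n_batch must be >= 32 for BLAS to kick in.
static bool use_blas(int64_t ne0, int64_t ne1, int64_t ne10) {
    return ne0 >= 32 && ne1 >= 32 && ne10 >= 32;
}
```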
Personally I see the best performance with OpenBLAS and a size of 512, which is why I chose this value. Experimentation may be needed to come up with a good default (as long as it's at least 32).
This came out of a discussion at #1065 (comment).