Multi GPU CUDA - 8x performance degradation when splitting tensors -> let's split by layer as an option #4055
See the conversation starting at #3776 (comment). I am aware of the parallelization scheme where the model is split into blocks of layers instead of splitting each layer into slices. As I said before: I have no intention of implementing it. Multi-GPU only really makes sense for running something like 70b, and for that purpose I think the best buys are either multiple P40s or multiple RTX 3090s. For multiple P40s the current scheme works better, while for multiple RTX 3090s NVLink is available, which should also result in low parallelization overhead. Synchronization overhead may also vary by OS: on Windows, for example, peer access between devices is only available via NVLink, so the performance for multiple GPUs working on small batches should be worse.
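For reference, here is a minimal standalone sketch (not llama.cpp code) of how one could probe whether direct peer access is available between the visible GPUs; the CUDA runtime calls are standard, but the program itself is only illustrative. Without peer access, device-to-device transfers are staged through host memory, which is what makes small-batch multi-GPU decoding so sensitive to the platform.

```cpp
// Illustrative only: report whether each pair of visible GPUs can use direct
// peer access (e.g. over NVLink or PCIe, depending on platform and driver).
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int n = 0;
    cudaGetDeviceCount(&n);
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            if (i == j) continue;
            int can = 0;
            cudaDeviceCanAccessPeer(&can, i, j);
            printf("GPU %d -> GPU %d peer access: %s\n", i, j, can ? "available" : "not available");
            if (can) {
                cudaSetDevice(i);
                cudaDeviceEnablePeerAccess(j, 0); // flags must currently be 0
            }
        }
    }
    return 0;
}
```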
No. Also: see #3814 and check whether that PR has unintentionally resulted in a performance regression for your hardware setup.
That's a pity. NVLink was deprecated in 2022 and is not likely to come back to consumer GPUs. I am aware of the theory, but in practice we have an 800-1000% slowdown with the current implementation of tensor split. Best would be to fix the synchronization problem; splitting by layers would be a simple solution to that problem until synchronization works better.
From what I can see, there might be an issue with the produced values being inconsistent between single- and multi-GPU setups. Single-GPU:
Multi-GPU:
@jezzarax There's something really, really weird going on here. According to your logs you get 8+ ppl for single GPU and ~6.4 for multi-GPU, which is a gigantic difference. Also, multi-GPU is the "weird" scenario, but apparently it's the more typical one where you get the unexpected result. I'm very skeptical about 8+ being correct; 6.4 sounds much more reasonable. I don't know anything about multi-GPU, so I can't help diagnose the actual problem.
I also assume something weird is happening, in addition to the performance problem.
I think that, given the high-quality state of llama.cpp and considering that new models like Llama 2 70B and Falcon 180B are open for our use, it would be quite important to get multi-GPU working better and close the performance gap to Python.
The case where they got the unexpected result was for single GPU, as far as I could see. That's what makes it so weird.
As I said before:
Regarding multi-GPU:
Regarding the ppl differences: We need to understand what is going on there.
I can do both, got access to a 1x node as well. Would
You'd need to build without GPU support, prompt processing (which is all
export CUDA_VISIBLE_DEVICES="-1" — that should enumerate 0 devices to the CUDA backend, so nothing could be initialized or sent to a GPU.
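As a quick sanity check (a sketch, assuming the environment variable is set before the process starts), the device count reported by the CUDA runtime should then be zero:

```cpp
// Illustrative only: with CUDA_VISIBLE_DEVICES="-1" the runtime should report
// zero devices (cudaGetDeviceCount typically returns cudaErrorNoDevice).
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess) count = 0;
    printf("visible CUDA devices: %d\n", count);
    return 0;
}
```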
Likely a bug was introduced in 4760e7c.
I made multiple runs over two commits and two quantisation levels: one commit from two-ish weeks ago and one from yesterday. It looks like there's something strange about f16 quantisation; the q8 results seem more consistent.
I'm not able to run f16 on bigger models with the current version of the code for now, due to #3930 (comment). If there are any other tests I can run on the multi-A100 setup, I'm happy to contribute.
I am not running batches, but I obtain performance comparable to exllama on 3090s and the best multi-GPU P40 speeds. It certainly beats transformers with accelerate or autogptq. I reach speeds similar to Metal for large models like Falcon with 2 or 3 P40s and 2x 3090s. I know that pipeline-style approaches were tried with llama_inference_offload in the original GPTQ version; they did speed things up past the normal 2 or 3 t/s that would come from using accelerate, but nowhere near to this. This is all using the MMQ kernels, though. The new batch kernels did not improve speeds, even on Ampere. Could the eventual Vulkan backend be faster than cuBLAS? I am just really confused how people could term multi-GPU in llama.cpp "bad" compared to all the other options. The only time I get slowdowns is prompt processing, and I'm not aware how to use the kv_cache token swapping like is done in koboldcpp, or whether it exists here.
When 2400 tokens/second drops to 300 tokens/second despite using twice the processing hardware, while inferencing the same model, we have a problem that needs solving. That's almost an order of magnitude in performance lost when adding a second card. I didn't intend to trigger emotions when I used the term "bad" in my later comment, just to point to the problem.
It's not emotion, it's just my experience with it. Splitting a model over multiple GPUs will always lower performance compared to a single GPU with contiguous memory. Have you tried any other inference engines that do not drop so badly, and what was the ratio for 1 card vs. 2?
It's not only about the performance drop. The numbers differ between single- and multi-GPU runs; please check the table I posted above. Producing correct results is crucial.
Problem:
I am aware everyone has different results; in my case I am running llama.cpp on a 4090 primary and a 3090 secondary, so both are quite capable cards for LLMs.
I am getting around 800% slowdowns when using both cards on the same model and settings (basically regardless of which model I tried); batch processing speed can drop from 2400 t/s to 200-300 t/s (8-10 times slower than on a single GPU).
This happens as soon as any tiny bit of processing (-ts) is shifted to the 2nd card.
I assume it is a synchronization problem in the CUDA loops; I also assume the issue does not affect every combination of GPUs, especially if one GPU is significantly slower.
Suggestion:
My suggestion is to add a parameter like -layer-split; when this is used, the tensors are not split up, instead whole layers are distributed across the cards (using -ls instead of -ts).
This means the calculations can all be computed without synchronization, on a single GPU, at the highest possible performance of that GPU.
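To make the idea concrete, here is a hypothetical sketch of such a layer-split assignment; the function name and the handling of the -ls ratio are made up for illustration and are not part of llama.cpp:

```cpp
// Hypothetical sketch: give each device a contiguous block of layers in
// proportion to its share of the split ratio, so every tensor of a layer
// lives on one GPU and the layer can be evaluated without cross-device syncs.
#include <vector>

std::vector<int> assign_layers_to_devices(int n_layers, const std::vector<float> & split) {
    float total = 0.0f;
    for (float s : split) total += s;

    std::vector<int> layer_to_device(n_layers);
    float acc = 0.0f;
    int dev = 0;
    for (int il = 0; il < n_layers; ++il) {
        // move on to the next device once its share of layers is used up
        while (dev + 1 < (int) split.size() &&
               (float) il >= (acc + split[dev]) / total * (float) n_layers) {
            acc += split[dev];
            ++dev;
        }
        layer_to_device[il] = dev;
    }
    return layer_to_device;
}
```

For example, with 80 layers and a 3:1 split, layers 0-59 would land on the first GPU and 60-79 on the second; only the activations at the layer boundary would ever cross devices.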
Caveat:
In theory, tensor split should boost performance, as both cards can process a split tensor at the same time, so it's the better solution; but currently that is so far from reality that the suggested layer split should significantly boost processing speed.
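To put a number on the per-layer cost, here is a self-contained toy benchmark (an assumption-laden sketch, not llama.cpp code) that only measures the fixed device-to-device copy plus synchronization a row-wise tensor split pays once per layer; at batch size 1 this fixed cost can easily dominate the actual matrix multiplications. Buffer size and layer count are illustrative.

```cpp
// Toy measurement only: time one small device-to-device copy plus a
// synchronization per "layer", i.e. the fixed overhead tensor splitting adds
// per token on top of the compute itself (e.g. 80 layers for a 70B model).
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int n = 0;
    cudaGetDeviceCount(&n);
    if (n < 2) { printf("need at least 2 GPUs\n"); return 0; }

    const size_t nbytes = 8192 * sizeof(float);
    float *buf0 = nullptr, *buf1 = nullptr;
    cudaSetDevice(0); cudaMalloc((void **) &buf0, nbytes);
    cudaSetDevice(1); cudaMalloc((void **) &buf1, nbytes);

    cudaSetDevice(0);
    const int n_layers = 80;
    const auto t0 = std::chrono::high_resolution_clock::now();
    for (int il = 0; il < n_layers; ++il) {
        cudaMemcpyPeer(buf0, 0, buf1, 1, nbytes); // gather the other GPU's partial result
        cudaDeviceSynchronize();                  // wait before the next layer can proceed
    }
    const auto t1 = std::chrono::high_resolution_clock::now();
    const double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
    printf("copy+sync overhead: %.3f ms/layer, %.1f ms per token over %d layers\n",
           ms / n_layers, ms, n_layers);

    cudaSetDevice(0); cudaFree(buf0);
    cudaSetDevice(1); cudaFree(buf1);
    return 0;
}
```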
@JohannesGaessler what do you think?