GPU optimization across different cards #1427
Replies: 8 comments 8 replies
-
It could be determined at runtime by timing and tuning some parameters; not the most optimal, but acceptable for most users. Alternatively, there could be a separate utility that outputs a config file. These files could then be added to the repo via PRs to share with other users. This is pretty much how CLBlast does it.
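To make the runtime-tuning idea concrete, here is a minimal sketch, assuming a throwaway kernel (`dummy_matvec`) and a hand-picked list of candidate block sizes; none of this is actual llama.cpp code. It times the kernel at each candidate block size with CUDA events and keeps the fastest one:

```cuda
// Hypothetical runtime tuner, not llama.cpp code: time a placeholder kernel at
// several block sizes and keep the fastest one.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void dummy_matvec(const float * x, float * y, int n) {
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n) y[i] = 2.0f*x[i]; // stand-in for the real mat-vec work
}

int main() {
    const int n = 1 << 24;
    float * x; float * y;
    cudaMalloc(&x, n*sizeof(float));
    cudaMalloc(&y, n*sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    const int candidates[] = {32, 64, 128, 256, 512};
    int best_block = 0;
    float best_ms = 1e30f;

    for (int block : candidates) {
        const int grid = (n + block - 1)/block;
        dummy_matvec<<<grid, block>>>(x, y, n); // warm-up

        cudaEventRecord(start);
        for (int it = 0; it < 100; ++it) {
            dummy_matvec<<<grid, block>>>(x, y, n);
        }
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("block size %3d: %8.3f ms\n", block, ms);
        if (ms < best_ms) { best_ms = ms; best_block = block; }
    }
    printf("fastest block size: %d\n", best_block);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(x);
    cudaFree(y);
    return 0;
}
```

The same loop could iterate over whole kernel variants rather than just block sizes, and the winning parameters are what would end up in the shared config file.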
-
Some numbers for my old 1080Ti for 13B, and shiny new 3090Ti for 13B and 30B:
13B, 1080Ti:
13B, 3090Ti:
30B, 3090Ti:
-
So, do you consider adding both implementations to the current build (switchable with a single arg), then asking people to report the speed difference between the two implementations?
-
More or less, but it's also about e.g. determining which block sizes work well for specific cards.
-
PC: 12400 / 3070 Ti, model Wizard-Vicuna-13B-Uncensored.ggml.q8_0. It's faster, but not by much. Maybe I did something wrong?
-
Some additional comparative numbers for 13B, indicating that the performance improvement is much more apparent on the 3090Ti than on the 1080Ti. However, even the 1080Ti is appreciably improved. Before:
13B, 1080Ti, 16xPCIe Gen3:
13B, 3090Ti, 16xPCIe Gen3:
After:
13B, 1080Ti, 16xPCIe Gen3:
13B, 3090Ti, 16xPCIe Gen3:
-
@ggerganov as requested, here are benchmarks on an A6000. CPU details are in the collapsed Setup section.
30B: [collapsed 30B log]
65B: [collapsed 65B log]
Wow! I am really impressed by these figures. llama.cpp is getting really close to PyTorch GPU inference. I went on to test one of the models I uploaded to HF recently, gpt4-alpaca-lora_mlp-65B.
q4_0 GGML:
Result: 6.86 tokens/s (145.73 ms per run)
4-bit GPTQ, tested with AutoGPTQ in CUDA mode:
Result: 12.46 tokens/s
4-bit GPTQ, tested with AutoGPTQ in Triton mode:
Result: 6.12 tokens/s
So llama.cpp is getting really competitive with pytorch/transformers GPU inference and even beating the Triton code. Well done!
-
Quick and dirty test using Wizard-Vicuna-13B-Uncensored.ggml.q4_0.bin. GPU: RTX 3060 12 GB, CPU: Ryzen 5600X, RAM: 32 GB, OS: Linux Mint. Starting prompt:
Before:
After (with --n-gpu-layers 40):
In some previous runs with --n-gpu-layers 40 I had even faster times. Overall, a great jump in inference speed.
-
During the implementation of CUDA-accelerated token generation, a problem came up when optimizing performance: different people with different GPUs were getting vastly different results in terms of which implementation was the fastest. For example, @ggerganov wrote an alternative implementation that was 1.4 times faster on his RTX 4080 but 2 times slower on my GTX 1070. The point of this discussion is how to resolve this issue.
I personally believe that there should be some sort of config files for different GPUs. The user could then maybe use a CLI argument like
--gpu gtx1070
to get the GPU kernel, CUDA block size, etc. that provide optimal performance. Determining the optimal configuration could then be outsourced to users, who don't need programming knowledge to find the optimal parameters for a specific GPU; they only need to edit a config file and test whether the program becomes slower or faster.
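As a rough sketch of how such per-GPU configs could be consumed (the `gpu_config` fields, the table entries, and all numbers are made up for illustration; in practice the table would be read from the shared config files rather than hard-coded):

```cpp
// Hypothetical lookup of tuned parameters by GPU name, as selected with
// something like "--gpu gtx1070". All values are placeholders.
#include <cstdio>
#include <map>
#include <string>

struct gpu_config {
    int  mul_mat_block_size; // CUDA block size to use for the mat-vec kernels
    bool use_alt_kernel;     // whether the alternative implementation is faster
};

// In practice these entries would come from config files contributed via PRs;
// a hard-coded table keeps the sketch self-contained.
static const std::map<std::string, gpu_config> k_gpu_configs = {
    {"gtx1070", gpu_config{256, false}},
    {"rtx4080", gpu_config{128, true }},
};

int main(int argc, char ** argv) {
    const std::string gpu = argc > 1 ? argv[1] : "gtx1070";

    const auto it = k_gpu_configs.find(gpu);
    if (it == k_gpu_configs.end()) {
        fprintf(stderr, "no tuned config for '%s', falling back to defaults\n", gpu.c_str());
        return 0;
    }
    printf("%s: block size %d, alternative kernel %s\n",
           gpu.c_str(), it->second.mul_mat_block_size,
           it->second.use_alt_kernel ? "on" : "off");
    return 0;
}
```

An unknown GPU name would simply fall back to the current defaults, so missing entries cost nothing; contributing a tuned entry is just a small text change.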