Poor CPU and GPU utilization in 1.62.2 compared to 1.61.2 for CuBLAS #786
Did you happen to update or change your driver, or have a Windows update, during this time? One major change between 1.61 and 1.62 was the introduction of the ubatch parameter upstream (https://github.com/ggerganov/llama.cpp/blob/master/examples/main/README.md#batch-size), which in KoboldCpp defaults to the same value as used in
Another command you can try is
Lastly, can you see if you have this regression happening in base llama.cpp too?
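For reference, a minimal launch sketch with these knobs set explicitly (assuming the standard KoboldCpp flags --threads, --blasthreads, --blasbatchsize, --usecublas, --gpulayers, and --contextsize; the model path and values are placeholders, not settings from this thread):

```python
# Minimal sketch: launch KoboldCpp with explicit thread/batch settings.
# Assumptions: standard KoboldCpp CLI flags; the model path is a placeholder.
import subprocess

cmd = [
    "python", "koboldcpp.py",
    "--model", "mixtral-8x7b-q4.gguf",  # placeholder model path
    "--usecublas",                      # CuBLAS backend
    "--gpulayers", "0",                 # no layers offloaded, as in the report
    "--threads", "8",                   # generation threads
    "--blasthreads", "8",               # threads used during the BLAS/prompt phase
    "--blasbatchsize", "512",           # prompt-processing batch size to experiment with
    "--contextsize", "4096",
]
subprocess.run(cmd, check=True)
```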
This is not related to the BLAS thread count (I've tried setting it to 16 too). I'll check.
@LostRuins, can you give the exact commit of llama.cpp from which kobold-1.61.2 was built? I've tested the current version against the one that everybody referred to (before model reconversion became needed for MoE), but did not see a difference. How do we pinpoint this? If random older versions might be inherently slower, but for newer versions the answer is "hey, you have to reconvert models", then I cannot honestly compare against old models anymore! And new models won't work in the old version. I'm not saying "make it as before"; if something was changed internally, then it was done for good reason. Rather:
Also, I've tested different BLAS batch sizes in KoboldCpp, and it seems the smaller the batch size, the more load on the CPU, but with lower speed too. Without BLAS batching it maxes out at 100% but runs really slowly. I tried testing light models with all layers offloaded, and the speed seems equal between 1.61 and 1.62; the CPU was almost idle.
1.61.2 is built from this KoboldCpp commit: f3b7651. The last sync with upstream source for that build would be
Something really strange is going on. Firstly, I have downloaded
If I don't touch the affinity, allowing the process to use all 16 virtual P-cores and 4 E-cores, then it performs much better!
I remember once I tried to set the affinity away from the very first two cores ("CPU 0"), and in that case (allowing koboldcpp to use cores 2 to 15) my CUDA utilization was around 0%, as if "the main GPU controller thread" was not pushing the work. But now moving
Can this mean that only users with Intel efficiency cores could hit the problem? By the way, can you tell me which llama.cpp release distribution archive I should use to get behavior most similar to koboldcpp for CuBLAS? Maybe some specific command-line parameters?
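For anyone wanting to repeat these affinity experiments programmatically rather than through Task Manager, here is a minimal sketch using psutil (the assumption that logical CPUs 0-15 are the hyperthreaded P-cores and 16-19 the E-cores matches an i7-12700K but should be adjusted for your topology; finding the process by name is only illustrative):

```python
# Minimal sketch: pin a running koboldcpp process to the P-cores only.
# Assumptions: logical CPUs 0-15 are the hyperthreaded P-cores and 16-19
# are the E-cores on an i7-12700K; the process is located by name.
import psutil

P_CORES = list(range(16))  # adjust to your CPU topology

for proc in psutil.process_iter(["name"]):
    if proc.info["name"] and "koboldcpp" in proc.info["name"].lower():
        proc.cpu_affinity(P_CORES)  # restrict the process to the P-cores
        print(f"Pinned PID {proc.pid} to CPUs {P_CORES}")
```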
My impression is that the CPU cores 'crunch' the data so the GPU can process it. If you use fewer cores, maybe the GPU doesn't have enough crunched input to work on. I use one thread per physical core; even so, the GPU may not reach 100%, probably because its memory bandwidth is another bottleneck. I let the OS choose the core affinity; hyperthreading usually lets me use the CPU for common tasks with no downside, since the niceness of the koboldcpp process is higher (its priority is lower).
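A small sketch of that priority arrangement on Linux (the niceness value 10 and the paths are illustrative, not taken from this thread):

```python
# Minimal sketch: launch koboldcpp with a higher niceness (lower priority)
# so interactive desktop work stays responsive. The value 10 and the model
# path are placeholders.
import subprocess

subprocess.run(
    ["nice", "-n", "10", "python", "koboldcpp.py",
     "--model", "model.gguf", "--usecublas"],
    check=True,
)
```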
Why I am concerned: if I'm fairly sure that E-cores hurt overall performance when used, then the performance of all my 20 cores is most likely suboptimal compared to what it could theoretically be without this bug/feature that broke CuBLAS for me! That holds even if 20 cores on the new version turn out to be faster than 16 cores on the older version (I actually need to re-benchmark everything to know for sure…).
@aleksusklim, note: I'm using Linux; perhaps the OS affinity selector here is better than the one in Windows. It certainly behaves differently.
Yeah, you cannot choose CuBLAS for main.exe at runtime; instead you need to build it with the correct compile flags to enable CUDA. For llama.cpp main, I believe they use the flag
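A sketch of such a CUDA-enabled build driven from Python. Note that the actual compile flag is not named in the comment above, so the LLAMA_CUBLAS option used here is an assumption based on llama.cpp trees from around this period (later trees renamed it), and should be checked against your checkout:

```python
# Sketch of building llama.cpp's main with CUDA enabled via CMake.
# Assumption: the CUDA toggle is LLAMA_CUBLAS=ON (not named in the comment
# above; later llama.cpp versions renamed it), so verify before using.
import subprocess

subprocess.run(["cmake", "-B", "build", "-DLLAMA_CUBLAS=ON"],
               cwd="llama.cpp", check=True)
subprocess.run(["cmake", "--build", "build", "--config", "Release"],
               cwd="llama.cpp", check=True)
```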
I was able to raise my CUDA utilization up to 60% in koboldcpp for CommandR+ by disabling E-cores in the UEFI settings altogether. This noticeably improved the performance of 1.62 (haven't checked 1.63 yet) for all models! Now I'm wondering how to properly ablate this to find the best settings. Things I could change are:
This is too big an experiment space! Instead, I think I have to fix a few things:
So, I'll vary only these: algo, threads, and batch, with and without offloading, for both BLAS and generation (a rough sweep over these is sketched below).
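A minimal sketch of that sweep, which just prints the KoboldCpp command line for every combination so each run can be launched and timed by hand (the flag names are the standard KoboldCpp ones; the value grids are placeholders, not the settings finally chosen here):

```python
# Minimal sketch: enumerate the configurations to test and print the command
# line for each, so every combination can be launched and benchmarked by hand.
# The value grids below are placeholders, not the thread's final choices.
from itertools import product

threads = [8, 12, 16]
blas_batch_sizes = [128, 256, 512]
gpu_layers = [0, 16, 33]  # 0 = nothing offloaded

for t, b, g in product(threads, blas_batch_sizes, gpu_layers):
    cmd = (
        f"python koboldcpp.py --model mixtral-8x7b-q4.gguf --usecublas "
        f"--threads {t} --blasbatchsize {b} --gpulayers {g} --contextsize 4096"
    )
    print(cmd)
```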
I disabled hyperthreading and found no significant benefit with 8 cores and 8 threads compared to 16 cores and 8 threads. Meanwhile, I also saw an interesting setting in the UEFI overclocking tab (I have an MSI PRO Z680-A DDR4 motherboard):
I thought this could be a good trade-off for me if I could just press Scroll Lock when using koboldcpp.
That Game Compatibility Mode thing sounds super weird. Sounds like a driver bug... does that happen with other software? It might be the reason for your E-core madness that nobody else really seems to have: a bad driver messing with the thread scheduling, maybe? Have you tried disabling the special features? I'm glad I don't have a modern Intel chip; these E-cores sound like such a pain to deal with. Maybe I should just use AMD Ryzen in the future.
There are a lot of "Intel Turbo this" and "Intel Technology that" entries in the Advanced Overclocking settings, everything set to "auto" in some way. I think it would be possible to find a culprit by trial and error, but generally disabling E-cores did the trick, right? By the way, when I loaded CommandR+ (into v1.63) with 0 offloaded layers under CuBLAS, the console printed something like "Out Of Memory, failed to allocate ~77 GB of shared memory" (not the exact wording; I can post the full log if you want to see it), and I know that with 128 GB of RAM I can have only 64 GB of "shared" GPU memory. But otherwise it works fine…
I decided to update my BIOS, lol. Still, having E-cores enabled performs worse than disabling them in UEFI, but much better than restricting affinity while they are enabled. Also, I thought the error/warning I mentioned above was caused by
By the way, there is no way to use both the Intel and NVIDIA GPUs simultaneously, right? I see that only CuBLAS allows using more than one GPU, but it obviously cannot detect the integrated one.
Possibly there were some modifications to the CUDA kernels. But especially if you offload everything to the GPU, I think it defaults to one thread only.
I realized that if I use a low max context size for Mixtral, for example 8k instead of 64k, then I can get CUDA utilization up to 99% during BLAS, provided I have disabled E-cores in UEFI and do not change the affinity. By the way, I heard that Win10 handles E-cores poorly compared to Win11, but I won't try setting up Win11 to confirm that!
I downloaded CommandR+ and noticed very low CPU usage.
But now I realized that every model performs much worse!
The BLAS stage is almost 5x slower in 1.62.2, as if the thread count was ignored (again?).
For example, with 1.61.2 on my Core i7-12700K + RTX 3060 I can request 16 threads and see 100% CPU load with 25% CUDA load for Mixtral 8x7B at 4k context, using 35 GB of RAM.
But the exact same setup on the new version gives only 20% CPU load and merely 5% CUDA load! I have 0 offloaded layers, which used to be very good and more performant for large models.
Note:
This is what the CPU looks like in Task Manager:
(In the old version, all 16 P-cores were maxed out during BLAS, but now they only max out during the actual generation phase.)
Also: #780
Strange, but I don't see double memory usage.