
Poor CPU and GPU utilization in 1.62.2 compared to 1.61.2 for CuBLAS #786

Closed

aleksusklim opened this issue Apr 16, 2024 · 17 comments

@aleksusklim

I downloaded CommandR+ and noticed very low CPU usage.
But then I realized that every model performs much worse!

The BLAS stage is almost 5x slower in 1.62.2, as if the thread count were being ignored (again?).

For example, with 1.61.2 on my Core i7-12700K + RTX 3060 I can ask for 16 threads and see 100% CPU load with 25% CUDA load for Mixtral 8x7B at 4k context, using 35 GB of RAM.

But the exact same setup with the new version gives only 20% CPU load and merely 5% CUDA load! I have 0 offloaded layers, which used to be the better and more performant choice for large models.

Note:

  • This is not only for MoE but for any model (tested LLAMA2 and Yi).
  • The generation phase is not affected, only BLAS.
  • Disabling MMAP does not help.
  • This is not related to the MMQ setting.
  • OpenBLAS and CLBlast look fine.
  • Reproducible with default server/client settings too.
  • I have only Q4_K_M and Q5_K_M quants to test.

This is how the CPU looks in Task Manager:

(In the old version all 16 P-cores were maxed out for BLAS, but now they are becoming maxed only for actual generation phase)

Also related: #780.
Strangely, I don't see the doubled memory usage there.

@LostRuins
Owner

Did you happen to update/change your driver or have a Windows update during this time?

One major change between 1.61 and 1.62 was the introduction of the ubatch parameter upstream (https://github.com/ggerganov/llama.cpp/blob/master/examples/main/README.md#batch-size), which in KoboldCpp defaults to the same value as --blasbatchsize. What value of --blasbatchsize are you using, and have you tried different values?
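If it helps, a rough way to time just the prompt-processing (BLAS) stage at each --blasbatchsize is to send a long prompt with max_length 1 to the KoboldAI-compatible API and measure the round trip. A minimal sketch, assuming the default port and endpoint:

# Minimal sketch: time prompt ingestion against a running koboldcpp instance,
# relaunched by hand with a different --blasbatchsize each time.
# Assumes the default port (5001) and the /api/v1/generate endpoint.
import time
import requests

PROMPT = "word " * 3000  # long prompt so the BLAS stage dominates
URL = "http://localhost:5001/api/v1/generate"

start = time.time()
resp = requests.post(URL, json={"prompt": PROMPT, "max_length": 1})  # 1 token => mostly prompt processing
resp.raise_for_status()
print(f"prompt processing took ~{time.time() - start:.1f}s")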

Another option you can try is --foreground, which sends the app to the foreground; this can potentially help resolve thread-scheduling issues if they are caused by E-cores.

Lastly, can you see if you have this regression happening in base llama.cpp too?

@aleksusklim
Author

This is not related to the BLAS thread count (I've tried setting it to 16 too).
This is not a foreground issue; I've changed the affinity manually.
This does not feel related to batching, but I haven't checked different BLAS batch sizes yet; I will.
I don't think any OS issue is involved, because I can literally compare the two versions side by side without restarting my machine!

I'll check main.exe from upstream and report how it works for me; thanks for the suggestion.

@aleksusklim
Author

@LostRuins, can you give the exact commit of llama.cpp from which koboldcpp 1.61.2 was built?

I've tested the current build vs. the one that everybody referred to (before model reconversion became needed for MoE), but did not see a difference.
Then I checked against a month-old build, and that old llama.cpp indeed maxes out the CPU load, but its actual BLAS speed turned out to be lower!
(For clarity: the new koboldcpp not only uses less CPU, it really does process more slowly.)

How do we pinpoint this? If random older versions might be inherently slower, while for newer versions they say "hey, you have to reconvert your models", then I cannot honestly compare using old models anymore, and new models won't work in the old version.

I'm not saying "make it like before"; if something was changed internally, it was presumably done for a reason. Rather:

  • Why can't CuBLAS utilize the CPU or GPU at 100% now? If this was intended, what is the reason?

Also, I've tested different BLAS batch sizes in koboldcpp: the smaller the batch size, the higher the CPU load, but the lower the speed too. Without BLAS batching it maxes out at 100% but runs really slowly.

I tried light models with all layers offloaded, and the speed seems equal between 1.61 and 1.62, with the CPU almost idle.
Could this mean the problem is specific to partially offloaded models (and especially "0 offloaded layers")?

@LostRuins
Owner

1.61.2 is built from this KoboldCpp commit f3b7651

The last sync with upstream source for that build would be
ggerganov@19885d2

@aleksusklim
Author

Something really strange is going on.

First, I downloaded llama-b2694-bin-win-cuda-cu11.7.1-x64.zip and tested it, and saw the "bad" low-CPU-usage behavior.
But!!

If I don't touch the affinity, allowing the process to use all 16 virtual P-cores and the 4 E-cores, then it performs much better!
It is not just better: if I look at the CUDA graph in Task Manager for my GPU, it reads:

  • Affinity to the first 16 cores = 5% CUDA
  • No affinity with -t 16 = 20% CUDA
  • No affinity with -t 8 (matching my physical P-cores) = 40% CUDA

I remember I once tried setting the affinity away from the very first two logical cores ("CPU 0"), and in that case (allowing koboldcpp to use cores 2 through 15) my CUDA utilization was around 0%, as if "the main GPU controller thread" was not pushing the work.

But now, moving main.exe away from the last 4 cores drastically lowers GPU usage!
How could that be!?
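For reference, here is roughly how I set up the three configurations from a script instead of Task Manager. A minimal sketch using psutil; the executable path, model file, prompt file, and core numbering (0–15 = P-core threads, 16–19 = E-cores on my 12700K) are assumptions:

import subprocess
import psutil

# Launch upstream main.exe with 0 offloaded layers (paths and flags are placeholders)
proc = subprocess.Popen(["main.exe", "-m", "model.gguf", "-t", "16", "-ngl", "0", "-f", "prompt.txt"])
p = psutil.Process(proc.pid)

# 1) Pin to the 16 P-core threads only  -> ~5% CUDA for me
p.cpu_affinity(list(range(16)))

# 2) Leave affinity alone (all 20 logical cores) with -t 16 -> ~20% CUDA
# p.cpu_affinity(list(range(20)))

# 3) Leave affinity alone and relaunch with -t 8 (one thread per physical P-core) -> ~40% CUDA

proc.wait()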

Could this mean that only users with Intel E-cores hit the problem?
And since nobody has responded in our dedicated thread – #447 (comment) – I conclude that there are no experienced users with E-cores here…

By the way, can you tell me which llama.cpp release archive I should use to get behavior most similar to koboldcpp with CuBLAS? Maybe some specific command-line parameters?
I didn't quite understand how to "choose" CuBLAS for main.exe.

@gustrd

gustrd commented Apr 18, 2024

My impression is that the CPU cores 'crunch' the data so the GPU may be able to process it. If you use fewer cores, maybe the GPU doesn't have enough crunched input to work on.

I use a thread for each physical core; even so, the GPU may not reach 100%, probably because the memory speed inside it is another bottleneck.

I let the OS choose the core affinity; the hyperthreading usually allows me to use the CPU for common tasks with no downside, as the niceness of the koboldcpp process is higher (priority is lower).
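For what it's worth, raising the niceness can be done along these lines (a sketch with psutil on Linux; the process-name match and the value 10 are just examples):

import psutil

# Find the running koboldcpp process and lower its scheduling priority
for p in psutil.process_iter(["name"]):
    if p.info["name"] and "koboldcpp" in p.info["name"].lower():
        p.nice(10)  # higher niceness = lower priority, so interactive tasks stay responsive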

@aleksusklim
Author

  1. Previously, affinity to 20 cores (P+E) was always worse than affinity to 16 cores (P-only), no matter the thread count.
  2. Too many threads (16 instead of 8, my physical P-core count) was a bit slower, but still maxed out CPU usage at 100%.
  3. When the process's console window stays in the background long enough, Windows temporarily forces its affinity to the 4 E-cores only, freeing my 16 P-core threads. I guess all 8 or 16 threads then have to sit on those 4 cores, maxing them out constantly (while total CPU usage stays below ~20%) and performing VERY slowly! (See the sketch after this list.)
  4. The only reliable solution is to disable E-cores in UEFI/BIOS. But I liked that koboldcpp ate all the P-cores, de facto leaving everything else on the E-cores, where the rest of the system (including the browser rendering the generated text) could sit nicely without interfering with generation.
  5. The less reliable solution is to periodically force the process to the foreground; it looks like I'll have to try that again to see how bad it is.
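Here is roughly how point 3 could be confirmed, i.e. whether the affinity mask really changes while the console sits in the background, or the scheduler just prefers the E-cores. A sketch with psutil; the process-name match is an assumption:

import time
import psutil

# Poll the koboldcpp process and print its current CPU affinity mask every few seconds
target = next(p for p in psutil.process_iter(["name"])
              if p.info["name"] and "koboldcpp" in p.info["name"].lower())
while True:
    print(time.strftime("%H:%M:%S"), sorted(target.cpu_affinity()))
    time.sleep(5)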

Why I am concerned: if I'm right that E-cores hurt overall performance when used, then the performance of all my 20 cores is most likely suboptimal compared to what it could theoretically be without this bug/feature that broke CuBLAS for me!

Even if 20 cores on the new version turn out to be faster than 16 cores on the old version (I actually need to re-benchmark everything to know for sure…).

@gustrd

gustrd commented Apr 18, 2024

@aleksusklim , note: I'm using Linux; perhaps the OS affinity selector here is better than in Windows. It certainly behaves differently.

@LostRuins
Owner

By the way, can you tell me which llama.cpp release archive I should use to get behavior most similar to koboldcpp with CuBLAS? Maybe some specific command-line parameters?
I didn't quite understand how to "choose" CuBLAS for main.exe.

Yeah, you cannot choose CuBLAS for main.exe at runtime; instead you need to build it with the correct compile flags to enable CUDA. For llama.cpp main, I believe they use the flag LLAMA_CUDA=1. You can see which commit to use by looking at when the last "merge" from the upstream branch was.
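Something along these lines should give a comparable run, assuming a make-capable toolchain; the model path, thread count, batch size, and prompt file below are placeholders:

import subprocess

# Build upstream llama.cpp with CUDA enabled (the Makefile flag mentioned above)
subprocess.run(["make", "LLAMA_CUDA=1"], cwd="llama.cpp", check=True)

# Run it roughly like koboldcpp's CuBLAS mode with 0 offloaded layers:
# prompt processing still goes through CUDA even with -ngl 0
subprocess.run(["./main", "-m", "models/mixtral-q4_k_m.gguf",
                "-t", "8",      # CPU threads
                "-ngl", "0",    # no layers offloaded to the GPU
                "-b", "512",    # batch size
                "-f", "prompt.txt"],
               cwd="llama.cpp", check=True)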

@aleksusklim
Author

I was able to raise my CUDA utilization up to 60% in koboldcpp for CommandR+ by disabling the E-cores in UEFI settings altogether. This noticeably improved the performance of 1.62 (I haven't checked 1.63 yet) for all models!

Now I'm wondering how to properly ablate this to find the best settings. The variables I could change are:

  • Model type: LLAMA2, Yi, Mixtral
  • Context size: 4k, 16k, 64k
  • Batch size: 128, 512, 2048
  • Quant type: Q4_0, Q5_K_M, Q8_0
  • Layer offload: 0, some, max
  • Thread count: 4, 8, 16
  • BLAS backend: OpenBLAS, CLBlast, CuBLAS, Vulkan

This is too big an experiment space! Instead, I think I have to fix a few things:

  • Pin the model to CommandR+ only, as the most promising one, or to Mixtral, as it is more stable. Pin its quant for now too.
  • Pin the context size at a reasonable amount, assuming its influence on performance is uniform (a longer context is equally slower regardless of the other settings).
  • Test only two layer offloads: zero and the maximum possible (since the model is super-big anyway).
  • Benchmark the BLAS stage and the generation stage separately for all tests.

So I'll vary only these: backend, threads, and batch size – with and without offloading, for BLAS and for generation (roughly as sketched below).
I'll keep the E-cores disabled; I'm not sure whether disabling hyper-threading (leaving me with only 8 real cores) might also improve anything, so I should probably test that beforehand.
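A rough sketch of how the sweep could be driven end to end (koboldcpp flag names are assumed from its current CLI; the model path, load wait, layer count, and port are placeholders):

import itertools
import subprocess
import time
import requests

# Reduced grid: backend x threads x BLAS batch size x layer offload
ALGOS = {
    "openblas": [],                            # no flag = default OpenBLAS backend
    "clblast":  ["--useclblast", "0", "0"],
    "cublas":   ["--usecublas"],
    "vulkan":   ["--usevulkan"],
}
THREADS = [4, 8, 16]
BATCH   = [128, 512, 2048]
OFFLOAD = [0, 33]          # zero vs. (assumed) max layers

PROMPT = "word " * 3000    # long prompt so the BLAS stage is measurable
URL = "http://localhost:5001/api/v1/generate"

for (name, algo), t, b, ngl in itertools.product(ALGOS.items(), THREADS, BATCH, OFFLOAD):
    server = subprocess.Popen(["koboldcpp.exe", "--model", "mixtral.gguf", "--skiplauncher",
                               "--threads", str(t), "--blasbatchsize", str(b),
                               "--gpulayers", str(ngl), "--contextsize", "8192"] + algo)
    time.sleep(120)        # crude: wait for the model to load

    t0 = time.time()
    requests.post(URL, json={"prompt": PROMPT, "max_length": 1}).raise_for_status()
    blas_s = time.time() - t0   # mostly prompt processing (BLAS)

    t0 = time.time()
    requests.post(URL, json={"prompt": PROMPT, "max_length": 200}).raise_for_status()
    gen_s = time.time() - t0    # the prompt should be reused, so this is mostly generation

    print(f"{name} t={t} b={b} ngl={ngl}: BLAS~{blas_s:.0f}s gen~{gen_s:.0f}s")
    server.terminate()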

@aleksusklim
Author

I disabled hyper-threading and found no significant benefit with 8 cores and 8 threads compared to 16 logical cores and 8 threads.

Meanwhile, I also saw an interesting setting in the UEFI overclocking tab (I have an MSI PRO Z690-A DDR4 motherboard):

  • Legacy Game Compatibility Mode: when enabled, pressing the Scroll Lock key toggles the E-cores between parked (Scroll Lock LED on) and un-parked (LED off).

I thought this could be a good trade-off for me if I could just press Scroll Lock when using koboldcpp.
But… CUDA utilization drops to 0% right away when I turn Scroll Lock on! (And everything lags badly during the toggle.)
Even if I start koboldcpp with the LED already on.
Sigh.

@LostRuins
Owner

That Game Compatibility Mode thing sounds super weird. Sounds like a driver bug... does that happen with other software? It might be the reason for your E-core madness that nobody else really seems to have – maybe a bad driver is messing with the thread scheduling? Have you tried disabling those special features?

I'm glad I don't have a modern Intel chip; these E-cores sound like such a pain to deal with. Maybe I should just use AMD Ryzen in the future.

@aleksusklim
Author

aleksusklim commented Apr 22, 2024

There are a lot of "Intel Turbo this" and "Intel Technology that" options in the Advanced Overclocking settings, all set to "auto" in some way.

I think it's possible to find the culprit by trial and error, but generally disabling the E-cores did the trick, right?

By the way, when I loaded CommandR+ (into v1.63) with 0 offloaded layers and CuBLAS, the console printed something like "Out of memory, failed to allocate ~77 GB of shared memory" (not the exact wording; I can post the full log if you want to see it), and I know that with 128 GB of RAM I can have only 64 GB of "shared" GPU memory.

But otherwise it works fine…
Now I don't think that benchmarking CommandR+ would be a good idea.

@aleksusklim
Author

I decided to update my BIOS, lol.
That changed nothing, except that now I have both the NVIDIA card and my integrated Intel UHD Graphics 770 available in Windows!
Previously, only the NVIDIA GeForce RTX 3060 was visible, replacing the integrated GPU when installed; I don't know which UEFI setting controls this, but the BIOS update reset all CMOS settings to new defaults (even my VeraCrypt bootloader was missing from EFI; I had to restore it via the rescue disk).

Still, with E-cores enabled performance is worse than with them disabled in UEFI, but much better than when they are enabled and I move the affinity away from them.
This seems to affect only CuBLAS and Vulkan (it hurts the "3D" utilization graph), but not CLBlast: with CLBlast, the CUDA and CPU utilization does not drop when I change the affinity away from the E-cores.

Also, I thought the error/warning I mentioned above was caused by having "Use mlock" checked, but it happens even without it (and it says "pinned", not "shared" as I said earlier; also, it does not happen with CLBlast or Vulkan):

ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
llm_load_tensors: ggml ctx size =    0.37 MiB
ggml_cuda_host_malloc: warning: failed to allocate 72662.17 MiB of pinned memory: out of memory
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/65 layers to GPU
llm_load_tensors:        CPU buffer size = 72662.17 MiB

By the way, there is no way to use both the Intel and the NVIDIA GPU simultaneously, right? I see that only CuBLAS allows using more than one GPU, but it obviously cannot detect the integrated one.

@aleksusklim
Author

The CPU utilization pattern during generation (after the BLAS stage) with CuBLAS is different in newer versions.

For example, with 8 threads and 16 cores (E-cores disabled) in version 1.63 I see this picture:
[Task Manager screenshot]

But the same config on the same model in 1.61.2 looks like this:
[Task Manager screenshot]

"Main" cores were loaded more than their hyperthreaded twins!

@LostRuins
Owner

Possibly there were some modifications to the CUDA kernels. But especially if you offload everything to the GPU, I think it defaults to one thread only.

@aleksusklim
Author

I realized that if I use a low max context size for Mixtral, for example 8k instead of 64k, then I can get CUDA utilization of up to 99% during BLAS, provided the E-cores are disabled in UEFI and I do not change the affinity.
Prompt ingestion is very fast…

By the way, I heard that Windows 10 handles E-cores poorly compared to Windows 11. But I won't try setting up Windows 11 to confirm that!
