
[Feature request] Simple option to set process affinity as a number of cores to use #447

Open
aleksusklim opened this issue Sep 22, 2023 · 15 comments
Labels
enhancement (New feature or request), help wanted (Extra attention is needed)

Comments

@aleksusklim commented Sep 22, 2023

I have an Intel Core i7-12700K (on Windows 10). It has 8 main «Performance» cores with hyperthreading and 4 energy-saving «Efficient» cores, giving 16+4=20 virtual cores in total.

The problem is that if I just run koboldcpp.exe as-is, then after some time Windows moves its background process to the 4 Efficient cores (the last ones, 17 to 20). It moves all threads of the process from the main 16 cores to just the 4 Efficient cores, no matter how many threads I set.
Performance is awful until I bring the console window to the foreground, and then boom – all main cores hit 100% load and the coolers instantly speed up!

The solution is to set the "process affinity" for the koboldcpp process (for example, in Task Manager), leaving only the first 16 cores for it.
Or to start the executable from a .bat file with something like start "koboldcpp" /AFFINITY FFFF koboldcpp.exe

But now I think other people might have this problem too, and it is very inconvenient to use the command line or Task Manager – especially since you have such a great UI with the ability to load stored configs!
You could add an option named something like "max number of cores to use:", which (if not zero) would set the process affinity mask to that number of cores, starting from core 0. (I believe the Efficient cores always come last, right?)

You could add a tooltip explaining that it is beneficial to put the number of "powerful" virtual cores there to increase performance, or to deliberately limit the cores used by koboldcpp so they stay free for other CPU-intensive applications (which plays nicely with the thread-count limit you already have in the GUI).

Do not add full affinity-mask support, because most users would not understand how to set it; those who do can just as well start from the command line with any desired affinity.
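
For illustration, here is a minimal sketch of what I mean (a hypothetical helper, not koboldcpp's actual code), using psutil, which supports per-process affinity on Windows:

import psutil

def limit_to_first_n_cores(n: int) -> None:
    # 0 means "option disabled": leave the default affinity untouched.
    if n <= 0:
        return
    total = psutil.cpu_count(logical=True)
    # Restrict the current process to logical cores 0 .. n-1.
    psutil.Process().cpu_affinity(list(range(min(n, total))))

# Example: keep the process on the first 16 virtual cores of an i7-12700K
# (equivalent to the FFFF affinity mask).
limit_to_first_n_cores(16)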

@LostRuins (Owner) commented

Does setting the process to high priority affect this? Because that is already an option.

@aleksusklim (Author) commented

I can confirm that this all-threads-on-Efficient-cores behavior still occurs for me even with the high priority checkbox enabled, on version 1.44.1 at default settings.

Here are screenshots of Task Manager, taken at normal priority:
before
after
You can see the process used only the last 4 cores. Then I switched to the koboldcpp console window – and the physical Performance cores started working.

It is hard to trigger it right away. I had around 10 unsuccessful attempts before, where everything stayed on the main cores from the start. (For some reason I had more success when using streaming mode, maybe because the browser in the foreground was constantly re-rendering text.)

I have more luck triggering this by connecting to the target machine over RDP and leaving it on a long generation. Then, when I check why it hasn't finished yet, I almost always see the Efficient cores loaded (and realize that I forgot to set the affinity again!)

When I tried ticking "high priority" in the GUI and leaving it like that, after some time my RDP session was disconnected and I couldn't log back in; then I tried to log in locally on the physical machine – it lagged badly, I saw 0% load on the main cores and a constant 100% load on all 4 Efficient cores, and then the system hung completely and I had to reboot.
(For me it looked like Windows decided to put ALL processes, including system ones, into the efficient state, but because koboldcpp was at high priority, nothing else could run there in parallel anymore…)

LostRuins added the enhancement label on Sep 23, 2023
@LostRuins (Owner) commented

@aleksusklim can I ask, what are your launch parameters? How many --threads did you start the program with?

@aleksusklim (Author) commented Sep 25, 2023

what are your launch parameters?

Correct me if I'm wrong, but I think it is impossible to use command-line options and still have the GUI show up and take effect?

For example, if I want to use an option that does not exist in the GUI, I would have to specify on the command line every single option I previously set via the GUI, along with it.
Because of that, I don't use command-line launch anymore: I set up my preferred settings in the GUI for each model that I want to use (including its path) and then save the config. So all I have to do is load the config I want and tweak it for the current launch (for example, disabling the GPU if my VRAM is busy with other tasks right now).

Back to your question: in my tests above I left everything at defaults (except for the model and streaming). I think it defaulted to 9 threads for some reason.

As for my usual setup, I use 16 threads and FFFF affinity, so one thread per virtual Performance core, leaving the 4 Efficient cores for other applications.
When I have a GPU-intensive background task, I set threads to 15 and affinity to the first 15 cores (via the checkboxes in Task Manager), so that one virtual Performance core stays free of koboldcpp.

Personally, I don't see the point of "1 thread per physical core" (instead of per virtual core, which would mean using 8 cores instead of 16).
Still, the offloading to Efficient cores happens regardless of how many threads koboldcpp uses. (Maybe the thread count changes the probability of offloading, but it does not eliminate it.)

I googled, and this offloading is a common problem. I saw two recommendations:

  1. Put Windows into "maximum performance" in the Power settings of Control Panel. – I did this, but for some reason it changed nothing for me.
  2. Disable Efficient cores completely in the BIOS. – I don't want to do that yet; I like that when there is nothing to do, all my processes run on the low-frequency cores, very quiet and power-friendly.

@LostRuins (Owner) commented

I have added what I hope will be a decent solution to this issue. You can now specify a launcher parameter --foreground that will bring the terminal console to the foreground every time a new generation is started. This should hopefully prevent Windows from using E-cores instead of P-cores. Please try it out!

@aleksusklim (Author) commented

I tried it.

The behavior is this:

  • After resuming from hibernation (for example, to make sure the E-cores kick in) and starting a generation, the taskbar icon of koboldcpp flashes but the window is not brought to the foreground. The generation runs on Efficient cores, slowly. But this happens only the first time!
  • The next time, the second generation (once the taskbar icon has already flashed enough) brings the console window on top of everything. Processing switches to P-cores and the speed is high.
  • If the window is minimized, it either flashes once or restores its position right away, no longer being minimized.
  • If the window is sent to another virtual desktop (Win+Tab, "add desktop", etc.), then the generation either brings its icon to THIS taskbar and flashes it here, or switches everything to THAT desktop, hiding all windows of the current desktop. (This is the normal behavior of forced foregrounding; any application would do it, actually.)
  • When the console appears in the foreground, it steals keyboard focus (even if it is moved almost off-screen). This means, for example, that an accidental Ctrl+C will kill it instead of copying the highlighted part of the history from the open browser.

I find these things VERY confusing for an end user, almost looking like bugs. Also, the user could accidentally click inside the popped-up console and "pause" it via the "text selection" feature, which would lead to users creating issues like "koboldcpp randomly freezes" (I have seen a lot of those in repos where the main program runs as a console process that shows its interface in a browser).

The only viable use case is headless sessions, where nobody works on the desktop of the user who owns the koboldcpp process. I haven't yet tested whether forced foregrounding eliminates E-core offloading on a locked workstation, e.g.: I connect over RDP, open koboldcpp, somehow share the URL with another machine, and close RDP – then the host will be locked but operational. (To test this I would need to measure actual speed, since I obviously won't be able to see the Task Manager window.)

Instead of convincing you that just setting the affinity is better because it eliminates these issues, I decided to measure the speed difference between "full affinity" and "P-cores-only affinity". I mean, even in the foreground I see my Efficient cores loaded during generation. But I have a feeling that if the process were forcibly restricted to Performance cores only, the generation speed would be slightly higher (because E-cores are really slow compared to P-cores!)

I started my tests and… it crashed?
I reproduced the crash several times; it looks quite reproducible.

Here are full logs from console:
***
Welcome to KoboldCpp - Version 1.45.2
For command line arguments, please refer to --help
***
Attempting to use OpenBLAS library for faster prompt ingestion. A compatible libopenblas will be required.
Initializing dynamic library: koboldcpp_openblas.dll
==========
Overriding thread count, using 12 threads instead.
Namespace(bantokens=None, blasbatchsize=2048, blasthreads=14, config=None, contextsize=8192, debugmode=False, forceversion=0, foreground=True, gpulayers=41, highpriority=False, hordeconfig=None, host='', launch=True, lora=None, model=None, model_param='C:/NN/GPT/GGML/mythalion-13b.Q5_K_M.gguf', multiuser=False, noavx2=False, noblas=False, nommap=False, onready='', port=5001, port_param=5001, psutil_set_threads=True, ropeconfig=[0.0, 10000.0], skiplauncher=False, smartcontext=False, stream=True, tensor_split=None, threads=12, unbantokens=True, useclblast=None, usecublas=None, usemirostat=None, usemlock=False)
==========
Loading model: C:\NN\GPT\GGML\mythalion-13b.Q5_K_M.gguf
[Threads: 12, BlasThreads: 14, SmartContext: False]

---
Identified as LLAMA model: (ver 6)
Attempting to Load...
---
Using automatic RoPE scaling (scale:1.000, base:32000.0)
System Info: AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
llama_model_loader: loaded meta data with 19 key-value pairs and 363 tensors from C:\NN\GPT\GGML\mythalion-13b.Q5_K_M.gguf (version GGUF V2 (latest))
llm_load_print_meta: format           = GGUF V2 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 4096
llm_load_print_meta: n_embd           = 5120
llm_load_print_meta: n_head           = 40
llm_load_print_meta: n_head_kv        = 40
llm_load_print_meta: n_layer          = 40
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: n_ff             = 13824
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: model type       = 13B
llm_load_print_meta: model ftype      = unknown, may not work
llm_load_print_meta: model params     = 13.02 B
llm_load_print_meta: model size       = 8.60 GiB (5.67 BPW)
llm_load_print_meta: general.name   = LLaMA v2
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token  = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.12 MB
llm_load_tensors: mem required  = 8801.75 MB
...................................................................................................
llama_new_context_with_model: n_ctx      = 8192
llama_new_context_with_model: freq_base  = 32000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: kv self size  = 6400.00 MB
llama_new_context_with_model: compute buffer total size = 2749.89 MB
Load Model OK: True
Embedded Kobold Lite loaded.
Starting Kobold HTTP Server on port 5001
WARNING: --unbantokens is DEPRECATED and will be removed soon! EOS unbans should now be set via the generate API.
WARNING: --stream is DEPRECATED and will be removed soon! This was a Kobold Lite only parameter, which is now a proper setting toggle inside Lite.
WARNING: --psutil_set_threads is DEPRECATED and will be removed soon! This parameter was generally unhelpful and unnecessary, as the defaults were usually sufficient
Please connect to custom endpoint at http://localhost:5001
Force redirect to streaming mode, as --stream is set.

Input: {"n": 1, "max_context_length": 8192, "max_length": 128, "rep_pen": 1.1, "temperature": 0.85, "top_p": 0.85, "top_k": 0, "top_a": 0, "typical": 1, "tfs": 1, "rep_pen_range": 1024, "rep_pen_slope": 0.7, "sampler_order": [6, 0, 1, 3, 4, 2, 5], "genkey": "KCPP7213", "prompt": "<|system|>Enter RP mode. Pretend to be Albert Einstein at his prime of life. You shall reply to the user while staying in character, and generate long responses.\n<|user|>Where were you born?\n<|model|>", "quiet": true, "stop_sequence": ["<|user|>", "<|model|>", "\n", "<", "|"], "use_default_badwordsids": false}

Processing Prompt [BLAS] (59 / 59 tokens)
Generating (25 / 128 tokens)
(Stop sequence triggered: <
>)
Time Taken - Processing:8.9s (151ms/T), Generation:5.2s (208ms/T), Total:14.1s (1.8T/s)
Output:  I was born on March 14, 1879, in Ulm, Württemberg, Germany.

Exception ignored in: <function Variable.__del__ at 0x000001D5C07E5700>
Traceback (most recent call last):
  File "tkinter\__init__.py", line 363, in __del__
RuntimeError: main thread is not in main loop
Exception ignored in: <function Variable.__del__ at 0x000001D5C07E5700>
Traceback (most recent call last):
  File "tkinter\__init__.py", line 363, in __del__
RuntimeError: main thread is not in main loop
Exception ignored in: <function Variable.__del__ at 0x000001D5C07E5700>
Traceback (most recent call last):
  File "tkinter\__init__.py", line 363, in __del__
RuntimeError: main thread is not in main loop
Exception ignored in: <function Variable.__del__ at 0x000001D5C07E5700>
Traceback (most recent call last):
  File "tkinter\__init__.py", line 363, in __del__
RuntimeError: main thread is not in main loop
Exception ignored in: <function Variable.__del__ at 0x000001D5C07E5700>
Traceback (most recent call last):
  File "tkinter\__init__.py", line 363, in __del__
RuntimeError: main thread is not in main loop

I don't know what happened or why. If you need more info, my config, or my test history, I will provide that too.
I can test this on any other version if needed.

@LostRuins (Owner) commented

Yeah, the idea of --foreground was mainly aimed at headless operation; since you mentioned using an RDP session and connecting remotely, it ensures the application always receives priority during generation.

The main reason why I don't want to add /AFFINITY directly is that the required mask will be different for each CPU setup. You cannot just set /AFFINITY FFFF – that would simply allow the process to use the first 16 cores. You'd need to figure out which cores to enable and which to disable, and that is different for every PC. There are cases where the system has only 2 P-cores and another 8 E-cores, for example.

@aleksusklim (Author) commented

That's why I explicitly stated that you don't need to provide full affinity support!

Just "use the first N cores", with a mathematically crafted mask so that only those cores are selected:

Cores = Mask
0 = -1 (all / disabled / do nothing)
1 = 1
2 = 3
3 = 7
4 = F
5 = 1F
6 = 3F
7 = 7F
8 = FF
9 = 1FF
10 = 3FF
11 = 7FF
12 = FFF
...

(I used an online affinity mask calculator, like https://bitsum.com/tools/cpu-affinity-calculator/)
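
(The math is trivial – the mask for the first N cores is just (1 << N) - 1. A tiny Python illustration, purely to show the formula:)

# Affinity mask selecting the first n logical cores: n one-bits starting from core 0.
def first_n_cores_mask(n: int) -> int:
    return (1 << n) - 1

print(hex(first_n_cores_mask(16)))  # 0xffff
print(hex(first_n_cores_mask(12)))  # 0xfff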

I think this would suffice for any processor with E-cores, provided the user knows how many Performance virtual cores they have.
Why do you think the user would ever need a "mask with holes"? To run "only on even-numbered virtual cores"? What would be the reason to do so – to use each hyperthreaded physical core at only half capacity?
Are there ever Efficient cores that come before (or interleaved with) the Performance ones?

I think you could give a rule of thumb: "limit the used cores to double your thread count" (so that there are at least twice as many virtual processors, making sure all physical ones are covered). Since you already suggest setting the thread count to half of all cores, this would mean "use everything"; BUT if the user knows about Performance cores, or specifically wants to, for example, free one physical core (e.g. of 16), he might use "cores/2-1" (= 7 of 16) for the thread count and "(cores/2-1)*2" (= 14 of 16) for the core count.

Just setting fewer threads to completely free a core is not enough: Windows still spreads the threads across all cores (except for the case when it offloads everything to Efficient cores while idle).

@LostRuins (Owner) commented

That still won't work correctly on the system I described previously (2 P-cores and 8 E-cores), in which case you do want to use more than just the 2 P-cores.

I think the actual problem you may be encountering is your OS CPU scheduler, which is too aggressive at throttling. Are you running on some sort of power-saving or energy-saving scheme? Because most of the advice I've come across is to allow the OS to handle this kind of thing.

In either case, advanced users are, like you mentioned, able to use /AFFINITY when launching the executable on their own systems. I would like to get some feedback from other people on how many are facing this issue, and what they think of this approach.

LostRuins added the help wanted label on Oct 5, 2023
@aleksusklim (Author) commented

the system I described previously (2 P-Cores and 8 E-Cores), in which case you do want to use more than just the 2 P-Cores only.

So what's the problem with the user setting "2" and using only the P-cores available?

@aleksusklim (Author) commented Oct 6, 2023

I did some further testing on affinity in the offloaded state:

  • Changing the affinity does not make the process leave the offloaded state.
  • As long as there is at least one available E-core, no P-cores are used.
  • So there is no point in disabling "only the last E-core". Disabling other P-cores didn't help either, no matter whether even or odd ones.
  • After disabling all E-cores, every available P-core jumps to 100% load immediately.
  • Resetting the affinity or allowing any E-core drops all P-core load to 0%.
    I double-checked that my power settings are at "maximum performance" (except for the display-off timeout).

This leads to the conclusion that the only solution is to prevent the process from touching any E-core.
My approach of "tell how many of the first cores you want to use" will work UNLESS there are processors where E-cores are interleaved with P-cores, or where E-cores come first.
Are there?
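
(For anyone who wants to reproduce this: the affinity of the already-running process can also be toggled from a separate Python shell. A sketch using psutil; the process name koboldcpp.exe and the 16-of-20 core split are just my setup:)

import psutil

# Find the running koboldcpp process by name (assumed to be koboldcpp.exe).
proc = next(p for p in psutil.process_iter(["name"])
            if (p.info["name"] or "").lower() == "koboldcpp.exe")

proc.cpu_affinity(list(range(16)))                    # P-cores only (first 16 of 20)
print(proc.cpu_affinity())                            # verify the new mask
proc.cpu_affinity(list(range(psutil.cpu_count())))    # reset to all cores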

@aleksusklim (Author) commented

in which case you do want to use more than just the 2 P-Cores only.

It's worth checking whether 16+4 cores work better than 16+0 cores, in different modes.
What about the crash? It didn't happen today when I used the GPU. My previous test that crashed was CPU-only.

@LostRuins (Owner) commented

I have not encountered any crashes recently

@aleksusklim (Author) commented

I've tested version 1.46.1 and that crash is no longer there, great. Now I can compare the performance of different thread counts against different affinities…

First: CLBlast with 41/41 layers offloaded (13B model, context set to 8k) to an RTX 3060.
All numbers are ms/T from the "Generation" time, the best (lowest) result out of 5 attempts for each mode.

Thread count \ First N cores | 20/20 (all cores) | 16/20 (P-cores) | 8/20 (four physical) | 4/20 (two physical)
4 threads                    | 87                | 85              | 93                   | 96
8 threads                    | 89                | 85              | 90                   | 94
16 threads                   | 95                | 89              | 93                   | 97
20 threads                   | 96                | 90              | 95                   | 98

Second: OpenBLAS on CPU.

Thread count \ First N cores | 20/20 (all cores) | 16/20 (P-cores) | 8/20 (four physical) | 4/20 (two physical)
4 threads                    | 262               | 214             | 304                  | 374
8 threads                    | 228               | 201             | 235                  | 305
16 threads                   | 207               | 205             | 228                  | 294
20 threads                   | 208               | 213             | 232                  | 294

Observations:

  • No matter the thread count, using all P-cores is always better than using only some of them.
  • The overall picture is the same for CPU and GPU, but on GPU the relative performance differences are negligible.
  • The best mode is "no E-cores + one thread per physical core".
  • Using E-cores along with P-cores is worse than using P-cores only, as long as there are fewer threads than the total number of virtual cores.

Since there is no performance gain from using FEWER cores than "all minus Efficient", the only reason a user would want to do this is to "free a core", but I cannot tell when exactly that might be needed.

Clearly, E-cores have a negative impact on performance even when no full offloading happens!
For example, by default koboldcpp suggests using 9 threads. Here is the speed with an absolutely default config (2k context) on a 13B model:

Thread count \ Process affinity                  | All 20 cores | Only 16 P-cores
9 threads (koboldcpp default for the 12700K)     | 230 ms/T     | 224 ms/T
8 threads (number of physical Performance cores) | 226 ms/T     | 200 ms/T

Instead of giving direct control of the process affinity, you might implement a checkbox like "Do not use Intel E-cores", if you can programmatically detect the P-cores and set the affinity to only them.
That looks like it would be enough!
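
A rough, untested sketch of one possible way to detect P-cores on Windows 10/11: query GetSystemCpuSetInformation from kernel32 and group logical processors by their EfficiencyClass (on hybrid Intel CPUs the P-cores report a higher efficiency class than the E-cores; on non-hybrid CPUs everything is class 0). The struct layout below follows the documented SYSTEM_CPU_SET_INFORMATION and should be treated as an assumption:

import ctypes
from ctypes import wintypes

class CPU_SET_INFORMATION(ctypes.Structure):
    # Flattened SYSTEM_CPU_SET_INFORMATION (CpuSet variant), 32 bytes on x64.
    _fields_ = [
        ("Size", wintypes.ULONG),
        ("Type", wintypes.ULONG),                 # 0 = CpuSetInformation
        ("Id", wintypes.ULONG),
        ("Group", wintypes.USHORT),
        ("LogicalProcessorIndex", ctypes.c_ubyte),
        ("CoreIndex", ctypes.c_ubyte),
        ("LastLevelCacheIndex", ctypes.c_ubyte),
        ("NumaNodeIndex", ctypes.c_ubyte),
        ("EfficiencyClass", ctypes.c_ubyte),
        ("AllFlags", ctypes.c_ubyte),
        ("Reserved", wintypes.ULONG),
        ("AllocationTag", ctypes.c_uint64),
    ]

def detect_performance_cores():
    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    needed = wintypes.ULONG(0)
    # First call with an empty buffer only reports the required length.
    kernel32.GetSystemCpuSetInformation(None, 0, ctypes.byref(needed), None, 0)
    buf = (ctypes.c_char * needed.value)()
    kernel32.GetSystemCpuSetInformation(buf, needed.value, ctypes.byref(needed), None, 0)

    classes = {}  # logical processor index -> efficiency class
    offset = 0
    while offset < needed.value:
        info = CPU_SET_INFORMATION.from_buffer_copy(buf, offset)
        if info.Type == 0:
            classes[info.LogicalProcessorIndex] = info.EfficiencyClass
        offset += info.Size

    best = max(classes.values())
    return sorted(i for i, c in classes.items() if c == best)

# Hypothetical use for such a checkbox:
# psutil.Process().cpu_affinity(detect_performance_cores())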

@Erquint commented Oct 11, 2024

I've got 6 cores, 2 hardware threads each.
Running koboldcpp set to use 6 software threads somehow ends up utilizing all 12 of my hardware threads.
I'm guessing the Windows kernel scheduler is making a mess of it because of the lacking affinity mask, spreading the load and dubiously prioritizing computation rate over cache rationing.
To optimize for cache topology, I would much prefer to run one thread per discrete core instead, since the two hardware threads of a core share its L2 cache.
This usually provides immense speedups in heavy multithreaded tasks, in my experience.

Without an affinity mask, the threads setting currently doesn't even do anything meaningful because of the rescheduling.

P. S.

To automate optimal affinity mask generation, hwloc can be used as a component to query cache topology.
But the simplest approach is to just use every Nth hardware thread, where N is the number of hardware threads per core of the processor. It's best to skip the 0th thread in case of system reservation. It gets more complicated with processors that have different kinds of cores.

P. P. S.

Wrote reference Ruby code for calculating bitmasks from arrays.

Code
module Thread_affinity_bitmask
  def thread_affinity_bitmask(indices) # Accepts an array of 0-offset integer thread indices.
    bitmask = 0
    indices.each do |core|
      bitmask = 1 << core | bitmask # Bitshift, then bitwise OR.
    end
    return bitmask # Returns a numerical thread affinity bitmask.
  end
  
  # Print helper. Vararg parameter.
  def print_hex_bitmask(*indices)
    puts thread_affinity_bitmask(indices.map(&:pred)).to_s(16).upcase()
  end
end
Demo
# Require or concat preceding module definition.

include Thread_affinity_bitmask # Think of this as using a namespace.

# Examples printed with 1-offset integer thread indices passed.
print_hex_bitmask(1, 3, 5, 7, 9, 11)
print_hex_bitmask(2, 4, 6, 8, 10, 12)
print_hex_bitmask(1, 3, 5, 7, 9, 11, 13, 15, 17, 19)
print_hex_bitmask(2, 4, 6, 8, 10, 12, 14, 16, 18, 20)
print_hex_bitmask(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20)
print_hex_bitmask(1)
print_hex_bitmask(1, 2)
print_hex_bitmask(1, 2, 3)
print_hex_bitmask(1, 2, 3, 4)
print_hex_bitmask(1, 2, 3, 4, 5)
print_hex_bitmask(1, 2, 3, 4, 5, 6)
print_hex_bitmask(1, 2, 3, 4, 5, 6, 7)
print_hex_bitmask(1, 2, 3, 4, 5, 6, 7, 8)
print_hex_bitmask(1, 2, 3, 4, 5, 6, 7, 8, 9)
print_hex_bitmask(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
print_hex_bitmask(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11)
print_hex_bitmask(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12)
print_hex_bitmask(2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12)
print_hex_bitmask(3, 4, 5, 6, 7, 8, 9, 10, 11, 12)
print_hex_bitmask(4, 5, 6, 7, 8, 9, 10, 11, 12)
print_hex_bitmask(5, 6, 7, 8, 9, 10, 11, 12)
print_hex_bitmask(6, 7, 8, 9, 10, 11, 12)
print_hex_bitmask(7, 8, 9, 10, 11, 12)
print_hex_bitmask(8, 9, 10, 11, 12)
print_hex_bitmask(9, 10, 11, 12)
print_hex_bitmask(10, 11, 12)
print_hex_bitmask(11, 12)
print_hex_bitmask(12)

Output:

555
AAA
55555
AAAAA
FFFFF
1
3
7
F
1F
3F
7F
FF
1FF
3FF
7FF
FFF
FFE
FFC
FF8
FF0
FE0
FC0
F80
F00
E00
C00
800
