[Feature request] Simple option to set process affinity as a number of cores to use #447
Comments
Does setting the process to high priority affect this? Because that is already an option. |
@aleksusklim Can I ask, what are your launch parameters? How many --threads did you start the program with? |
Correct me if I'm wrong, but I think it is impossible to use both command-line options and still have the GUI show up and take effect? For example, if I want to use an option that does not exist in the GUI, then I have to specify every single option along with it, including everything I previously set via the GUI?
Back to your question: in my tests above, I left everything at defaults (except for the model and streaming). I think it defaulted to 9 threads for some reason. As for my usual setup, I set 16 threads and an FFFF affinity mask, leaving one thread per virtual performance core and dropping the 4 efficient cores for other applications. Personally, I don't see the point of "1 thread per 1 physical core" (instead of virtual, which would mean using 8 threads instead of 16).
I googled, and this offloading is a common problem. I saw two recommendations:
|
I have added what I hope will be a decent solution to this issue. You can now specify a launcher parameter --foreground. |
I tried it. The behavior is this:
I find these things VERY confusing for the end user, almost looking like bugs. Also, the user could accidentally click inside the popped-up console and "pause" it via the "text selection" feature, which would lead to users creating issues like "koboldcpp randomly freezes" (I saw a lot of those in repos where the main program runs as a console process that shows its interface in the browser).
The only viable use case is headless sessions, where nobody works on the desktop of the user who owns the koboldcpp process. I haven't tested yet whether forced foreground eliminates E-core offloading when the workstation is locked, e.g.: I connect via RDP, I open koboldcpp, I somehow share the URL to another machine, I close RDP; the host will then be locked but operational. (To test this I would need to measure actual speed, since I obviously won't be able to see the Task Manager window.)
Instead of convincing you that just setting affinity is better because it eliminates these issues, I decided to measure the speed difference between "full affinity" and "only P-cores affinity". I mean, even in the foreground I see my efficient cores loaded during generation. But I have a feeling that if the process were forcefully restricted to the performance cores only, the generation speed would be slightly higher (because E-cores are really slow compared to P-cores!)
I started my tests and… it crashed? Here are the full logs from the console:
I don't know what happened or why. If you need more info (my config or my test history), I will provide it too. |
Yeah, the idea of --foreground was mainly aimed at headless operation: since you mentioned using an RDP session and connecting remotely, it would ensure the application always receives priority during generation. The main reason why I don't want to add |
That's why I explicitly stated that you don't need to provide full affinity support! Just "Use first N cores", with a mathematically crafted mask so that only those cores are selected:
(I used an online affinity mask calculator, like https://bitsum.com/tools/cpu-affinity-calculator/.) I think this would suffice for any E-core-enabled processor, provided the user knows how many performant virtual cores he has. I think you can give a rule of thumb: "limit used cores to double your thread count" (so that there are at least twice as many virtual processors, making sure all physical ones are selected). Since you already suggest setting the thread count to half of all cores, this would mean "use everything"; BUT if the user knows about performance cores, or specifically wants to, for example, free one physical core (e.g. 1 of 16), he might use "cores/2 - 1" (= 7 of 16) for the thread count and "(cores/2 - 1) * 2" (= 14 of 16) for the core count. Just setting fewer threads is not enough to completely free a core: Windows still spreads threads across all cores (except for the case when it offloads to the efficient cores while idle). |
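As a minimal Ruby sketch of the mask math described above (the helper name is illustrative only, not an actual koboldcpp option), "use first N cores" is simply the lowest N bits set:

# Affinity mask selecting the first n logical cores (bits 0..n-1).
def first_n_cores_mask(n)
  (1 << n) - 1
end

# Rule of thumb from above: allow twice as many cores as threads.
threads = 7                                           # cores/2 - 1 on a 16-thread CPU
puts first_n_cores_mask(threads * 2).to_s(16).upcase  # => "3FFF" (14 of 16 cores)
puts first_n_cores_mask(16).to_s(16).upcase           # => "FFFF" (all 16 P-core threads)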
That still won't work correctly on the system I described previously (2 P-cores and 8 E-cores), in which case you do want to use more than just the 2 P-cores. I think the actual problem you may be encountering is your OS CPU scheduler being too aggressive at throttling. Are you running on some sort of power-saving or energy-saving scheme? Most of the advice I've come across is to let the OS handle this kind of thing. In either case, advanced users are, as you mentioned, able to use /AFFINITY when launching the executable on their own systems. I would like to get some feedback from other people on how many are facing this issue, and what they think of this approach. |
So what's the problem with the user setting "2" and using only the available P-cores? |
I did some further testing on affinity in the offloaded state:
From this I conclude that the only solution is to prevent the process from touching any E-core. |
It's worth checking whether 16+4 cores work better than 16+0 cores in different modes. |
I have not encountered any crashes recently. |
I've tested version 1.46.1 and that crash is no longer there, great. Now I can compare the performance of different thread counts against different affinity masks… First: CLBlast with 41/41 layers offloaded to an RTX 3060 (13B model, context set to 8k).
Second: OpenBLAS on CPU.
Observations:
Since there is no performance gain from using FEWER cores than "all minus efficient", the only reason the user would want to do this is to "free a core", but I cannot tell when exactly that might be needed. Clearly, E-cores have a negative impact on performance even when no full offloading happens!
Instead of giving direct control over the process affinity, you might implement a checkbox like "Do not use Intel E-cores", if you could programmatically detect P-cores and set affinity only to them. |
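A minimal sketch of what such a checkbox could compute, assuming (as suggested elsewhere in this thread) that the E-cores are enumerated last. Detecting the E-core count itself (for example via CPU topology queries) is left out here, and the helper name is illustrative only:

# Clear the trailing E-core bits from a full-affinity mask.
def p_core_only_mask(total_logical_cores, e_core_count)
  full_mask = (1 << total_logical_cores) - 1
  e_core_mask = ((1 << e_core_count) - 1) << (total_logical_cores - e_core_count)
  full_mask & ~e_core_mask
end

# i7-12700K example from this thread: 20 logical cores, the last 4 are E-cores.
puts p_core_only_mask(20, 4).to_s(16).upcase  # => "FFFF"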
Got 6 cores, 2 hardware threads each. Without an affinity mask, the threads setting currently doesn't even do anything meaningful because of the rescheduling.
P.S. To automate optimal affinity mask generation, hwloc can be used as a component to query cache topology.
P.P.S. I wrote reference Ruby code for calculating bitmasks from arrays.
Code:
module Thread_affinity_bitmask
  def thread_affinity_bitmask(indices) # Accepts an array of 0-offset integer thread indices.
    bitmask = 0
    indices.each do |core|
      bitmask = 1 << core | bitmask # Bitshift, then bitwise OR.
    end
    return bitmask # Returns a numerical thread affinity bitmask.
  end

  # Print helper. Vararg parameter.
  def print_hex_bitmask(*indices)
    puts thread_affinity_bitmask(indices.map(&:pred)).to_s(16).upcase()
  end
end
Demo:
# Require or concat preceding module definition.
include Thread_affinity_bitmask # Think of this as using a namespace.
# Examples printed with 1-offset integer thread indices passed.
print_hex_bitmask(1, 3, 5, 7, 9, 11)
print_hex_bitmask(2, 4, 6, 8, 10, 12)
print_hex_bitmask(1, 3, 5, 7, 9, 11, 13, 15, 17, 19)
print_hex_bitmask(2, 4, 6, 8, 10, 12, 14, 16, 18, 20)
print_hex_bitmask(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20)
print_hex_bitmask(1)
print_hex_bitmask(1, 2)
print_hex_bitmask(1, 2, 3)
print_hex_bitmask(1, 2, 3, 4)
print_hex_bitmask(1, 2, 3, 4, 5)
print_hex_bitmask(1, 2, 3, 4, 5, 6)
print_hex_bitmask(1, 2, 3, 4, 5, 6, 7)
print_hex_bitmask(1, 2, 3, 4, 5, 6, 7, 8)
print_hex_bitmask(1, 2, 3, 4, 5, 6, 7, 8, 9)
print_hex_bitmask(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
print_hex_bitmask(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11)
print_hex_bitmask(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12)
print_hex_bitmask(2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12)
print_hex_bitmask(3, 4, 5, 6, 7, 8, 9, 10, 11, 12)
print_hex_bitmask(4, 5, 6, 7, 8, 9, 10, 11, 12)
print_hex_bitmask(5, 6, 7, 8, 9, 10, 11, 12)
print_hex_bitmask(6, 7, 8, 9, 10, 11, 12)
print_hex_bitmask(7, 8, 9, 10, 11, 12)
print_hex_bitmask(8, 9, 10, 11, 12)
print_hex_bitmask(9, 10, 11, 12)
print_hex_bitmask(10, 11, 12)
print_hex_bitmask(11, 12)
print_hex_bitmask(12)
Output:
|
I have an Intel Core i7-12700K (on Windows 10); it has 8 main «Performance» cores with hyperthreading and 4 energy-saving «Efficient» cores, giving 16+4=20 virtual cores in total.
The problem is, if I just run koboldcpp.exe as-is, then after some time Windows will move its background process to the 4 efficient cores (the last ones, from 17 to 20). Yes, it moves all threads of the process from the main 16 cores to just the 4 efficient cores, no matter how many threads I set.
Performance is awful until I bring the console window to the foreground, and then, boom, all main cores get 100% load and the coolers instantly speed up!
The solution is to set the "process affinity" for the koboldcpp process (for example in Task Manager), leaving only the first 16 cores for it.
Or to start the executable from a .bat file with something like:
start "koboldcpp" /AFFINITY FFFF koboldcpp.exe
But now I think that other people might have this problem too, and it is very inconvenient to use the command line or Task Manager, because you have such a great UI with the ability to load stored configs!
You could add an option named something like "max number of cores to use:", which (if not zero) would set the process affinity mask to that number of cores, starting from core 0. (I believe efficient cores are always at the end, right?)
You could add a tooltip explaining that it is beneficial to put there the number of "powerful" virtual cores to increase performance, or to deliberately limit the cores used by koboldcpp so they are left free for other CPU-intensive applications (which plays nicely with the thread-count limit you already have in the GUI).
Do not add full affinity mask support, because most users would not understand how to set it, while those who can might as well start it from the command line with any desired affinity.
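As a rough illustration of what such an option would boil down to on Windows (this is not koboldcpp's actual code, which is a Python launcher; the helper name is made up, and a 64-bit Windows Ruby is assumed), the chosen core count becomes a mask of the first N bits, applied with the Win32 SetProcessAffinityMask call:

require 'fiddle'

# Hypothetical helper: restrict the current process to the first n logical cores.
def restrict_to_first_n_cores(n)
  mask = (1 << n) - 1                       # e.g. n = 16 -> 0xFFFF
  kernel32 = Fiddle.dlopen('kernel32.dll')
  get_current_process = Fiddle::Function.new(
    kernel32['GetCurrentProcess'], [], Fiddle::TYPE_VOIDP
  )
  set_affinity = Fiddle::Function.new(
    kernel32['SetProcessAffinityMask'],
    [Fiddle::TYPE_VOIDP, Fiddle::TYPE_SIZE_T],
    Fiddle::TYPE_INT
  )
  set_affinity.call(get_current_process.call, mask)
end

restrict_to_first_n_cores(16)  # same effect as `start /AFFINITY FFFF koboldcpp.exe`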