[Feature Request]: Add System-level optimization for CPU inference to wiki #10514
Comments
The wiki is editable by anyone. #10180
Hi @LynxPDA! Could you please explain a bit what the options do and how to tweak them for other CPU capabilities? Thx
Hi @devingfx, jemalloc is a memory allocator.

background_thread
Enabling jemalloc background threads generally improves tail latency for application threads, since unused-memory purging is shifted to the dedicated background threads. In addition, unintended purging delay caused by application inactivity is avoided with background threads. Suggested: background_thread:true when jemalloc-managed threads can be allowed.

metadata_thp
Allowing jemalloc to utilize transparent huge pages for its internal metadata usually reduces TLB misses significantly, especially for programs with a large memory footprint and frequent allocation/deallocation activity. Metadata memory usage may increase due to the use of huge pages. Suggested for allocation-intensive programs: metadata_thp:auto or metadata_thp:always, which is expected to improve CPU utilization at a small memory cost.

dirty_decay_ms and muzzy_decay_ms
Decay time determines how fast jemalloc returns unused pages back to the operating system, and therefore provides a fairly straightforward trade-off between CPU and memory usage. A shorter decay time purges unused pages faster to reduce memory usage (usually at the cost of more CPU cycles spent on purging), and vice versa. Suggested: tune the values based on the desired trade-offs.

More details on tuning and each of the parameters can be found at https://github.com/jemalloc/jemalloc/blob/dev/TUNING.md

It is this setting that gives the maximum performance increase on the CPU. As for libiomp5.so, it optimizes parallel processing on the CPU. This optimization also gives a small gain, but much less than jemalloc does. I think these same settings will work for most other processors.
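On the NUM_THREADS side, the counts can be derived from the machine's core count instead of being hard-coded; a minimal sketch using `nproc` from GNU coreutils (the suggestion to halve for SMT is a general rule of thumb, not something from this thread):

```shell
# Derive OpenMP/MKL thread counts from the number of online logical CPUs.
# For compute-bound inference, physical cores often perform better than
# hyperthreads, so consider halving the value when SMT is enabled.
CORES=$(nproc)
export OMP_NUM_THREADS="$CORES"
export MKL_NUM_THREADS="$CORES"
echo "OMP_NUM_THREADS=$OMP_NUM_THREADS MKL_NUM_THREADS=$MKL_NUM_THREADS"
```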
Hi, thx for this quick reply! After more in-depth reading, there is a typo at the line ending: dirty_decay_ms: 60000,muzzy_decay_ms:> ... looks like a nano copy/paste with the line truncated ^^; I also meant the NUM_THREADS part... I don't know how to figure out my potato PC's capabilities... I know I don't have a GPU...
Is the intel-mkl package needed for non-Intel CPUs?
I get:
I tried with muzzy_decay_ms:60000, same errors...
I'm sorry, you're absolutely right: the line was truncated at the end when copying from the nano editor. Corrected. The correct option is below.

export MALLOC_CONF="oversize_threshold:1,background_thread:true,metadata_thp:auto,dirty_decay_ms:60000,muzzy_decay_ms:60000"
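A quick way to check whether a given MALLOC_CONF string parses (and whether the library path is right) is to run any trivial command with jemalloc preloaded and look for jemalloc's conf warning on stderr. A sketch, assuming the Ubuntu multiarch path used elsewhere in this thread:

```shell
# Run a no-op with jemalloc preloaded; an option jemalloc rejects
# produces an "Invalid conf ..." message on stderr.
JEMALLOC=/usr/lib/x86_64-linux-gnu/libjemalloc.so   # path is an assumption
CONF="oversize_threshold:1,background_thread:true,metadata_thp:auto,dirty_decay_ms:60000,muzzy_decay_ms:60000"
if [ -e "$JEMALLOC" ]; then
    err=$(LD_PRELOAD="$JEMALLOC" MALLOC_CONF="$CONF" /bin/true 2>&1)
    case "$err" in
        *Invalid*) echo "conf rejected: $err" ;;
        *)         echo "conf accepted" ;;
    esac
else
    echo "jemalloc not found at $JEMALLOC"
fi
```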
Yes, intel-mkl can work with AMD processors too.

sudo apt-get install libomp-dev
You didn't answer this... Looks like setting muzzy/dirty to 30000 bypasses the Invalid conf error, but I still have the ld.so error above...
Another subject: did you get ControlNet to work on CPU? It looks like the preprocessors are using the GPU (I get a CUDA out-of-memory error) despite the --use-cpu all param...
Yes, ControlNet on the CPU worked for me without problems with the command line arguments: If you are not using CUDA at all, you can install a nightly build of PyTorch for CPU only in the venv.
Regarding: As I wrote earlier, you can try to install the missing library separately, without installing intel-mkl, using the command below
Hi, thanks for your optimizations!
Hi all! On my side I could not get everything working on CPU. I get CUDA out-of-memory errors really often (even though it should not use CUDA in CPU-only mode, should it?) on image generation (which I restart, and it works the 2nd time), and I cannot use (from what I analysed) the "side models" like RealESRGAN, ControlNet preprocessing or faceswap for example... I installed A1111 with the default config the 1st time, then tweaked webui-user.sh afterward (notably with

Also, maybe an issue with my hardware: I do have a GPU, but with a poor 2 GB of RAM, so I want to use the CPU. Maybe there is some automatic detection that found the GPU and wants to use it?

PS: also, after a recent "update all but you don't know what's going on" button click in the extensions tab, image generation stopped working :( I plan to do a fresh reinstall; that is why I asked if I should follow a special process this time.
Please clarify with what parameters the results of 30s/it and 13s/it were obtained: sampler, steps, resolution, etc. The amount of free RAM affects the maximum resolution of the generated image more than the speed of generation. As part of the optimization, you can try the following actions:
I can suggest as an option: install a nightly build of PyTorch for CPU only in your virtual environment. For example, step by step:
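The step-by-step list from this comment did not survive extraction; below is one plausible sequence, assuming the default `stable-diffusion-webui` checkout and its `venv` (the nightly CPU index URL is PyTorch's official one):

```shell
# Install a CPU-only PyTorch nightly inside the webui's virtual environment.
# Repo path and venv layout are assumptions (A1111 defaults).
cd stable-diffusion-webui
. venv/bin/activate
pip install --pre torch torchvision \
    --index-url https://download.pytorch.org/whl/nightly/cpu
# Verify which build is active (CPU builds report cuda unavailable):
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
deactivate
```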
Hi! Thx a lot for helping newbies, you rock!
So far I can follow this on Windows (not Linux, which most directions assume), but how do I install jemalloc, libomp, MKL etc. for optimization like in the first post?
@LynxPDA When I updated this repository to version 1.7, the method stopped working and no longer gives a speed-up. Do you have this problem?
No, unfortunately there is no way for me to check this now. These parameters affect memory handling more than the program itself; perhaps there are some problems with the multithreading library lines:
Disabling them will slightly reduce the acceleration.
Hey, can anyone help me? I am using Stable Diffusion on a virtual GPU from vast.ai, and I am not getting all of the API endpoints like txt2img etc., even though I set the command line arg --api in the .bat file and exported COMMANDLINE_ARGS with --api in webui-user.sh.
It can't work on my computer. System info (from neofetch, ASCII logo omitted):

OS: Arch Linux x86_64
Kernel: 6.8.7-arch1-1
Uptime: 7 days, 54 mins
Packages: 2421 (pacman)
Shell: zsh 5.9
Resolution: 1920x1080
DE: Plasma 6.0.4
WM: KWin
Theme: Breeze [GTK2/3]
Icons: breeze [GTK2/3]
Terminal: yakuake
CPU: Intel Xeon E3-1240L v5 (8) @ 3.200GHz
GPU: NVIDIA GeForce GT 730
Memory: 24980MiB / 32047MiB

Run 1 (no tuning):

./webui.sh --use-cpu all --skip-torch-cuda-test --no-half --precision full --opt-split-attention --listen --no-hashing --enable-insecure-extension-access
100%|████████| 8/8 [01:21<00:00, 10.22s/it]

Run 2 (jemalloc preloaded):

export LD_PRELOAD=/usr/lib/libjemalloc.so:$LD_PRELOAD
export MALLOC_CONF="oversize_threshold:1,background_thread:true,metadata_thp:auto,dirty_decay_ms:60000,muzzy_decay_ms:60000"
./webui.sh --use-cpu all --skip-torch-cuda-test --no-half --precision autocast --opt-split-attention --listen --no-hashing --enable-insecure-extension-access
100%|████████| 8/8 [01:21<00:00, 10.20s/it]

Run 3 (jemalloc + thread settings + libiomp5):

export OMP_NUM_THREADS=8
export MKL_NUM_THREADS=8
export LD_PRELOAD=/usr/lib/libjemalloc.so:$LD_PRELOAD
export MALLOC_CONF="oversize_threshold:1,background_thread:true,metadata_thp:auto,dirty_decay_ms:60000,muzzy_decay_ms:60000"
export LD_PRELOAD=/opt/intel/oneapi/lib/intel64/libiomp5.so:$LD_PRELOAD
./webui.sh --use-cpu all --skip-torch-cuda-test --no-half --precision autocast --opt-split-attention --listen --no-hashing --enable-insecure-extension-access
100%|████████| 8/8 [01:23<00:00, 10.40s/it]
@lalala-233
Yeah, there are many differences between Arch Linux and Ubuntu.

% pacman -Ss jemalloc
extra/jemalloc 1:5.3.0-3 [installed]
    General-purpose scalable concurrent malloc implementation
% pacman -Ss intel-mkl
extra/intel-oneapi-mkl 2023.2.0_49495-2 [installed]
    Intel oneAPI Math Kernel Library

After I installed these packages, I found their locations differ from Ubuntu:

% locate libjemalloc.so
/usr/lib/libjemalloc.so
/usr/lib/libjemalloc.so.2
% locate libiomp5.so
/opt/intel/oneapi/compiler/2023.2.0/linux/compiler/lib/intel64_lin/libiomp5.so
/opt/intel/oneapi/lib/intel64/libiomp5.so

I think I installed the right packages, but this barely changed anything on my computer. Perhaps it will be more effective when generating at larger resolutions. I read the link you mentioned.
However, the webui itself uses tcmalloc as the memory allocator, so the gain from switching to jemalloc is likely to be limited.
@lalala-233
P.S. Yes, tcmalloc is used, but it's all about the specific memory-management settings.
Will anyone address my issue...?
@VeeDel This issue is about optimization for CPU inference. Perhaps you should find a similar issue or create a new one.
If I add NO_TCMALLOC="True", the performance is slightly better, but still worse than not adding any parameters.
I need help installing on AMD. Can you please tell me how I should install on Windows?
Is there an existing issue for this?
What would your feature do?
Inference on the CPU can be quite slow.
Using some system-level optimizations borrowed from HuggingFace, I managed to increase inference speed by 1.25x to 1.5x.
For my inference:
Proposed workflow
I added the following lines to the end of the webui-user.sh file:
export OMP_NUM_THREADS=16
export MKL_NUM_THREADS=16
export LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so:$LD_PRELOAD
export MALLOC_CONF="oversize_threshold:1,background_thread:true,metadata_thp:auto,dirty_decay_ms:60000,muzzy_decay_ms:60000"
export LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libiomp5.so:$LD_PRELOAD
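The LD_PRELOAD lines above assume Ubuntu's x86_64 multiarch layout; the libraries live elsewhere on other distros (e.g. /usr/lib/libjemalloc.so on Arch). A defensive sketch that only preloads what actually exists on the machine:

```shell
# Preload jemalloc and libiomp5 only if present, warning about anything
# missing instead of letting ld.so fail at startup.
# Paths are examples for Ubuntu; adjust for your distro.
for lib in /usr/lib/x86_64-linux-gnu/libjemalloc.so \
           /usr/lib/x86_64-linux-gnu/libiomp5.so; do
    if [ -e "$lib" ]; then
        export LD_PRELOAD="$lib${LD_PRELOAD:+:$LD_PRELOAD}"
    else
        echo "warning: $lib not found, skipping" >&2
    fi
done
echo "LD_PRELOAD=${LD_PRELOAD:-<empty>}"
```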
Having previously installed
Additional information
Other system information:
COMMANDLINE_ARGS="--precision autocast --use-cpu all --no-half --opt-channelslast --skip-torch-cuda-test --enable-insecure-extension-access"
python: 3.10.6 • torch: 2.1.0.dev20230506+cpu • xformers: N/A • gradio: 3.28.1 • commit: 5ab7f213 • checkpoint: b4391b7978
OS Ubuntu 22.04