Add cpu offloading #23
Conversation
768p, 10s: Max allocated memory: 10.431 GiB
768p, 5s: Max allocated memory: 10.432 GiB
384p, 5s: Max allocated memory: 10.432 GiB

Please note, nvidia-smi/nvtop will likely give incorrect VRAM usage amounts, as this version seems to use less VRAM if less VRAM was available (I was able to run 768p 10s 24fps with only 12GB available on my 24GB card). This seems to be reserved memory instead of allocated memory.
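For anyone reproducing these numbers: the figures above come from PyTorch's own counters, which differ from what nvidia-smi shows. A minimal sketch of how to query both peaks (standard torch.cuda calls, nothing specific to this repo):

```python
import torch

# PyTorch tracks two different peaks: memory actually allocated to tensors,
# and memory reserved by the caching allocator. nvidia-smi reports roughly the
# reserved amount plus CUDA context overhead, which is why its numbers look
# larger than the "max allocated" values quoted above.
gib = 1024 ** 3
print(f"Max allocated: {torch.cuda.max_memory_allocated() / gib:.3f} GiB")
print(f"Max reserved:  {torch.cuda.max_memory_reserved() / gib:.3f} GiB")
```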
Thanks for your great effort on CPU offloading! We will merge it into the main branch.
Nice, thanks. Maybe my 12GB 4070 can run this now.
I have an RTX 3060 and an RTX 4070 with 12GB each, running Ubuntu Linux. My RTX 4070 runs my X display server, so that takes about 1GB of VRAM. So 12GB of VRAM seems to be the current minimum. Not a problem, just an observation. Also, as another data point: for the 10-second 768p video, the reported time is 3261 for the RTX 3060 and 1510 for the RTX 4070.
How do we activate it? I tried running app.py with -cpu_offloading=True and tried both image-to-video and text-to-video, but VRAM usage goes up to 24GB (I have 16GB, so it spills into shared memory, which is super slow: it estimates around 3h45 for the default text-to-video test :( ).
It should be as simple as changing cpu_offload=False to cpu_offload=True in app.py on lines 99 and 121. If this doesn't work, make sure you are using a more recent commit with the torch.cuda.empty_cache() lines merged, as I suspect your issue might be cached VRAM not being deallocated as inference continues, since system memory fallback is available.
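For anyone else looking for the switch, a rough sketch of the kind of change meant here (the surrounding call and its other arguments are illustrative placeholders, not the exact contents of app.py):

```python
# Hypothetical excerpt around the lines mentioned above; everything except the
# offloading flag is a placeholder.
frames = model.generate(
    prompt=prompt,
    height=height,
    width=width,
    cpu_offload=True,  # flip this from False to True to enable offloading
)
# Note: don't also pre-move the pipeline to cuda (e.g. model.to("cuda")),
# or the weights will just be transferred a second time.
```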
Oh, it's in app.py? I changed it in the changed file from the commit, lol.

(dit) Warning: Do not preload pipeline components (i.e. to cuda) with cpu offloading enabled! Otherwise, a second transfer will occur, needlessly taking up time.

Edit: step 1 completed, and now VRAM has jumped to 27GB.
No more warning, but still a jump to 20GB at the end of step 1; then after a couple of minutes it goes to step 2 and jumps to 27GB of VRAM (RAM usage goes from 32GB down to 23GB over the same period, not sure if that helps).

"I'll make a PR for dealing with CPU offloading in app.py better tomorrow/today (after my next sleep)"

My hardware, btw:
Should be fixed in #76. feifeiobama also got to simplifying cpu_offloading in app.py before me, thanks!
This adds CPU offloading, allowing 768p 10s 24fps to run on a single 3090 (and likely within 12 GB too; if you have a 12 GB GPU, please let me know if it works!).
With these changes, inference can be run as in the following example (768p, 5s):
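(A rough sketch of what such a run could look like; the class name, argument names, and values below are assumptions for illustration, not the PR's verbatim example.)

```python
import torch
from pyramid_dit import PyramidDiTForVideoGeneration  # assumed import path
from diffusers.utils import export_to_video

# Assumed model setup; adjust the checkpoint path and variant to your install.
model = PyramidDiTForVideoGeneration(
    "PATH/TO/pyramid-flow-checkpoint",           # placeholder path
    model_dtype="bf16",
    model_variant="diffusion_transformer_768p",  # assumed 768p variant name
)

with torch.no_grad():
    frames = model.generate(
        prompt="A drone shot of waves crashing against rugged cliffs",
        height=768,
        width=1280,
        temp=16,                 # assumed temporal setting for a ~5 s clip
        guidance_scale=9.0,      # assumed defaults
        video_guidance_scale=5.0,
        output_type="pil",
        cpu_offloading=True,     # the option this PR adds
    )

export_to_video(frames, "./text_to_video_sample.mp4", fps=24)
```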
This makes inference take longer, since modules have to be moved between devices, but it also allows them to run in far less VRAM.
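The pattern behind that tradeoff is roughly the following (a generic sketch, not the PR's actual code): keep each large module on the CPU, move it to the GPU just before its forward pass, and move it back afterwards, emptying the cache so the freed memory is actually released. Each extra transfer is what makes inference slower, but between calls only the active module's weights occupy VRAM.

```python
import torch

def run_offloaded(module: torch.nn.Module, *inputs, device: str = "cuda"):
    """Run one module with its weights resident on the GPU only for the call.

    Assumes `inputs` are tensors; purely illustrative of the offloading idea.
    """
    module.to(device)                  # pay a host-to-device transfer here
    try:
        with torch.no_grad():
            return module(*(x.to(device) for x in inputs))
    finally:
        module.to("cpu")               # drop the GPU copy of the weights
        torch.cuda.empty_cache()       # hand cached blocks back to the driver
```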
One possible downside is that it's not possible to specify the load device. I can add another parameter if needed, though I didn't want to complicate the pipeline too much. i2v should also be supported with this method, but I haven't added it there or tested it.
I'll have exact memory allocations and timings shortly.