clear memory after offload #2994
Conversation
Thank you! @asomoza could you check if this branch of accelerate helps with the additional memory issue we were seeing with cpu offloading in FLUX?
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Yeah, there's definitely an improvement, but sadly not with Flux, which I suspect is a problem on our side and not accelerate's. But I tested it with SDXL and I can clearly see the difference: the first part is with the fix, which never goes over 8GB of VRAM; in the second part I commented out the fix and the VRAM consumption goes over 14GB.
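For reference, a minimal way to reproduce that kind of comparison (the exact script used above was not shared, so the model id and settings below are assumptions):

```python
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
)
pipe.enable_model_cpu_offload()  # module-level CPU offloading via accelerate hooks

torch.cuda.reset_peak_memory_stats()
image = pipe("a photo of an astronaut riding a horse", num_inference_steps=30).images[0]
print(f"peak VRAM: {torch.cuda.max_memory_allocated() / 1024**3:.2f} GiB")
```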
Thanks. Did it help at all with Flux? I guess the next step for us would be to record the memory usage at each invocation and compare it to ComfyUI's to note the points of difference, as sketched below. I think that would be nice to do and fix as a priority, wdyt?
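One simple way to record those memory checkpoints (an illustrative sketch; the helper below is hypothetical) would be to print the allocated and peak VRAM at each stage of interest and line the numbers up against ComfyUI's:

```python
import torch

def log_vram(tag: str) -> None:
    # Hypothetical helper: call this after text encoding, after each group of
    # denoising steps, and after VAE decoding to build a memory timeline.
    alloc = torch.cuda.memory_allocated() / 1024**3
    peak = torch.cuda.max_memory_allocated() / 1024**3
    print(f"[{tag}] allocated={alloc:.2f} GiB  peak={peak:.2f} GiB")
```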
Yeah, as discussed internally, this also helps with Flux when not using the quantized transformer and loading it afterwards, so overall this helps all the pipelines be more efficient with VRAM usage.
Nice! Then I guess we can safely merge this PR! cc @muellerzr
Yes!
Thanks for the PR @SunMarc. We recently added CogVideoX to Diffusers and were facing some issues with the total memory required for running inference. This PR seems to address those issues as well. The following is the code used for inference:

```python
import gc

import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video


def flush():
    # Release cached allocator blocks and reset the peak-memory counters
    # so the measurement starts from a clean slate.
    gc.collect()
    torch.cuda.empty_cache()
    torch.cuda.reset_max_memory_allocated()
    torch.cuda.reset_peak_memory_stats()


def bytes_to_giga_bytes(bytes):
    return f"{(bytes / 1024 / 1024 / 1024):.3f}"


flush()

prompt = (
    "A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest. "
    "The panda's fluffy paws strum a miniature acoustic guitar, producing soft, melodic tunes. Nearby, a few other "
    "pandas gather, watching curiously and some clapping in rhythm. Sunlight filters through the tall bamboo, "
    "casting a gentle glow on the scene. The panda's face is expressive, showing concentration and joy as it plays. "
    "The background includes a small, flowing stream and vibrant green foliage, enhancing the peaceful and magical "
    "atmosphere of this unique musical performance."
)

pipe = CogVideoXPipeline.from_pretrained("/raid/aryan/CogVideoX-trial", torch_dtype=torch.float16)
pipe.enable_model_cpu_offload()

video = pipe(prompt=prompt, guidance_scale=6, num_inference_steps=50).frames[0]

torch.cuda.empty_cache()

# Report current, peak allocated, and peak reserved VRAM after generation.
memory = bytes_to_giga_bytes(torch.cuda.memory_allocated())
max_memory = bytes_to_giga_bytes(torch.cuda.max_memory_allocated())
max_reserved = bytes_to_giga_bytes(torch.cuda.max_memory_reserved())
print(f"{memory=}")
print(f"{max_memory=}")
print(f"{max_reserved=}")

export_to_video(video, "output.mp4", fps=8)
```
The memory usage with this PR is consistent with the original implementation as reported here. Thanks again :)
The quick progress here is amazing, I would love it if this could be merged soon; looking forward to using CogVideoX on 24GB.
Nice!
What does this PR do?
This PR clears the memory in `CpuOffload` after we offload the previous module to the CPU (used a lot in `cpu_offload_with_hook`). This helps a lot with CPU offloading in diffusers, as it makes VRAM usage more efficient. cc @sayakpaul
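For illustration, here is a minimal standalone sketch of the idea (not accelerate's actual code; the function name and structure are assumptions): after the previously active module is moved back to the CPU, the CUDA caching allocator still holds the blocks its weights occupied, so emptying the cache before the next module is moved to the GPU keeps peak VRAM closer to a single module's footprint.

```python
import torch

def swap_modules_with_cache_clear(prev_module, next_module, device="cuda"):
    """Hypothetical helper mirroring the idea of this PR: clear the CUDA
    cache right after offloading, before onloading the next module."""
    if prev_module is not None:
        prev_module.to("cpu")      # offload the module that just ran
        torch.cuda.empty_cache()   # release cached blocks its weights occupied
    return next_module.to(device)  # onload the module about to run
```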