clear memory after offload #2994

Merged: SunMarc merged 3 commits into main from test-clear-memory-cpu-offload on Aug 9, 2024
Conversation

@SunMarc (Member) commented Aug 6, 2024

What does this PR do?

This PR clears the device memory cache in CpuOffload right after the previous module has been offloaded to the CPU (a pattern used heavily by cpu_offload_with_hook). This makes CPU offloading in diffusers noticeably more efficient in terms of VRAM usage.
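For illustration only, a minimal sketch (not the actual accelerate implementation; the helper name is made up) of what "clearing memory after offload" amounts to: once the previously executed module is back on the CPU, collecting garbage and emptying the CUDA cache frees the blocks its weights occupied, so the next module's weights don't push peak VRAM higher than necessary.

import gc

import torch
import torch.nn as nn


def offload_and_clear(module: nn.Module) -> nn.Module:
    # Hypothetical helper, for illustration only.
    module.to("cpu")  # offload the previously executed module
    gc.collect()  # drop lingering references to its GPU tensors
    if torch.cuda.is_available():
        torch.cuda.empty_cache()  # return cached blocks so peak VRAM stays low
    return module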

cc @sayakpaul

@sayakpaul (Member) left a comment

Thank you! @asomoza could you check if this branch of accelerate helps with the additional memory issue we were seeing with cpu offloading in FLUX?

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@asomoza (Member) commented Aug 6, 2024

Yeah, there's definitely an improvement, but sadly not with Flux, which I suspect is a problem on our side rather than with accelerate. But I tested it with SDXL and I can clearly see the difference:

[Screenshot: VRAM usage over time, with and without the fix]

The first part is with the fix, where VRAM usage never goes over 8 GB; in the second part I commented the fix out, and VRAM consumption goes above 14 GB.

@sayakpaul (Member)

Thanks. Did it help at all with Flux? I guess the next step for us would be to record memory usage across invocations and compare it to ComfyUI to note where we differ. I think that would be nice to do and fix as a priority, wdyt?

@asomoza (Member) commented Aug 7, 2024

Yeah, as discussed internally, this also helps with Flux when we are not using the quantized transformer and it is loaded afterwards, so overall this makes all the pipelines more efficient with VRAM usage.

@SunMarc requested a review from @muellerzr on August 7, 2024 at 11:37
@SunMarc (Member, Author) commented Aug 7, 2024

Nice! Then I guess we can safely merge this PR! cc @muellerzr

@sayakpaul (Member)

Yes!

@a-r-r-o-w (Member)

Thanks for the PR @SunMarc. We recently added CogVideoX to Diffusers and were facing some issues with the total memory required to run inference. This PR seems to address those issues as well. The following is the code used for inference:

Code
import gc

import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video


def flush():
    # Collect Python garbage, release cached CUDA blocks, and reset the
    # peak-memory counters so each measurement starts from a clean slate.
    gc.collect()
    torch.cuda.empty_cache()
    torch.cuda.reset_max_memory_allocated()
    torch.cuda.reset_peak_memory_stats()


def bytes_to_giga_bytes(bytes):
    return f"{(bytes / 1024 / 1024 / 1024):.3f}"


flush()

prompt = (
    "A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest. "
    "The panda's fluffy paws strum a miniature acoustic guitar, producing soft, melodic tunes. Nearby, a few other "
    "pandas gather, watching curiously and some clapping in rhythm. Sunlight filters through the tall bamboo, "
    "casting a gentle glow on the scene. The panda's face is expressive, showing concentration and joy as it plays. "
    "The background includes a small, flowing stream and vibrant green foliage, enhancing the peaceful and magical "
    "atmosphere of this unique musical performance."
)

pipe = CogVideoXPipeline.from_pretrained("/raid/aryan/CogVideoX-trial", torch_dtype=torch.float16)
pipe.enable_model_cpu_offload()
video = pipe(prompt=prompt, guidance_scale=6, num_inference_steps=50).frames[0]

torch.cuda.empty_cache()
memory = bytes_to_giga_bytes(torch.cuda.memory_allocated())
max_memory = bytes_to_giga_bytes(torch.cuda.max_memory_allocated())
max_reserved = bytes_to_giga_bytes(torch.cuda.max_memory_reserved())
print(f"{memory=}")
print(f"{max_memory=}")
print(f"{max_reserved=}")

export_to_video(video, "output.mp4", fps=8)

Without pipe.enable_model_cpu_offload(), the overall reserved memory was ~33 GB (14 GB for the models and the rest for denoising and decoding), while peak allocated memory was ~22 GB. So, if we disable the caching allocator with PYTORCH_NO_CUDA_MEMORY_CACHING=1, the pipeline runs in about 22 GB, but inference becomes extremely slow.
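For reference, one way to set that flag from Python (this has to happen before the first CUDA allocation, i.e. before the pipeline is constructed; setting it on the command line works just as well):

import os

# Disable the CUDA caching allocator for this process; must be set before
# the first CUDA allocation for it to take effect.
os.environ["PYTORCH_NO_CUDA_MEMORY_CACHING"] = "1"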

With pipe.enable_model_cpu_offload() and accelerate:main, the overall reserved memory was 27 GB. Still not ideal but better.

With pipe.enable_model_cpu_offload() and this branch, we get:

memory='0.008'
max_memory='10.805'
max_reserved='18.061'

which is consistent with the original implementation as reported here. Thanks again :)

cc @zRzRzRzRzRzRzR

@user425846

The quick progress here is amazing. I would love for this to be merged soon; looking forward to using CogVideo on 24 GB.

@muellerzr (Collaborator) left a comment

Nice!

@SunMarc merged commit 12a5bef into main on Aug 9, 2024 (28 checks passed)
@SunMarc deleted the test-clear-memory-cpu-offload branch on August 9, 2024 at 07:36