
AnimateDiff #5296

Closed
wants to merge 10 commits

Conversation

@itsadarshms commented on Oct 5, 2023

What does this PR do?

This PR implements AnimateDiff as discussed in #4524

Status -> 🧑‍💻 WIP

Model/Pipeline Description

With the advance of text-to-image models (e.g., Stable Diffusion) and corresponding personalization techniques such as DreamBooth and LoRA, everyone can manifest their imagination into high-quality images at an affordable cost. Subsequently, there is a great demand for image animation techniques to further combine generated static images with motion dynamics. In this report, we propose a practical framework to animate most of the existing personalized text-to-image models once and for all, saving efforts in model-specific tuning. At the core of the proposed framework is to insert a newly initialized motion modeling module into the frozen text-to-image model and train it on video clips to distill reasonable motion priors. Once trained, by simply injecting this motion modeling module, all personalized versions derived from the same base T2I readily become text-driven models that produce diverse and personalized animated images. We conduct our evaluation on several public representative personalized text-to-image models across anime pictures and realistic photographs, and demonstrate that our proposed framework helps these models generate temporally smooth animation clips while preserving the domain and diversity of their outputs.

Tasks/TODO

  • Implement Text-to-Video AnimateDiff Pipeline
  • Implement Image-to-Video AnimateDiff Pipeline
  • Implement support for longer frame generation (> 16) as in animatediff-cli (see the context-window sketch after this list)
  • Add Motion LoRA support
  • Write relevant test cases
  • Add/Update docstrings
  • Create Documentation
  • Add usage examples
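
One possible scheme for the longer-frame item above (a rough sketch of the overlapping-window idea used by animatediff-cli, not this PR's exact implementation) is to denoise the latents in overlapping 16-frame windows and blend the overlap at every step. The helper below and its window sizes are purely illustrative:

# Illustrative only: split num_frames into overlapping windows so a motion module
# trained on 16-frame clips can denoise longer sequences. At each denoising step the
# UNet runs once per window and latents of overlapping frames are averaged.
def context_windows(num_frames, context_length=16, overlap=4):
    stride = context_length - overlap
    start = 0
    while start < num_frames:
        end = min(start + context_length, num_frames)
        yield list(range(start, end))
        if end == num_frames:
            break
        start += stride

# Example: 32 frames -> windows [0..15], [12..27], [24..31]
windows = list(context_windows(32))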

Usage Examples

  • Stable Diffusion 1.5
import torch
from diffusers import UNet3DConditionModel, TextToVideoAnimateDiffPipeline, DDIMScheduler
from diffusers.utils import export_to_video

unet = UNet3DConditionModel.from_pretrained("itsadarshms/animatediff-v2-stable-diffusion-1.5", subfolder="unet", torch_dtype=torch.float16, use_safetensors=True)
pipe = TextToVideoAnimateDiffPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", unet=unet, torch_dtype=torch.float16)
pipe = pipe.to("cuda")

prompt = "close up photo of a rabbit, forest, haze, halation, bloom, dramatic atmosphere, centred, rule of thirds, 200mm 1.4f macro shot"
neg_prompt = "semi-realistic, cgi, 3d, render, sketch, cartoon, drawing, anime, text, close up, cropped, out of frame, worst quality, low quality, jpeg artifacts, ugly, duplicate, morbid, mutilated, extra fingers, mutated hands, poorly drawn hands, poorly drawn face, mutation, deformed, blurry, dehydrated, bad anatomy, bad proportions, extra limbs, cloned face, disfigured, gross proportions, malformed limbs, missing arms, missing legs, extra arms, extra legs, fused fingers, too many fingers, long neck"
video_frames = pipe(prompt, negative_prompt=neg_prompt, num_frames=16, num_inference_steps=25, guidance_scale=7.5).frames
video_path = export_to_video(video_frames, "test.mp4")
  • Realistic Vision 1.4
import torch
from diffusers import UNet3DConditionModel, TextToVideoAnimateDiffPipeline, DDIMScheduler
from diffusers.utils import export_to_video

unet = UNet3DConditionModel.from_pretrained("itsadarshms/animatediff-v2-stable-diffusion-1.5", subfolder="unet", variant="realistic_vision_v1.4", torch_dtype=torch.float16, use_safetensors=True)
pipe = TextToVideoAnimateDiffPipeline.from_pretrained("SG161222/Realistic_Vision_V1.4", unet=unet, torch_dtype=torch.float16)
pipe = pipe.to("cuda")

prompt = "close up photo of a rabbit, forest, haze, halation, bloom, dramatic atmosphere, centred, rule of thirds, 200mm 1.4f macro shot"
neg_prompt = "semi-realistic, cgi, 3d, render, sketch, cartoon, drawing, anime, text, close up, cropped, out of frame, worst quality, low quality, jpeg artifacts, ugly, duplicate, morbid, mutilated, extra fingers, mutated hands, poorly drawn hands, poorly drawn face, mutation, deformed, blurry, dehydrated, bad anatomy, bad proportions, extra limbs, cloned face, disfigured, gross proportions, malformed limbs, missing arms, missing legs, extra arms, extra legs, fused fingers, too many fingers, long neck"
video_frames = pipe(prompt, negative_prompt=neg_prompt, num_frames=16, num_inference_steps=25, guidance_scale=7.5).frames
video_path = export_to_video(video_frames, "test.mp4")
  • To use the v1 models, change the model id to itsadarshms/animatediff-v1-stable-diffusion-1.5
  • To use the v14 variant of the AnimateDiff motion module, set the variant parameter of UNet3DConditionModel.from_pretrained to variant="v14" or variant="v14.realistic_vision_v1.4", for example:
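
For example (the repo id below follows the v1 bullet above; the exact repo/variant pairing is an assumption, check the hosted weights):

import torch
from diffusers import UNet3DConditionModel

# Same loading call as in the examples above, only the repo/variant changes.
unet = UNet3DConditionModel.from_pretrained(
    "itsadarshms/animatediff-v1-stable-diffusion-1.5",  # v1 repo, per the first bullet
    subfolder="unet",
    variant="v14",  # or "v14.realistic_vision_v1.4"
    torch_dtype=torch.float16,
    use_safetensors=True,
)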

Check the HF model repos below for the available weights.

Design Thinking

Since AnimateDiff only requires modifications to the UNet, the above HF repos host only UNet3D weights that are merged with the temporal module weights from AnimateDiff. While this approach works, I observed a few challenges and would greatly appreciate any suggestions or insights to address them.

  • The current approach requires exporting the UNet weights for every SD variant, i.e., if a user wants to use AnimateDiff with Realistic Vision, a weight file has to be exported from the UNet of Realistic Vision (as in the usage examples). Since there are a lot of community models with different release versions available, this approach won't scale well.
  • Another approach that came to my mind is exporting only the temporal weights from AnimateDiff in the diffusers UNet format and using the SD model weights as-is. This would eliminate the need to export weights for every variant, but it requires overriding the from_pretrained method. A rough sketch of this idea is shown below.
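
A rough sketch of what the second option could look like at the state-dict level (the local file names here are hypothetical and only for illustration; the actual key mapping would still need to be worked out):

from safetensors.torch import load_file
from diffusers import UNet3DConditionModel

# Build an empty 3D UNet from a known config (repo id reused from the examples above).
config = UNet3DConditionModel.load_config(
    "itsadarshms/animatediff-v2-stable-diffusion-1.5", subfolder="unet"
)
unet_3d = UNet3DConditionModel.from_config(config)

# Hypothetical local files: the spatial weights of whichever SD 1.5 derivative the
# user picked, plus a file holding only the temporal (motion-module) tensors
# renamed to diffusers-style keys.
base_state = load_file("sd_variant_unet/diffusion_pytorch_model.safetensors")
motion_state = load_file("animatediff/motion_module.safetensors")

# Spatial keys come from the base model, temporal keys from AnimateDiff, so a single
# motion file could serve every community variant.
merged_state = {**base_state, **motion_state}

# strict=False tolerates naming gaps while the key mapping is still WIP.
missing, unexpected = unet_3d.load_state_dict(merged_state, strict=False)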

Who can review?

@patrickvonplaten @sayakpaul

@itsadarshms
Author

If this implementation is good to go, I will work on the rest of the tasks mentioned in the Tasks/TODO section.

@sayakpaul
Member

Cc: @DN6

@itsadarshms changed the title from Animatediff to AnimateDiff on Oct 5, 2023
@tumurzakov

tumurzakov commented Oct 11, 2023

Hello, take a look at my pipeline; I have already implemented a lot of the things from your TODO. Maybe you could pick something from it.

  • lora
  • prompt walk
  • train script
  • updated to latest diffusers version
  • use diffusers cross attention rather than reimplementing it as in the original repo

repo

@adhikjoshi
Copy link

(quoting @tumurzakov's comment above)

This is really good, can you open a PR?

@itsadarshms
Author

itsadarshms commented Oct 12, 2023

@tumurzakov Thanks for sharing your work 🙌. My current PR includes the following:

  • Unet3D adaptation for AnimateDiff
  • Motion Module adaptation using diffusers Attention class
  • Text2Video Pipeline with infinite context length
  • Image2Video/Video2Video Pipeline (This is in a different branch as of now)

Since I've used the existing diffusers components, LoRA support for the UNet is already available. The item in my TODO refers to support for the new Motion LoRAs in AnimateDiff.
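
For context, the Motion LoRAs ship as low-rank up/down matrices targeting the temporal attention projections, so supporting them essentially means folding scale * (up @ down) into the corresponding motion-module weights. A state-dict-level sketch (the key naming is hypothetical):

def merge_motion_lora(unet_state, lora_state, scale=1.0):
    # Fold LoRA deltas into the motion-module weights: W <- W + scale * (up @ down).
    # Assumes keys like "<target>.lora_up.weight" / "<target>.lora_down.weight" and a
    # matching "<target>.weight" in the UNet state dict (naming is hypothetical).
    merged = dict(unet_state)
    for key, up in lora_state.items():
        if not key.endswith(".lora_up.weight"):
            continue
        target = key[: -len(".lora_up.weight")]
        down = lora_state[target + ".lora_down.weight"]
        merged[target + ".weight"] = merged[target + ".weight"] + scale * (up.float() @ down.float())
    return merged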

The prompt walk and ControlNet support in your repo look like good features to have; I've seen some good community showcases of them. I will add support for those in my PR.

@itsadarshms
Author

@DN6 should I wait for a review of the existing commits, or can I push new commits here? I currently have the Image2Video/Video2Video pipeline ready to merge.

@DN6
Collaborator

DN6 commented Oct 16, 2023

Hi @itsadarshms. Nice work putting this together. After taking a look into the original implementation and how AnimateDiff works, I think it might be better to introduce a dedicated model class for the AnimateDiff UNet.

I've opened this PR with a proposed design. Since AnimateDiff behaves a bit like a ControlNet/Adapter (modifying the intermediate states of a 2D UNet) and allows saving/loading the motion modules separately, I think it would be better to try and follow a similar design paradigm.

You've already brought up the challenges related to fusing the motion weights and the 2D UNet weights. A ControlNet/Adapter-style implementation helps circumvent some of those issues, and we wouldn't need to override the save_pretrained methods (we can just define dedicated ones in the new UNet and pipelines for saving motion modules).

Here's what I'm thinking for the API:

# Loading motion module into Pipeline
from diffusers import MotionAdapter, AnimateDiffPipeline, UNet2DConditionModel, UNetMotionModel

motion_adapter = MotionAdapter.from_pretrained("<path to saved motion modules>")
pipe = AnimateDiffPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", motion_adapter=motion_adapter)

# Calling pipe.unet should return an UNetMotionModel object

# Create brand new UNetMotionModel with all random weights
unet = UNetMotionModel()

# Load from an existing 2D UNet and MotionAdapter
unet2D = UNet2DConditionModel.from_pretrained("...")
motion_adapter = MotionAdapter.from_pretrained("...")

unet_motion = UNetMotionModel.from_unet2d(unet2D, motion_adapter=motion_adapter)  # motion_adapter is optional

# Or load motion module after init
unet_motion.load_motion_modules(motion_adapter)

# Save only motion modules
unet_motion.save_motion_module("<path to save model>", push_to_hub=True)

# Save all weights to a single model repo (Including UNet weights) 
unet_motion.save_pretrained()

# Load fused models (Where the motion weights are saved along with the UNet weights in a single repo)
unet_motion = UNetMotionModel.from_pretrained("<path to model>") 
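
And assuming the pipeline call mirrors the existing text-to-video pipelines (prompt / num_frames / a frames output; still to be confirmed by the final implementation), generation could look like:

import torch
from diffusers import MotionAdapter, AnimateDiffPipeline
from diffusers.utils import export_to_video

adapter = MotionAdapter.from_pretrained("<path to saved motion modules>")
pipe = AnimateDiffPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", motion_adapter=adapter, torch_dtype=torch.float16
).to("cuda")

# Call signature assumed to match the text-to-video pipelines shown earlier in this PR.
frames = pipe(
    "close up photo of a rabbit, forest, haze, dramatic atmosphere",
    num_frames=16,
    num_inference_steps=25,
    guidance_scale=7.5,
).frames
export_to_video(frames, "animatediff.mp4")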

@itsadarshms
Author

@DN6 I agree, this design is clean and reduces modifications to the existing modules (modifications that would be mostly redundant anyway). Implementing Motion LoRAs in this design will be neat as well.

Let me know if you are looking for contributions; I would be happy to collaborate.

@DN6
Collaborator

DN6 commented Oct 18, 2023

@itsadarshms Of course. Once we finalize the pipeline, there's a lot of additional functionality that can be added (ControlNets, LoRAs, etc.).

@Ir1d

Ir1d commented Oct 26, 2023

Hi @itsadarshms, thanks for the great work! Could you please share how you created the UNet safetensors?

unet = UNet3DConditionModel.from_pretrained("itsadarshms/animatediff-v2-stable-diffusion-1.5", subfolder="unet", variant="realistic_vision_v1.4", torch_dtype=torch.float16, use_safetensors=True)

I'm hoping to convert other base models such as RcnzCartoon.


This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@github-actions bot added the stale label (Issues that haven't received updates) on Nov 21, 2023
@github-actions bot closed this on Nov 29, 2023