
AnimateDiff #5296

Closed
wants to merge 10 commits

Conversation

@itsadarshms commented on Oct 5, 2023

What does this PR do?

This PR implements AnimateDiff as discussed in #4524

Status -> 🧑‍💻 WIP

Model/Pipeline Description

With the advance of text-to-image models (e.g., Stable Diffusion) and corresponding personalization techniques such as DreamBooth and LoRA, everyone can manifest their imagination into high-quality images at an affordable cost. Subsequently, there is a great demand for image animation techniques to further combine generated static images with motion dynamics. In this report, we propose a practical framework to animate most of the existing personalized text-to-image models once and for all, saving efforts in model-specific tuning. At the core of the proposed framework is to insert a newly initialized motion modeling module into the frozen text-to-image model and train it on video clips to distill reasonable motion priors. Once trained, by simply injecting this motion modeling module, all personalized versions derived from the same base T2I readily become text-driven models that produce diverse and personalized animated images. We conduct our evaluation on several public representative personalized text-to-image models across anime pictures and realistic photographs, and demonstrate that our proposed framework helps these models generate temporally smooth animation clips while preserving the domain and diversity of their outputs.

Tasks/TODO

  • Implement Text-to-Video AnimateDiff Pipeline
  • Implement Image-to-Video AnimateDiff Pipeline
  • Implement support for longer frame generation (> 16) as in animatediff-cli (see the context-window sketch after this list)
  • Add Motion LoRA support
  • Write relevant test cases
  • Add/Update docstrings
  • Create Documentation
  • Add usage examples
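
One possible scheme for the longer-frame item above (a rough sketch of the overlapping-window idea used by animatediff-cli, not this PR's exact implementation) is to denoise the latents in overlapping 16-frame windows and blend the overlap at every step. The helper below and its window sizes are purely illustrative:

# Illustrative only: split num_frames into overlapping windows so a motion module
# trained on 16-frame clips can denoise longer sequences. At each denoising step the
# UNet runs once per window and latents of overlapping frames are averaged.
def context_windows(num_frames, context_length=16, overlap=4):
    stride = context_length - overlap
    start = 0
    while start < num_frames:
        end = min(start + context_length, num_frames)
        yield list(range(start, end))
        if end == num_frames:
            break
        start += stride

# Example: 32 frames -> windows [0..15], [12..27], [24..31]
windows = list(context_windows(32))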

Usage Examples

  • Stable Diffusion 1.5
import torch
from diffusers import UNet3DConditionModel, TextToVideoAnimateDiffPipeline, DDIMScheduler
from diffusers.utils import export_to_video

unet = UNet3DConditionModel.from_pretrained("itsadarshms/animatediff-v2-stable-diffusion-1.5", subfolder="unet", torch_dtype=torch.float16, use_safetensors=True)
pipe = TextToVideoAnimateDiffPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", unet=unet, torch_dtype=torch.float16)
pipe = pipe.to("cuda")

prompt = "close up photo of a rabbit, forest, haze, halation, bloom, dramatic atmosphere, centred, rule of thirds, 200mm 1.4f macro shot"
neg_prompt = "semi-realistic, cgi, 3d, render, sketch, cartoon, drawing, anime, text, close up, cropped, out of frame, worst quality, low quality, jpeg artifacts, ugly, duplicate, morbid, mutilated, extra fingers, mutated hands, poorly drawn hands, poorly drawn face, mutation, deformed, blurry, dehydrated, bad anatomy, bad proportions, extra limbs, cloned face, disfigured, gross proportions, malformed limbs, missing arms, missing legs, extra arms, extra legs, fused fingers, too many fingers, long neck"
video_frames = pipe(prompt, negative_prompt=neg_prompt, num_frames=16, num_inference_steps=25, guidance_scale=7.5).frames
video_path = export_to_video(video_frames, "test.mp4")
  • Realistic Vision 1.4
import torch
from diffusers import UNet3DConditionModel, TextToVideoAnimateDiffPipeline, DDIMScheduler
from diffusers.utils import export_to_video

unet = UNet3DConditionModel.from_pretrained("itsadarshms/animatediff-v2-stable-diffusion-1.5", subfolder="unet", variant="realistic_vision_v1.4", torch_dtype=torch.float16, use_safetensors=True)
pipe = TextToVideoAnimateDiffPipeline.from_pretrained("SG161222/Realistic_Vision_V1.4", unet=unet, torch_dtype=torch.float16)
pipe = pipe.to("cuda")

prompt = "close up photo of a rabbit, forest, haze, halation, bloom, dramatic atmosphere, centred, rule of thirds, 200mm 1.4f macro shot"
neg_prompt = "semi-realistic, cgi, 3d, render, sketch, cartoon, drawing, anime, text, close up, cropped, out of frame, worst quality, low quality, jpeg artifacts, ugly, duplicate, morbid, mutilated, extra fingers, mutated hands, poorly drawn hands, poorly drawn face, mutation, deformed, blurry, dehydrated, bad anatomy, bad proportions, extra limbs, cloned face, disfigured, gross proportions, malformed limbs, missing arms, missing legs, extra arms, extra legs, fused fingers, too many fingers, long neck"
video_frames = pipe(prompt, negative_prompt=neg_prompt, num_frames=16, num_inference_steps=25, guidance_scale=7.5).frames
video_path = export_to_video(video_frames, "test.mp4")
  • To use the v1 models, change the model id to itsadarshms/animatediff-v1-stable-diffusion-1.5
  • To use the v14 variant of the AnimateDiff motion module, set the variant parameter of UNet3DConditionModel.from_pretrained to variant="v14" or variant="v14.realistic_vision_v1.4", for example:
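
For example (the repo id below follows the v1 bullet above; the exact repo/variant pairing is an assumption, check the hosted weights):

import torch
from diffusers import UNet3DConditionModel

# Same loading call as in the examples above, only the repo/variant changes.
unet = UNet3DConditionModel.from_pretrained(
    "itsadarshms/animatediff-v1-stable-diffusion-1.5",  # v1 repo, per the first bullet
    subfolder="unet",
    variant="v14",  # or "v14.realistic_vision_v1.4"
    torch_dtype=torch.float16,
    use_safetensors=True,
)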

Check the HF model repos below for the available weights.

Design Thinking

Since AnimateDiff only requires modifications to the UNet, the above HF repos host only UNet3D weights that are merged with the temporal module weights from AnimateDiff. While this approach works, I observed a few challenges and would greatly appreciate any suggestions or insights to address them.

  • The current approach requires exporting the UNet weights for every SD variant, i.e., if a user wants to use AnimateDiff with Realistic Vision, a weight file has to be exported from the UNet of Realistic Vision (as in the usage examples). Since there are a lot of community models with different release versions available, this approach won't scale well.
  • Another approach that came to my mind is exporting only the temporal weights from AnimateDiff in the diffusers UNet format and using the SD model weights as-is. This would eliminate the need to export weights for every variant, but it requires overriding the from_pretrained method. A rough sketch of this idea is shown below.
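
A rough sketch of what the second option could look like at the state-dict level (the local file names here are hypothetical and only for illustration; the actual key mapping would still need to be worked out):

from safetensors.torch import load_file
from diffusers import UNet3DConditionModel

# Build an empty 3D UNet from a known config (repo id reused from the examples above).
config = UNet3DConditionModel.load_config(
    "itsadarshms/animatediff-v2-stable-diffusion-1.5", subfolder="unet"
)
unet_3d = UNet3DConditionModel.from_config(config)

# Hypothetical local files: the spatial weights of whichever SD 1.5 derivative the
# user picked, plus a file holding only the temporal (motion-module) tensors
# renamed to diffusers-style keys.
base_state = load_file("sd_variant_unet/diffusion_pytorch_model.safetensors")
motion_state = load_file("animatediff/motion_module.safetensors")

# Spatial keys come from the base model, temporal keys from AnimateDiff, so a single
# motion file could serve every community variant.
merged_state = {**base_state, **motion_state}

# strict=False tolerates naming gaps while the key mapping is still WIP.
missing, unexpected = unet_3d.load_state_dict(merged_state, strict=False)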

Who can review?

@patrickvonplaten @sayakpaul

@itsadarshms
Author

If this implementation is good to go, I will work on the rest of the tasks mentioned in the Tasks/TODO section.

@sayakpaul
Member

Cc: @DN6

@itsadarshms changed the title from Animatediff to AnimateDiff on Oct 5, 2023
@tumurzakov

tumurzakov commented Oct 11, 2023

Hello, take a look at my pipeline; I have already implemented a lot of the things from your TODO. Maybe you could pick something from it.

  • lora
  • prompt walk
  • train script
  • updated to latest diffusers version
  • use diffusers cross attention rather than reimplementing it as in the original repo

repo

@adhikjoshi
Copy link

(quoting @tumurzakov's comment above)

This is really good, can you open a PR?

@itsadarshms
Author

itsadarshms commented Oct 12, 2023

@tumurzakov Thanks for sharing your work 🙌. My current PR includes the following:

  • Unet3D adaptation for AnimateDiff
  • Motion Module adaptation using diffusers Attention class
  • Text2Video Pipeline with infinite context length
  • Image2Video/Video2Video Pipeline (This is in a different branch as of now)

Since I've used the existing diffusers components, LoRA support for the UNet is already available. The item in my TODO refers to support for the new Motion LoRAs in AnimateDiff.
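
For context, the Motion LoRAs ship as low-rank up/down matrices targeting the temporal attention projections, so supporting them essentially means folding scale * (up @ down) into the corresponding motion-module weights. A state-dict-level sketch (the key naming is hypothetical):

def merge_motion_lora(unet_state, lora_state, scale=1.0):
    # Fold LoRA deltas into the motion-module weights: W <- W + scale * (up @ down).
    # Assumes keys like "<target>.lora_up.weight" / "<target>.lora_down.weight" and a
    # matching "<target>.weight" in the UNet state dict (naming is hypothetical).
    merged = dict(unet_state)
    for key, up in lora_state.items():
        if not key.endswith(".lora_up.weight"):
            continue
        target = key[: -len(".lora_up.weight")]
        down = lora_state[target + ".lora_down.weight"]
        merged[target + ".weight"] = merged[target + ".weight"] + scale * (up.float() @ down.float())
    return merged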

The prompt walk and ControlNet support in your repo look like good features to have; I've seen some good community showcases of them. I will add support for those in my PR.

@itsadarshms
Author

@DN6 should I wait for a review of the existing commits, or can I push new commits here? I currently have the Image2Video/Video2Video pipeline ready to merge.

@DN6
Collaborator

DN6 commented Oct 16, 2023

Hi @itsadarshms. Nice work putting this together. After taking a look into the original implementation and how AnimateDiff works, I think it might be better to introduce a dedicated model class for the AnimateDiff UNet.

I've opened this PR with a proposed design. Since AnimateDiff behaves a bit like a ControlNet/Adapter (modifying the intermediate states of a 2D UNet) and allows saving/loading the motion modules separately, I think it would be better to try and follow a similar design paradigm.

You've already brought up the challenges related to fusing the motion weights and the 2D UNet weights. A ControlNet/Adapter-style implementation helps circumvent some of those issues, and we wouldn't need to override the save_pretrained methods (we can just define dedicated ones in the new UNet and pipelines for saving motion modules).

Here's what I'm thinking for the API:

# Loading motion module into Pipeline
from diffusers import MotionAdapter, AnimateDiffPipeline, UNet2DConditionModel, UNetMotionModel

motion_adapter = MotionAdapter.from_pretrained("<path to saved motion modules>")
pipe = AnimateDiffPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", motion_adapter=motion_adapter)

# Calling pipe.unet should return an UNetMotionModel object

# Create brand new UNetMotionModel with all random weights
unet = UNetMotionModel()

# Load from an existing 2D UNet and MotionAdapter
unet2D = UNet2DConditionModel.from_pretrained("...")
motion_adapter = MotionAdapter.from_pretrained("...")

unet_motion = UNetMotionModel.from_unet2d(unet2D, motion_adapter=motion_adapter)  # motion_adapter is optional

# Or load motion module after init
unet_motion.load_motion_modules(motion_adapter)

# Save only motion modules
unet_motion.save_motion_module("<path to save model>", push_to_hub=True)

# Save all weights to a single model repo (Including UNet weights) 
unet_motion.save_pretrained()

# Load fused models (Where the motion weights are saved along with the UNet weights in a single repo)
unet_motion = UNetMotionModel.from_pretrained("<path to model>") 
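
And assuming the pipeline call mirrors the existing text-to-video pipelines (prompt / num_frames / a frames output; still to be confirmed by the final implementation), generation could look like:

import torch
from diffusers import MotionAdapter, AnimateDiffPipeline
from diffusers.utils import export_to_video

adapter = MotionAdapter.from_pretrained("<path to saved motion modules>")
pipe = AnimateDiffPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", motion_adapter=adapter, torch_dtype=torch.float16
).to("cuda")

# Call signature assumed to match the text-to-video pipelines shown earlier in this PR.
frames = pipe(
    "close up photo of a rabbit, forest, haze, dramatic atmosphere",
    num_frames=16,
    num_inference_steps=25,
    guidance_scale=7.5,
).frames
export_to_video(frames, "animatediff.mp4")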

@itsadarshms
Author

@DN6 I agree, this design is clean and reduces modifications to the existing modules (modifications that would be mostly redundant anyway). Implementing Motion LoRAs in this design will be neat as well.

Let me know if you are looking for contributions; I would be happy to collaborate.

@DN6
Collaborator

DN6 commented Oct 18, 2023

@itsadarshms Of course. Once we finalize the pipeline, there's a lot of additional functionality that can be added (ControlNets, LoRAs, etc.).

@Ir1d

Ir1d commented Oct 26, 2023

Hi @itsadarshms, thanks for the great work! Could you please share how you created the UNet safetensors?

unet = UNet3DConditionModel.from_pretrained("itsadarshms/animatediff-v2-stable-diffusion-1.5", subfolder="unet", variant="realistic_vision_v1.4", torch_dtype=torch.float16, use_safetensors=True)

I'm hoping to convert other base models such as RcnzCartoon.


This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@github-actions bot added the stale label (Issues that haven't received updates) on Nov 21, 2023
@github-actions bot closed this on Nov 29, 2023