Jianzong Wu · Xiangtai Li · Yanhong Zeng · Jiangning Zhang · Qianyu Zhou · Yining Li · Yunhai Tong · Kai Chen
Customization and subject motion control
Hybrid control over customization, subject motion, and camera motion
- [2024-6-28] Inference code, training code, and checkpoints are released!
In this work, we present MotionBooth, an innovative framework for animating customized subjects with precise control over both object and camera movements. By leveraging a few images of a specific object, we efficiently fine-tune a text-to-video model to capture the object's shape and attributes accurately. Our approach introduces a subject region loss and a video preservation loss to enhance the subject's learning performance, along with a subject token cross-attention loss to integrate the customized subject with motion control signals. Additionally, we propose training-free techniques for managing subject and camera motions during inference. In particular, we utilize cross-attention map manipulation to govern subject motion and introduce a novel latent shift module for camera movement control. MotionBooth excels in preserving the appearance of subjects while simultaneously controlling the motions in generated videos. Extensive quantitative and qualitative evaluations demonstrate the superiority and effectiveness of our method. Models and code will be made publicly available.
- In this repo, we use Python 3.11 and PyTorch 2.1.2. Newer versions of Python and PyTorch may also be compatible.
# Create a new environment with Conda
conda create -n motionbooth python=3.11
conda activate motionbooth
# Install PyTorch
conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia
- We strongly recommend using Diffusers as the codebase for training and inference of diffusion-based models. Diffusers provides easy-to-use and interoperable implementations of diffusion-based generative models.
- In this repo, we use diffusers 0.29.0. If you use a newer or older version, you may need to adjust some import paths manually. Please refer to the Diffusers documentation for details.
# Install diffusers, transformers, and accelerate
conda install -c conda-forge diffusers==0.29.0 transformers accelerate
# Install xformers for PyTorch 2.1.2
pip install xformers==0.0.23.post1
# Install other dependencies
pip install -r requirements.txt
We collect 26 objects from DreamBooth and CustomDiffusion for the experiments in the paper. These objects include pets, plushies, toys, cartoons, and vehicles. We also annotate masks for each image. We name this collection the MotionBooth dataset. Please download our dataset from Hugging Face.
Note that a few images from the original datasets are removed because of the low quality of their masks. Additionally, a few images are resized and cropped to square shapes.
After downloading, please unzip and place the dataset under the `data` folder. It should look like this:
data
|- MotionBooth
   |- images
      |- cat2
      |- ...
   |- masks
      |- cat2
      |- ...
|- scripts
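If you want to sanity-check the layout programmatically, the short sketch below lists the image/mask pairs for one subject. It only assumes the `images/<subject>` and `masks/<subject>` structure shown above; the exact file names and extensions are not guaranteed.

```python
from pathlib import Path

# Assumed layout: data/MotionBooth/images/<subject>/ and data/MotionBooth/masks/<subject>/
# File names and extensions are illustrative; adjust them to the released dataset.
root = Path("data/MotionBooth")
subject = "cat2"

images = sorted((root / "images" / subject).glob("*"))
masks = sorted((root / "masks" / subject).glob("*"))

for img, msk in zip(images, masks):
    print(f"image: {img.name:<20} mask: {msk.name}")
```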
We use Zeroscope and LaVie-base as the base T2V models. Please download Zeroscope from its official Hugging Face page. For LaVie, we provide a script to convert the original checkpoint into a format compatible with Diffusers. Please download the LaVie-base model and the Stable-Diffusion-v1.4 checkpoint.
Then, organize the pre-trained models in the `checkpoints` folder:
checkpoints
|- zeroscope_v2_576w
|- stable-diffusion-v1-4
|- lavie_base.pt
Then, run the following command to convert the checkpoint:
python -m scripts.convert_ckpts.convert_lavie_to_diffusers
Next, rename the `stable-diffusion-v1-4` folder to `lavie`. Additionally, replace the config files with LaVie's configs, following the checkpoint guide.
The final checkpoint folder looks like this:
checkpoints
|- zeroscope_v2_576w
|- lavie
|- lavie_base.pt (Not used anymore)
We use the converted lavie model for all the experiments.
For quick inference and for reproducing the examples in the paper, please download our trained customized checkpoints for the target subjects from Hugging Face. The names of the checkpoints correspond to the subject names in the MotionBooth dataset.
Please place the checkpoints in the `checkpoints` folder like this:
checkpoints
|- customized
   |- zeroscope
      |- ...
   |- lavie
      |- ...
|- zeroscope_v2_576w
|- lavie
|- lavie_base.pt (Not used anymore)
We use simple script files to specify the subject and camera motion. We provide several examples in `data/scripts`. In these script files, the "bbox" field controls the bounding box sequence for the subject's motion, while the "camera speed" field controls the corresponding camera motion speed.
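For illustration only, a motion script roughly pairs a per-frame bounding box sequence with a camera speed, as in the hypothetical sketch below; the exact key names, coordinate convention, and value ranges are assumptions, so please copy one of the released files in `data/scripts` and edit it for real runs.

```python
import json

# Hypothetical example of a motion script; field names and value formats are
# assumptions based on the description above, not the repository's exact schema.
script = {
    "prompt": "a plushie panda running on the grass",
    # One [x1, y1, x2, y2] box (normalized to [0, 1]) per frame for the subject.
    "bbox": [
        [0.10, 0.40, 0.30, 0.80],
        [0.20, 0.40, 0.40, 0.80],
        [0.30, 0.40, 0.50, 0.80],
    ],
    # Camera motion speed; the released scripts expose this as "camera speed".
    "camera speed": [0.5, 0.0],
}

with open("my_script.json", "w") as f:
    json.dump(script, f, indent=2)
```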
We provide the inference script in `scripts/inference.py` for all types of MotionBooth applications. It uses Accelerate's PartialState to support multi-GPU inference.
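Conceptually, PartialState simply splits the requested samples across the available GPUs; the snippet below is a simplified sketch of that pattern, not the repository's actual code.

```python
from accelerate import PartialState

# Simplified sketch of multi-GPU sample splitting with Accelerate's PartialState;
# the real logic lives in scripts/inference.py.
state = PartialState()
sample_ids = list(range(8))  # e.g. --num_samples 8

with state.split_between_processes(sample_ids) as local_ids:
    for sample_id in local_ids:
        # Each process generates only its share of the samples.
        print(f"rank {state.process_index} handles sample {sample_id}")
```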
The latent shift module proposed in the paper can freely control camera motion, with or without a customized model. We provide scripts in `data/scripts/camera` to control camera motion in vanilla text-to-video pipelines.
python -m scripts.inference \
--script_path data/scripts/camera/waterfall.json \
--model_name lavie \
--num_samples 1 \
--start_shift_step 10 \
--max_shift_steps 10
You can check the meaning of each parameter at the bottom of the script file.
For multi GPU inference, please run commands like this:
accelerate launch \
--multi_gpu \
-m scripts.inference \
--script_path data/scripts/camera/waterfall.json \
--model_name lavie \
--num_samples 8 \
--start_shift_step 10 \
--max_shift_steps 10
Feel free to try other scripts in `data/scripts/camera` and your own text prompts or camera speeds!
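If you are curious how the latent shift module behaves, the toy sketch below conveys the idea under simplifying assumptions: during a window of denoising steps (controlled by `--start_shift_step` and `--max_shift_steps`), the noisy video latents are translated along the panning direction. This is only an illustration, not the authors' exact implementation.

```python
import torch

def shift_latents(latents: torch.Tensor, step: int,
                  start_shift_step: int = 10, max_shift_steps: int = 10,
                  dx: int = 1, dy: int = 0) -> torch.Tensor:
    """Toy illustration of latent shifting for camera motion control.

    latents: (batch, channels, frames, height, width) video latents.
    The shift is only applied inside the window
    [start_shift_step, start_shift_step + max_shift_steps); dx/dy play the
    role of the camera speed from the motion script.
    """
    if start_shift_step <= step < start_shift_step + max_shift_steps:
        # torch.roll translates the spatial grid, approximating a camera pan.
        latents = torch.roll(latents, shifts=(dy, dx), dims=(-2, -1))
    return latents
```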
By loading a checkpoint fine-tuned on a specific subject, our latent shift module can control the camera motion of the generated videos while depicting that subject.
python -m scripts.inference \
--script_path data/scripts/customized_camera/run_grass.json \
--model_name lavie \
--customize_ckpt_path checkpoints/customized/lavie/plushie_panda.pth \
--class_name "plushie panda" \
--num_samples 1 \
--start_shift_step 10 \
--max_shift_steps 10
Subject motion control can also be accomplished with minimal additional computation and time.
python -m scripts.inference \
--script_path data/scripts/customized_subject/jump_stairs.json \
--model_name zeroscope \
--customize_ckpt_path checkpoints/customized/zeroscope/pet_cat1.pth \
--class_name cat \
--num_samples 1 \
--edit_scale 7.5 \
--max_amp_steps 5
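Under the hood, subject motion is steered by manipulating cross-attention maps so that the subject token attends more strongly inside the scripted bounding box; `--edit_scale` sets the strength of the edit and `--max_amp_steps` the number of denoising steps it is applied for. The toy sketch below shows the general idea only and is not the repository's implementation.

```python
import torch

def amplify_subject_attention(attn: torch.Tensor, bbox_mask: torch.Tensor,
                              step: int, edit_scale: float = 7.5,
                              max_amp_steps: int = 5) -> torch.Tensor:
    """Toy illustration of cross-attention editing for subject motion.

    attn: (heads, height, width) cross-attention map of the subject token.
    bbox_mask: (height, width) float mask that is 1 inside the target box.
    """
    if step < max_amp_steps:
        # Boost attention inside the box, nudging the subject toward the
        # scripted location while leaving the rest of the map untouched.
        attn = attn * (1.0 + edit_scale * bbox_mask)
    return attn
```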
MotionBooth can also control camera and subject motion simultaneously:
python -m scripts.inference \
--script_path data/scripts/customized_both/swim_coral.json \
--model_name lavie \
--customize_ckpt_path checkpoints/customized/lavie/plushie_happysad.pth \
--class_name "plushie happysad" \
--num_samples 1 \
--edit_scale 10.0 \
--max_amp_steps 15 \
--start_shift_step 10 \
--max_shift_steps 10 \
--base_seed 5
Note: a GPU with 80 GB of memory is needed for training on 24-frame video data!
Before training MotionBooth, please download video-text pair data from Panda-70M. First, download panda70m_training_2m.csv from the official Panda-70M release and place it at `data/panda/panda70m_training_2m.csv`.
To download random videos from the training set, we provide an easy-to-use script that downloads and organizes the videos from YouTube.
python -m scripts.download_dataset.panda70m
After downloading, your `data` folder should look like this:
data
|- MotionBooth
|- scripts
|- panda
   |- random_500
      |- video1
         |- frame1
         |- frame2
         |- ...
      |- video2
         |- frame1
         |- frame2
         |- ...
      |- ...
   |- captions_random.json
   |- panda70m_training_2m.csv
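As a quick check that the download finished, the sketch below iterates the frame folders and looks up each video's caption; the exact schema of captions_random.json is an assumption here, so adjust the lookup if needed.

```python
import json
from pathlib import Path

# Assumption: captions_random.json maps a video folder name to its caption.
panda_root = Path("data/panda/random_500")
with open("data/panda/captions_random.json") as f:
    captions = json.load(f)

for video_dir in sorted(p for p in panda_root.iterdir() if p.is_dir()):
    frames = sorted(video_dir.glob("*"))
    caption = captions.get(video_dir.name, "<missing>")
    print(f"{video_dir.name}: {len(frames)} frames, caption: {caption!r}")
```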
The training procedure is as simple as running `scripts/train.py`. This is an example of training LaVie on "dog3" from the MotionBooth dataset.
python -m scripts.train \
--config_path configs/lavie.yaml \
--obj_name dog3
Fine-tuning Zeroscope/LaVie for 300 steps takes less than 20 minutes.
After training is completed, you can move the saved checkpoints from the `logs` folder to `checkpoints/customized/` and run the inference!
Of course, you can also prepare your own subject by saving the images and masks in the same format as the MotionBooth dataset.
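As a starting point, the sketch below saves one image together with a binary subject mask under the same folder structure as the MotionBooth dataset; the file names, extensions, and the source of the mask are placeholders you will need to replace.

```python
from pathlib import Path

import numpy as np
from PIL import Image

# Placeholder names: swap in your own subject name, photos, and masks.
subject = "my_toy"
img_dir = Path("data/MotionBooth/images") / subject
msk_dir = Path("data/MotionBooth/masks") / subject
img_dir.mkdir(parents=True, exist_ok=True)
msk_dir.mkdir(parents=True, exist_ok=True)

image = Image.open("my_photo.jpg").convert("RGB")
# Replace this with a real subject mask (1 = subject, 0 = background).
mask = np.zeros((image.height, image.width), dtype=np.uint8)

image.save(img_dir / "00.jpg")
Image.fromarray(mask * 255).save(msk_dir / "00.png")
```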
Our framework is the first capable of generating diverse videos from any combination of customized subjects, subject motions, and camera movements. However, due to the variability of the generative video prior, success is not guaranteed on every sample. Be patient and generate more samples with different random seeds for better results. 🤗
@article{wu2024motionbooth,
title={MotionBooth: Motion-Aware Customized Text-to-Video Generation},
author={Jianzong Wu and Xiangtai Li and Yanhong Zeng and Jiangning Zhang and Qianyu Zhou and Yining Li and Yunhai Tong and Kai Chen},
journal={NeurIPS},
year={2024},
}