Jianzong Wu · Xiangtai Li · Yanhong Zeng · Jiangning Zhang · Qianyu Zhou · Yining Li · Yunhai Tong · Kai Chen
Customization and subject motion control
Hybrid control over customization, subject motion, and camera motion
- [2024-6-28] Inference code, training code, and checkpoints are released!
In this work, we present MotionBooth, an innovative framework for animating customized subjects with precise control over both object and camera movements. By leveraging a few images of a specific object, we efficiently fine-tune a text-to-video model to capture the object's shape and attributes accurately. Our approach introduces a subject region loss and a video preservation loss to enhance the subject's learning performance, along with a subject token cross-attention loss to integrate the customized subject with motion control signals. Additionally, we propose training-free techniques for managing subject and camera motions during inference. In particular, we utilize cross-attention map manipulation to govern subject motion and introduce a novel latent shift module for camera movement control. MotionBooth excels in preserving the appearance of subjects while simultaneously controlling the motions in generated videos. Extensive quantitative and qualitative evaluations demonstrate the superiority and effectiveness of our method. Models and code will be made publicly available.
- In this repo, we use Python 3.11 and PyTorch 2.1.2. Newer versions of Python and PyTorch may also be compatible.
# Create a new environment with Conda
conda create -n motionbooth python=3.11
conda activate motionbooth
# Install PyTorch
conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia
- We strongly recommend using Diffusers as the codebase for training and inference of diffusion-based models. Diffusers provides easy-to-use and interoperable implementations of diffusion-based generative models.
- In this repo, we use diffusers 0.29.0. If you use a newer or older version, you may need to adjust some import paths manually. Please refer to the Diffusers documentation for details.
# Install diffusers, transformers, and accelerate
conda install -c conda-forge diffusers==0.29.0 transformers accelerate
# Install xformers for PyTorch 2.1.2
pip install xformers==0.0.23.post1
# Install other dependencies
pip install -r requirements.txt
We collect 26 objects from DreamBooth and CustomDiffusion for the experiments in the paper. These objects include pets, plushies, toys, cartoons, and vehicles. We also annotate masks for each image. We name this collection the MotionBooth dataset. Please download our dataset from Hugging Face.
Note that a few images from the original datasets are removed because of the low quality of their masks. Additionally, a few images are resized and cropped to square shapes.
After downloading, please unzip and place the dataset under the `data` folder. It should look like this:
data
|- MotionBooth
   |- images
      |- cat2
      |- ...
   |- masks
      |- cat2
      |- ...
|- scripts
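If you want to sanity-check the layout programmatically, the short sketch below lists the image/mask pairs for one subject. It only assumes the `images/<subject>` and `masks/<subject>` structure shown above; the exact file names and extensions are not guaranteed.

```python
from pathlib import Path

# Assumed layout: data/MotionBooth/images/<subject>/ and data/MotionBooth/masks/<subject>/
# File names and extensions are illustrative; adjust them to the released dataset.
root = Path("data/MotionBooth")
subject = "cat2"

images = sorted((root / "images" / subject).glob("*"))
masks = sorted((root / "masks" / subject).glob("*"))

for img, msk in zip(images, masks):
    print(f"image: {img.name:<20} mask: {msk.name}")
```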
We use Zeroscope and LaVie-base as the base T2V models. Please download Zeroscope from its official Hugging Face page. For LaVie, we provide a script to convert the original checkpoint into a format compatible with Diffusers. Please download the LaVie-base model and the Stable-Diffusion-v1.4 checkpoint.
Then, organize the pre-trained models in the `checkpoints` folder:
checkpoints
|- zeroscope_v2_576w
|- stable-diffusion-v1-4
|- lavie_base.pt
Then, run the following command to convert the checkpoint:
python -m scripts.convert_ckpts.convert_lavie_to_diffusers
Next, rename the `stable-diffusion-v1-4` folder to `lavie`. Additionally, replace the config files with LaVie's configs, following the checkpoint guide.
The final checkpoint folder looks like this:
checkpoints
|- zeroscope_v2_576w
|- lavie
|- lavie_base.pt (Not used anymore)
We use the converted lavie model for all the experiments.
For quick inference and for reproducing the examples in the paper, please download our trained customized checkpoints for the target subjects from Hugging Face. The names of the checkpoints correspond to the subject names in the MotionBooth dataset.
Please place the checkpoints in the `checkpoints` folder like this:
checkpoints
|- customized
   |- zeroscope
      |- ...
   |- lavie
      |- ...
|- zeroscope_v2_576w
|- lavie
|- lavie_base.pt (Not used anymore)
We use simple script files to specify the subject and camera motion. We provide several examples in `data/scripts`. In these script files, the "bbox" field controls the bounding box sequence for the subject's motion, while the "camera speed" field controls the corresponding camera motion speed.
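For illustration only, a motion script roughly pairs a per-frame bounding box sequence with a camera speed, as in the hypothetical sketch below; the exact key names, coordinate convention, and value ranges are assumptions, so please copy one of the released files in `data/scripts` and edit it for real runs.

```python
import json

# Hypothetical example of a motion script; field names and value formats are
# assumptions based on the description above, not the repository's exact schema.
script = {
    "prompt": "a plushie panda running on the grass",
    # One [x1, y1, x2, y2] box (normalized to [0, 1]) per frame for the subject.
    "bbox": [
        [0.10, 0.40, 0.30, 0.80],
        [0.20, 0.40, 0.40, 0.80],
        [0.30, 0.40, 0.50, 0.80],
    ],
    # Camera motion speed; the released scripts expose this as "camera speed".
    "camera speed": [0.5, 0.0],
}

with open("my_script.json", "w") as f:
    json.dump(script, f, indent=2)
```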
We provide the inference script in `scripts/inference.py` for all types of MotionBooth applications. It uses Accelerate's PartialState to support multi-GPU inference.
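Conceptually, PartialState simply splits the requested samples across the available GPUs; the snippet below is a simplified sketch of that pattern, not the repository's actual code.

```python
from accelerate import PartialState

# Simplified sketch of multi-GPU sample splitting with Accelerate's PartialState;
# the real logic lives in scripts/inference.py.
state = PartialState()
sample_ids = list(range(8))  # e.g. --num_samples 8

with state.split_between_processes(sample_ids) as local_ids:
    for sample_id in local_ids:
        # Each process generates only its share of the samples.
        print(f"rank {state.process_index} handles sample {sample_id}")
```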
The latent shift module proposed in the paper can freely control camera motion, with or without a customized model. We provide scripts in `data/scripts/camera` to control camera motion in vanilla text-to-video pipelines.
python -m scripts.inference \
--script_path data/scripts/camera/waterfall.json \
--model_name lavie \
--num_samples 1 \
--start_shift_step 10 \
--max_shift_steps 10
You can check the meaning of each parameter at the bottom of the script file.
For multi GPU inference, please run commands like this:
accelerate launch \
--multi_gpu \
-m scripts.inference \
--script_path data/scripts/camera/waterfall.json \
--model_name lavie \
--num_samples 8 \
--start_shift_step 10 \
--max_shift_steps 10
Feel free to try other scripts in `data/scripts/camera` and your own text prompts or camera speeds!
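If you are curious how the latent shift module behaves, the toy sketch below conveys the idea under simplifying assumptions: during a window of denoising steps (controlled by `--start_shift_step` and `--max_shift_steps`), the noisy video latents are translated along the panning direction. This is only an illustration, not the authors' exact implementation.

```python
import torch

def shift_latents(latents: torch.Tensor, step: int,
                  start_shift_step: int = 10, max_shift_steps: int = 10,
                  dx: int = 1, dy: int = 0) -> torch.Tensor:
    """Toy illustration of latent shifting for camera motion control.

    latents: (batch, channels, frames, height, width) video latents.
    The shift is only applied inside the window
    [start_shift_step, start_shift_step + max_shift_steps); dx/dy play the
    role of the camera speed from the motion script.
    """
    if start_shift_step <= step < start_shift_step + max_shift_steps:
        # torch.roll translates the spatial grid, approximating a camera pan.
        latents = torch.roll(latents, shifts=(dy, dx), dims=(-2, -1))
    return latents
```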
By loading a checkpoint fine-tuned on a specific subject, our latent shift module can control the camera motion of the generated videos while depicting that subject.
python -m scripts.inference \
--script_path data/scripts/customized_camera/run_grass.json \
--model_name lavie \
--customize_ckpt_path checkpoints/customized/lavie/plushie_panda.pth \
--class_name "plushie panda" \
--num_samples 1 \
--start_shift_step 10 \
--max_shift_steps 10
Subject motion control can also be accomplished with minimal additional computation and time.
python -m scripts.inference \
--script_path data/scripts/customized_subject/jump_stairs.json \
--model_name zeroscope \
--customize_ckpt_path checkpoints/customized/zeroscope/pet_cat1.pth \
--class_name cat \
--num_samples 1 \
--edit_scale 7.5 \
--max_amp_steps 5
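Under the hood, subject motion is steered by manipulating cross-attention maps so that the subject token attends more strongly inside the scripted bounding box; `--edit_scale` sets the strength of the edit and `--max_amp_steps` the number of denoising steps it is applied for. The toy sketch below shows the general idea only and is not the repository's implementation.

```python
import torch

def amplify_subject_attention(attn: torch.Tensor, bbox_mask: torch.Tensor,
                              step: int, edit_scale: float = 7.5,
                              max_amp_steps: int = 5) -> torch.Tensor:
    """Toy illustration of cross-attention editing for subject motion.

    attn: (heads, height, width) cross-attention map of the subject token.
    bbox_mask: (height, width) float mask that is 1 inside the target box.
    """
    if step < max_amp_steps:
        # Boost attention inside the box, nudging the subject toward the
        # scripted location while leaving the rest of the map untouched.
        attn = attn * (1.0 + edit_scale * bbox_mask)
    return attn
```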
MotionBooth can also control camera and subject motion simultaneously:
python -m scripts.inference \
--script_path data/scripts/customized_both/swim_coral.json \
--model_name lavie \
--customize_ckpt_path checkpoints/customized/lavie/plushie_happysad.pth \
--class_name "plushie happysad" \
--num_samples 1 \
--edit_scale 10.0 \
--max_amp_steps 15 \
--start_shift_step 10 \
--max_shift_steps 10 \
--base_seed 5
Note: a GPU with 80 GB of memory is needed for training on 24-frame video data!
Before training MotionBooth, please download video-text pair data from Panda-70M. First, download panda70m_training_2m.csv from the official Panda-70M release and place it at `data/panda/panda70m_training_2m.csv`.
To download random videos from the training set, we provide an easy-to-use script that downloads and organizes the videos from YouTube.
python -m scripts.download_dataset.panda70m
After downloading, your `data` folder should look like this:
data
|- MotionBooth
|- scripts
|- panda
   |- random_500
      |- video1
         |- frame1
         |- frame2
         |- ...
      |- video2
         |- frame1
         |- frame2
         |- ...
      |- ...
   |- captions_random.json
   |- panda70m_training_2m.csv
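As a quick check that the download finished, the sketch below iterates the frame folders and looks up each video's caption; the exact schema of captions_random.json is an assumption here, so adjust the lookup if needed.

```python
import json
from pathlib import Path

# Assumption: captions_random.json maps a video folder name to its caption.
panda_root = Path("data/panda/random_500")
with open("data/panda/captions_random.json") as f:
    captions = json.load(f)

for video_dir in sorted(p for p in panda_root.iterdir() if p.is_dir()):
    frames = sorted(video_dir.glob("*"))
    caption = captions.get(video_dir.name, "<missing>")
    print(f"{video_dir.name}: {len(frames)} frames, caption: {caption!r}")
```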
The training procedure is as simple as running `scripts/train.py`. This is an example of training LaVie on "dog3" from the MotionBooth dataset.
python -m scripts.train \
--config_path configs/lavie.yaml \
--obj_name dog3
Fine-tuning Zeroscope/LaVie for 300 steps takes less than 20 minutes.
After training is completed, you can move the saved checkpoints from the `logs` folder to `checkpoints/customized/` and run the inference!
Of course, you can also prepare your own subject by saving the images and masks in the same format as the MotionBooth dataset.
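As a starting point, the sketch below saves one image together with a binary subject mask under the same folder structure as the MotionBooth dataset; the file names, extensions, and the source of the mask are placeholders you will need to replace.

```python
from pathlib import Path

import numpy as np
from PIL import Image

# Placeholder names: swap in your own subject name, photos, and masks.
subject = "my_toy"
img_dir = Path("data/MotionBooth/images") / subject
msk_dir = Path("data/MotionBooth/masks") / subject
img_dir.mkdir(parents=True, exist_ok=True)
msk_dir.mkdir(parents=True, exist_ok=True)

image = Image.open("my_photo.jpg").convert("RGB")
# Replace this with a real subject mask (1 = subject, 0 = background).
mask = np.zeros((image.height, image.width), dtype=np.uint8)

image.save(img_dir / "00.jpg")
Image.fromarray(mask * 255).save(msk_dir / "00.png")
```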
Our framework is the first capable of generating diverse videos from any combination of customized subjects, subject motions, and camera movements. However, due to the variability of the generative video prior, success is not guaranteed on every sample. Be patient and generate more samples with different random seeds for better results. 🤗
@article{wu2024motionbooth,
title={MotionBooth: Motion-Aware Customized Text-to-Video Generation},
author={Jianzong Wu and Xiangtai Li and Yanhong Zeng and Jiangning Zhang and Qianyu Zhou and Yining Li and Yunhai Tong and Kai Chen},
journal={NeurIPS},
year={2024},
}