InstructVideo: Instructing Video Diffusion Models
with Human Feedback

Hangjie Yuan Shiwei Zhang Xiang Wang Yujie Wei Tao Feng
Yining Pan Yingya Zhang Ziwei Liu Samuel Albanie Dong Ni

Accepted to CVPR 2024 🥳

Abstract: Diffusion models have emerged as the de facto paradigm for video generation. However, their reliance on web-scale data of varied quality often yields results that are visually unappealing and misaligned with the textual prompts. To tackle this problem, we propose InstructVideo to instruct text-to-video diffusion models with human feedback by reward fine-tuning. InstructVideo has two key ingredients: 1) To ameliorate the cost of reward fine-tuning induced by generating through the full DDIM sampling chain, we recast reward fine-tuning as editing. By leveraging the diffusion process to corrupt a sampled video, InstructVideo requires only partial inference of the DDIM sampling chain, reducing fine-tuning cost while improving fine-tuning efficiency. 2) To mitigate the absence of a dedicated video reward model for human preferences, we repurpose established image reward models, e.g., HPSv2. To this end, we propose Segmental Video Reward, a mechanism to provide reward signals based on segmental sparse sampling, and Temporally Attenuated Reward, a method that mitigates temporal modeling degradation during fine-tuning. Extensive experiments, both qualitative and quantitative, validate the practicality and efficacy of using image reward models in InstructVideo, significantly enhancing the visual quality of generated videos without compromising generalization capabilities. Code and models will be made publicly available at this repo.

Todo list

Note that if you cannot access the links provided below, try another browser, contact me by e-mail, or raise an issue. Feel free to reach out (hj.yuan@zju.edu.cn) if you have questions.

🎉 Release code for fine-tuning and inference.
🎉 Release pre-training and fine-tuning data list (should be obtained from WebVid10M).
🎉 Release pre-training and fine-tuned checkpoints.

InstructVideo

The code can be found on the VGen GitHub page.

Dataset preparation and environment configuration

Training InstructVideo requires video-text pairs, which saves computational cost during reward fine-tuning. In the paper, we use a small set of WebVid videos to fine-tune our base model. The file list is located at:

data/instructvideo/webvid_simple_animals_2_selected_20_train_file_list/00000.txt

To compose the training data, filter the videos from your own copy of the WebVid dataset. Alternatively, use your own video-text pairs. (I tested InstructVideo on WebVid data and some proprietary data; both worked.)
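As a rough illustration of the filtering step, the sketch below keeps video-text pairs whose caption mentions a keyword and renders them as file-list lines. Everything in it is an assumption made for illustration: the keyword list, the function names `filter_pairs` and `format_file_list`, and the `|||` separator are hypothetical, so check the provided `00000.txt` for the layout the training code actually expects.

```python
import re
from typing import Iterable, List, Optional, Sequence, Tuple

# Hypothetical keywords; the selection criteria used in the paper are not stated here.
ANIMAL_KEYWORDS = ("dog", "cat", "horse")


def filter_pairs(
    pairs: Iterable[Tuple[str, str]],
    keywords: Sequence[str] = ANIMAL_KEYWORDS,
    limit: Optional[int] = None,
) -> List[Tuple[str, str]]:
    """Keep (video_path, caption) pairs whose caption contains a keyword as a whole word."""
    wanted = {k.lower() for k in keywords}
    selected = []
    for video_path, caption in pairs:
        # Tokenize on letters only so punctuation ("dog.") does not block a match.
        words = set(re.findall(r"[a-z]+", caption.lower()))
        if words & wanted:
            selected.append((video_path, caption))
            if limit is not None and len(selected) >= limit:
                break
    return selected


def format_file_list(pairs: Iterable[Tuple[str, str]], sep: str = "|||") -> str:
    """Render one 'video_path<sep>caption' line per pair (separator is an assumption)."""
    return "".join(f"{video_path}{sep}{caption}\n" for video_path, caption in pairs)
```

Matching is on whole words, so "cats" does not match the keyword "cat"; add plural forms to the keyword list if you want them included.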

Concerning the environment configuration, you should follow the instructions for VGen installation.

Pre-trained weights preparation

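# Notebook cell syntax; from a shell, run `pip install modelscope` instead.
!pip install modelscope
from modelscope.hub.snapshot_download import snapshot_download
# Download the InstructVideo files into the local models/ cache directory.
model_dir = snapshot_download('iic/InstructVideo', cache_dir='models/')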

You need to move the checkpoints to the "models/" directory:

mv ./models/iic/InstructVideo/* ./models/

Note that models/model_scope_v1-4_0600000.pth is the pre-trained base model used in the paper. The fine-tuned model is placed under the folder models/instructvideo-finetuned.

You can access the provided files on the InstructVideo ModelScope page.
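After moving the files, a quick check that the two paths named above exist can save a failed run later. The following is a minimal sketch; the helper name `missing_checkpoints` is hypothetical and not part of VGen, and it only tests for existence, not file integrity.

```python
from pathlib import Path
from typing import List

# Paths named in this README, relative to the models/ directory.
EXPECTED = (
    "model_scope_v1-4_0600000.pth",  # pre-trained base model used in the paper
    "instructvideo-finetuned",       # folder holding the fine-tuned model
)


def missing_checkpoints(models_dir: str = "models") -> List[str]:
    """Return the expected entries that are not present under models_dir."""
    root = Path(models_dir)
    return [name for name in EXPECTED if not (root / name).exists()]
```

Calling `missing_checkpoints()` from the repository root returns an empty list when both entries are in place.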

The inference of InstructVideo

You can leverage the provided fine-tuned checkpoints to generate videos by running the command:

bash configs/instructvideo/eval_generate_videos.sh

This command uses yaml files under configs/instructvideo/eval, containing caption file paths for generating videos of in-domain animals, new animals and non-animals. Feel free to switch among them or replace them with your own captions. Although we fine-tuned using 20-step DDIM, you can still use 50-step DDIM generation.

The reward fine-tuning of InstructVideo

You can perform InstructVideo reward fine-tuning by running the command:

bash configs/instructvideo/train.sh

Since performing reward fine-tuning can lead to over-optimization, I strongly recommend checking the generation performance on some evaluation captions regularly (like the captions indicated in configs/instructvideo/eval).
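For intuition about how an image reward model can score a video during this stage, the abstract describes Segmental Video Reward as relying on segmental sparse sampling. The sketch below shows only that general sampling idea: split the frames into contiguous segments, pick one frame per segment, and score the picked frames. It is not the repository's implementation. The function names, the plain averaging of scores, and the `frame_reward_fn` stand-in (in place of a real model such as HPSv2) are all assumptions, and Temporally Attenuated Reward is not modeled here.

```python
import random
from typing import Callable, List, Optional, Sequence, TypeVar

Frame = TypeVar("Frame")


def segmental_sparse_indices(
    num_frames: int, num_segments: int, rng: Optional[random.Random] = None
) -> List[int]:
    """Split [0, num_frames) into contiguous near-equal segments and pick one index from each."""
    if not 1 <= num_segments <= num_frames:
        raise ValueError("need 1 <= num_segments <= num_frames")
    if rng is None:
        rng = random.Random()
    # Segment i covers [bounds[i], bounds[i + 1]); each is non-empty because
    # num_segments <= num_frames.
    bounds = [i * num_frames // num_segments for i in range(num_segments + 1)]
    return [rng.randrange(bounds[i], bounds[i + 1]) for i in range(num_segments)]


def segmental_reward(
    frame_reward_fn: Callable[[Frame], float],
    frames: Sequence[Frame],
    num_segments: int,
    rng: Optional[random.Random] = None,
) -> float:
    """Average a per-frame reward over one sampled frame per segment (averaging is an assumption)."""
    indices = segmental_sparse_indices(len(frames), num_segments, rng)
    scores = [frame_reward_fn(frames[i]) for i in indices]
    return sum(scores) / len(scores)
```

Scoring only one frame per segment keeps the number of reward-model calls fixed at `num_segments`, regardless of how many frames the video has.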

Citation

@article{2023InstructVideo,
    title={InstructVideo: Instructing Video Diffusion Models with Human Feedback},
    author={Yuan, Hangjie and Zhang, Shiwei and Wang, Xiang and Wei, Yujie and Feng, Tao and Pan, Yining and Zhang, Yingya and Liu, Ziwei and Albanie, Samuel and Ni, Dong},
    journal={arXiv preprint arXiv:2312.12490},
    year={2023}
}

@article{wang2023modelscope,
  title={Modelscope text-to-video technical report},
  author={Wang, Jiuniu and Yuan, Hangjie and Chen, Dayou and Zhang, Yingya and Wang, Xiang and Zhang, Shiwei},
  journal={arXiv preprint arXiv:2308.06571},
  year={2023}
}
