Official codes of VEnhancer: Generative Space-Time Enhancement for Video Generation

Vchitect/VEnhancer

VEnhancer: Generative Space-Time Enhancement
for Video Generation

Jingwen He, Tianfan Xue, Dongyang Liu, Xinqi Lin, Peng Gao, Dahua Lin, Yu Qiao, Wanli Ouyang, Ziwei Liu
The Chinese University of Hong Kong, Shanghai Artificial Intelligence Laboratory, 
S-Lab, Nanyang Technological University 

VEnhancer is a generative space-time enhancement framework that can improve existing text-to-video (T2V) results.

AIGC video +VEnhancer

📖 For more visual results, check out our project page.


🔥 Update

  • [2024.09.10] 😸 Support multi-GPU inference and tiled VAE for temporal VAE decoding, with more stable performance for long video enhancement.
  • [2024.08.18] 😸 Support enhancement of arbitrarily long videos (by splitting the videos into multiple chunks with overlaps); faster sampling with only 15 steps without obvious quality loss (by setting --solver_mode 'fast' in the script command); use temporal VAE to reduce video flickering.
  • [2024.07.28] 🔥 Inference code and pretrained video enhancement model are released.
  • [2024.07.10] 🤗 This repo is created.
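
The chunk-with-overlap idea from the 2024.08.18 entry can be illustrated with a minimal sketch. The function name and the `chunk_size` and `overlap` parameters are hypothetical and not taken from this repository; the actual code may split and blend chunks differently.

```python
def split_into_chunks(num_frames: int, chunk_size: int, overlap: int) -> list[tuple[int, int]]:
    """Return half-open (start, end) frame ranges that cover [0, num_frames).

    Consecutive ranges share `overlap` frames, except that the last range
    may be shorter than `chunk_size`. All names here are illustrative.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    if num_frames <= 0:
        return []

    stride = chunk_size - overlap  # how far each new chunk advances
    chunks = []
    start = 0
    while True:
        end = min(start + chunk_size, num_frames)
        chunks.append((start, end))
        if end >= num_frames:
            break
        start += stride
    return chunks


if __name__ == "__main__":
    # 10 frames, chunks of 4 frames, 1 shared frame between neighbours
    print(split_into_chunks(10, 4, 1))
```

Each chunk can then be enhanced independently, and the shared frames give neighbouring chunks a common region to reconcile.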

Open Source Plan

  • Release code of Multiple GPU Inference.
  • Release code of tiled VAE.
  • Release model that is optimized for better identity preservation.

⭐⭐⭐ Star us ⭐⭐⭐! And we will speed up the open-sourcing process ❤️.

🔥🔥 News

  • [2024.09.02] We have enhanced T2V results from Open-Sora 🤗.

Prompt: a close-up shot of a woman standing in a dimly lit room. she is wearing a traditional chinese outfit, which includes a red and gold dress with intricate designs and a matching headpiece.

profile.mp4
  • [2024.08.23] We have enhanced T2V results from keling 🤗.

Prompt: A little brick man visiting an art gallery.

brickman_art_gallery.mp4
A.little.brick.man.visiting.an.art.gallery.mp4
  • [2024.08.19] We have enhanced some T2V results from CogVideoX 🤗.

Prompt: A detailed wooden toy ship with intricately carved masts and sails is seen gliding smoothly over a plush, blue carpet that mimics the waves of the sea.

boat_input.mp4
boat_up3.mp4

🎬 Overview

VEnhancer achieves spatial super-resolution, temporal super-resolution (i.e., frame interpolation), and video refinement in one model. It adapts flexibly to different upsampling factors (e.g., 1x~8x) for either spatial or temporal super-resolution. It also provides flexible control over the refinement strength to handle diverse video artifacts.

It follows ControlNet and copies the architectures and weights of the multi-frame encoder and middle block of a pretrained video diffusion model to build a trainable condition network. This video ControlNet accepts both low-resolution key frames and full frames of noisy latents as inputs. In addition to the timestep $t$ and prompt $c_{text}$, the noise level $\sigma$ of the noise augmentation and the downscaling factor $s$ serve as extra network conditioning through our proposed video-aware conditioning.

⚙️ Installation

# clone this repo
git clone https://github.com/Vchitect/VEnhancer.git
cd VEnhancer

# create environment
conda create -n venhancer python=3.10
conda activate venhancer
pip install torch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2
pip install -r requirements.txt

Note that the ffmpeg command must be available on your system. If you have sudo access, you can install it using the following command:

sudo apt-get update && sudo apt-get install ffmpeg libsm6 libxext6 -y
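
To confirm that ffmpeg is reachable before running inference, a quick check from Python may help. This is a minimal sketch using only the standard library; the helper name `ffmpeg_available` is hypothetical and not part of this repository.

```python
import shutil


def ffmpeg_available() -> bool:
    """Return True if an `ffmpeg` executable is found on PATH."""
    return shutil.which("ffmpeg") is not None


if __name__ == "__main__":
    if ffmpeg_available():
        print("ffmpeg found at:", shutil.which("ffmpeg"))
    else:
        print("ffmpeg not found on PATH; install it before running inference.")
```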

🧬 Pretrained Models

| Model Name | Description | HuggingFace | BaiduNetdisk |
| --- | --- | --- | --- |
| venhancer_paper.pth | video enhancement model, paper version, very creative. | download | download |

💫 Inference

  1. Download the VEnhancer model and put the checkpoint in the VEnhancer/ckpts directory. (This step is optional, as the download can be done automatically.)
  2. Run the following command:
  bash run_VEnhancer.sh

for single GPU inference, or

  bash run_VEnhancer_MultiGPU.sh

for multiple GPU inference.

In run_VEnhancer.sh or run_VEnhancer_MultiGPU.sh,

  • up_scale is the upsampling factor ($1\sim8$) for spatial super-resolution. $\times3,4$ are recommended. Note that the target resolution is capped at 2K.
  • target_fps is your expected target fps, and the default is 24.
  • noise_aug is the noise level ($0\sim300$) for noise augmentation. Higher noise corresponds to stronger refinement. $200\sim300$ are recommended.
  • For the prompt, you can use --filename_as_prompt to automatically use the filename as the prompt; write the prompt to a txt file and pass its location with --prompt_path [your_prompt_path]; or provide the prompt directly with --prompt [your_prompt].
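
The documented ranges and the three prompt options above can be sketched as follows. This is only an illustration: the function names are hypothetical, and the precedence among the prompt options (filename first, then file, then direct text) is an assumption rather than confirmed behavior of the scripts.

```python
from pathlib import Path
from typing import Optional


def check_settings(up_scale: float, noise_aug: int) -> None:
    """Raise ValueError if a value falls outside the ranges documented above."""
    if not 1 <= up_scale <= 8:
        raise ValueError("up_scale should be in [1, 8]")
    if not 0 <= noise_aug <= 300:
        raise ValueError("noise_aug should be in [0, 300]")


def resolve_prompt(
    video_path: str,
    filename_as_prompt: bool = False,
    prompt_path: Optional[str] = None,
    prompt: Optional[str] = None,
) -> str:
    """Pick a prompt from one of the three sources (precedence is assumed)."""
    if filename_as_prompt:
        # Use the file name without its directory or extension.
        return Path(video_path).stem
    if prompt_path is not None:
        # Read the prompt from a text file and drop surrounding whitespace.
        return Path(prompt_path).read_text(encoding="utf-8").strip()
    if prompt is not None:
        return prompt
    raise ValueError("no prompt source was given")


if __name__ == "__main__":
    check_settings(up_scale=4, noise_aug=250)  # within the recommended ranges
    print(resolve_prompt("videos/boat_input.mp4", filename_as_prompt=True))
```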

Gradio

The same functionality is also available as a Gradio demo:

python gradio_app.py

BibTeX

If you use our work in your research, please cite our publication:

@article{he2024venhancer,
  title={VEnhancer: Generative Space-Time Enhancement for Video Generation},
  author={He, Jingwen and Xue, Tianfan and Liu, Dongyang and Lin, Xinqi and Gao, Peng and Lin, Dahua and Qiao, Yu and Ouyang, Wanli and Liu, Ziwei},
  journal={arXiv preprint arXiv:2407.07667},
  year={2024}
}

🤗 Acknowledgements

Our codebase builds on modelscope. Thanks to the authors for sharing their awesome codebases!

📧 Contact

If you have any questions, please feel free to reach us at hejingwenhejingwen@outlook.com.