Haoge Deng1,4*, Ting Pan2,4*, Haiwen Diao3,4*, Zhengxiong Luo4*, Yufeng Cui4
Huchuan Lu3, Shiguang Shan2, Yonggang Qi1, Xinlong Wang4†
BUPT1, ICT-CAS2, DLUT3, BAAI4
* Equal Contribution, † Corresponding Author
We present NOVA (NOn-Quantized Video Autoregressive Model), which enables autoregressive image and video generation with high efficiency. NOVA reformulates video generation as non-quantized autoregressive modeling: temporal frame-by-frame prediction combined with spatial set-by-set prediction. NOVA generalizes well and supports diverse zero-shot generation abilities within a single unified model.
- [Dec 2024] Released 🤗 Online Demo (T2I, T2V).
- [Dec 2024] Released paper, weights, Quick Start guide, and local Gradio Demo code.
- 🔥 Novel Approach: Non-quantized video autoregressive generation.
- 🔥 State-of-the-art Performance: High efficiency with state-of-the-art text-to-image and text-to-video results.
- 🔥 Unified Modeling: Multi-task capabilities in a single unified model.
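As a rough illustration of the two prediction axes described above, here is a minimal conceptual sketch (not NOVA's actual implementation; every function below is a hypothetical stand-in): frames are generated autoregressively over time, and within each frame continuous token sets are predicted one set after another, without any discrete codebook.

```python
import torch

# Hypothetical stand-ins for the real modules (text encoder, AR transformer
# with a diffusion head, and VAE decoder); shapes are illustrative only.
def encode_text(prompt):
    return torch.randn(1, 16)

def predict_token_set(context, past_frames, past_sets):
    # Continuous (non-quantized) tokens: predicted directly, no codebook lookup.
    return torch.randn(1, 64, 16)

def decode_frame(token_sets):
    return torch.cat(token_sets, dim=1)

def generate_video(prompt, num_frames=3, num_sets=4):
    context, frames = encode_text(prompt), []
    for _ in range(num_frames):           # temporal: frame-by-frame prediction
        token_sets = []
        for _ in range(num_sets):         # spatial: set-by-set prediction
            token_sets.append(predict_token_set(context, frames, token_sets))
        frames.append(decode_frame(token_sets))
    return frames
```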
See the detailed descriptions in the Model Zoo below.
Text-to-Image models:

| Model | Parameters | Resolution | Training Data | Weight | GenEval | DPGBench |
|---|---|---|---|---|---|---|
| NOVA-0.6B | 0.6B | 512x512 | 16M | 🤗 HF link | 0.75 | 81.76 |
| NOVA-0.3B | 0.3B | 1024x1024 | 600M | 🤗 HF link | 0.67 | 80.60 |
| NOVA-0.6B | 0.6B | 1024x1024 | 600M | 🤗 HF link | 0.69 | 82.25 |
| NOVA-1.4B | 1.4B | 1024x1024 | 600M | 🤗 HF link | 0.71 | 83.01 |
Text-to-Video models:

| Model | Parameters | Resolution | Training Data | Weight | VBench |
|---|---|---|---|---|---|
| NOVA-0.6B | 0.6B | 33x768x480 | 20M | 🤗 HF link | 80.12 |
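If you prefer to fetch a checkpoint ahead of time, the short sketch below downloads one of the weights above with huggingface_hub; the repository id is the one used in Quick Start and is given only as an example, and it assumes NOVAPipeline.from_pretrained accepts a local path in the same way standard diffusers pipelines do.

```python
from huggingface_hub import snapshot_download

# Example repo id; see the Weight column above for the model you want.
local_dir = snapshot_download("BAAI/nova-d48w768-sdxl1024")
print(local_dir)  # this local path can then be passed to NOVAPipeline.from_pretrained
```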
Clone this repository to your local disk and install it:
pip install diffusers transformers accelerate imageio[ffmpeg]
git clone https://github.com/baaivision/NOVA.git
cd NOVA && pip install .
You can also install directly from the remote repository if you have set up your GitHub SSH key:
pip install diffusers transformers accelerate imageio[ffmpeg]
pip install git+ssh://git@github.com/baaivision/NOVA.git
import torch
from diffnext.pipelines import NOVAPipeline

# Load the NOVA text-to-image pipeline in fp16.
model_id = "BAAI/nova-d48w768-sdxl1024"
model_args = {"torch_dtype": torch.float16, "trust_remote_code": True}
pipe = NOVAPipeline.from_pretrained(model_id, **model_args)
pipe = pipe.to("cuda")

# Generate an image from a text prompt and save it.
prompt = "a shiba inu wearing a beret and black turtleneck."
image = pipe(prompt).images[0]
image.save("shiba_inu.jpg")
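The loaded pipeline can be reused across prompts without reloading the weights; the sketch below simply loops over a prompt list using the same call pattern as above (the second prompt is just an illustrative example).

```python
# Reuse the already-loaded pipeline for several prompts.
prompts = [
    "a shiba inu wearing a beret and black turtleneck.",
    "a watercolor painting of a lighthouse at dawn.",  # illustrative example prompt
]
for i, prompt in enumerate(prompts):
    pipe(prompt).images[0].save(f"sample_{i}.jpg")
```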
import os  # needed only for the optional allocator setting below
import torch
from diffnext.pipelines import NOVAPipeline
from diffnext.utils import export_to_image, export_to_video
model_id = "BAAI/nova-d48w1024-osp480"
model_args = {"torch_dtype": torch.float16, "trust_remote_code": True}
pipe = NOVAPipeline.from_pretrained(model_id, **model_args)
# Standard device routine.
pipe = pipe.to("cuda")
# Use CPU model offload routine and expandable allocator if OOM.
# os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
# pipe.enable_model_cpu_offload()
# Text to Video
prompt = "Many spotted jellyfish pulsating under water."
video = pipe(prompt, max_latent_length=9).frames[0]
export_to_video(video, "jellyfish.mp4", fps=12)
# Increase AR and diffusion steps for better video quality.
video = pipe(
prompt,
max_latent_length=9,
num_inference_steps=128, # default: 64
num_diffusion_steps=100, # default: 25
).frames[0]
export_to_video(video, "jellyfish_v2.mp4", fps=12)
# You can also generate a single image from text by predicting only the first frame.
prompt = "Many spotted jellyfish pulsating under water."
image = pipe(prompt, max_latent_length=1).frames[0, 0]
export_to_image(image, "jellyfish.jpg")
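If the standard device routine above runs out of GPU memory, the two commented hints can be combined as follows; this is a sketch that sets the allocator flag before importing torch and uses the CPU offload routine mentioned in the comments instead of pipe.to("cuda").

```python
import os

# Set the expandable allocator before torch allocates any CUDA memory.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch
from diffnext.pipelines import NOVAPipeline
from diffnext.utils import export_to_video

pipe = NOVAPipeline.from_pretrained(
    "BAAI/nova-d48w1024-osp480",
    torch_dtype=torch.float16,
    trust_remote_code=True,
)
# Keep weights on CPU and move submodules to GPU only when needed.
pipe.enable_model_cpu_offload()

video = pipe("Many spotted jellyfish pulsating under water.", max_latent_length=9).frames[0]
export_to_video(video, "jellyfish_offload.mp4", fps=12)
```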
# For text-to-image demo
python scripts/app_nova_t2i.py --model "BAAI/nova-d48w1024-sdxl1024" --device 0
# For text-to-video demo
python scripts/app_nova_t2v.py --model "BAAI/nova-d48w1024-osp480" --device 0
- See Training Guide
- See Inference Guide
- See Evaluation Guide
- Model zoo
- Quick Start
- Gradio Demo
- Inference guide
- Finetuning code
- Training code
- Evaluation code
- Prompt Writer
- Larger model size
- Additional downstream tasks: Image editing, Video editing, Controllable generation
If you find this repository useful, please consider giving it a star ⭐ and a citation 🦖:
@article{deng2024nova,
title={Autoregressive Video Generation without Vector Quantization},
author={Deng, Haoge and Pan, Ting and Diao, Haiwen and Luo, Zhengxiong and Cui, Yufeng and Lu, Huchuan and Shan, Shiguang and Qi, Yonggang and Wang, Xinlong},
journal={arXiv preprint arXiv:2412.14169},
year={2024}
}
We thank the following repositories for their open-source contributions: MAE, MAR, MaskGIT, DiT, Open-Sora-Plan, CogVideo, and CodeWithGPU.
Code and models are licensed under Apache License 2.0.