NOVA: Autoregressive Video Generation without Vector Quantization

arXiv (2412.14169) · T2I Demo · T2V Demo · Webpage

Haoge Deng1,4*, Ting Pan2,4*, Haiwen Diao3,4*, Zhengxiong Luo4*, Yufeng Cui4
Huchuan Lu3, Shiguang Shan2, Yonggang Qi1, Xinlong Wang4†

BUPT1, ICT-CAS2, DLUT3, BAAI4
* Equal Contribution, † Corresponding Author

We present NOVA (NOn-Quantized Video Autoregressive Model), a model that enables autoregressive image/video generation with high efficiency. NOVA reformulates the video generation problem as non-quantized autoregressive modeling of temporal frame-by-frame prediction and spatial set-by-set prediction. NOVA generalizes well and enables diverse zero-shot generation abilities in one unified model.
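To make this factorization concrete, here is a high-level pseudocode sketch of the two-level autoregression described above. Every name in it (generate_video, select_next_set, diffusion_head, and so on) is illustrative, not the repository's API; it only shows where continuous-latent prediction replaces a codebook lookup.

# Illustrative pseudocode only; all names below are hypothetical, not NOVA's API.
def generate_video(model, prompt_embedding, num_frames, sets_per_frame):
    frames = []
    for t in range(num_frames):                    # temporal: frame-by-frame AR
        frame_tokens = model.init_masked_frame()   # all spatial tokens start masked
        for s in range(sets_per_frame):            # spatial: set-by-set prediction
            context = model.encode(prompt_embedding, frames, frame_tokens)
            token_set = model.select_next_set(frame_tokens, s)
            # A diffusion head regresses continuous latents directly,
            # instead of picking indices from a vector-quantized codebook.
            frame_tokens[token_set] = model.diffusion_head.sample(context, token_set)
        frames.append(frame_tokens)
    return model.decode_to_pixels(frames)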

🚀News

✨Highlights

  • 🔥 Novel Approach: Non-quantized video autoregressive generation.
  • 🔥 State-of-the-art Performance: High efficiency with state-of-the-art text-to-image and text-to-video results.
  • 🔥 Unified Modeling: Multi-task capabilities in a single unified model.

🗄️Model Zoo

See the detailed descriptions in the Model Zoo.

Text to Image

| Model | Parameters | Resolution | Training Data | Weight | GenEval | DPGBench |
| --- | --- | --- | --- | --- | --- | --- |
| NOVA-0.6B | 0.6B | 512x512 | 16M | 🤗 HF link | 0.75 | 81.76 |
| NOVA-0.3B | 0.3B | 1024x1024 | 600M | 🤗 HF link | 0.67 | 80.60 |
| NOVA-0.6B | 0.6B | 1024x1024 | 600M | 🤗 HF link | 0.69 | 82.25 |
| NOVA-1.4B | 1.4B | 1024x1024 | 600M | 🤗 HF link | 0.71 | 83.01 |

Text to Video

| Model | Parameters | Resolution | Training Data | Weight | VBench |
| --- | --- | --- | --- | --- | --- |
| NOVA-0.6B | 0.6B | 33x768x480 | 20M | 🤗 HF link | 80.12 |

📖Table of Contents

1. Installation

1.1 From Source

Clone this repository to your local disk and install:

pip install diffusers transformers accelerate imageio[ffmpeg]
git clone https://github.com/baaivision/NOVA.git
cd NOVA && pip install .
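After installation, a quick import check confirms the package is visible (the diffnext module name is taken from the Quick Start below):

python -c "import diffnext; from diffnext.pipelines import NOVAPipeline; print('NOVA import OK')"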

1.2 From Git

You can also install from the remote repository if you have set up your GitHub SSH key:

pip install diffusers transformers accelerate imageio[ffmpeg]
pip install git+ssh://git@github.com/baaivision/NOVA.git
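If you have not set up an SSH key, installing over plain HTTPS should work just as well:

pip install git+https://github.com/baaivision/NOVA.git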

2. Quick Start

2.1 Text to Image

import torch
from diffnext.pipelines import NOVAPipeline

model_id = "BAAI/nova-d48w768-sdxl1024"
model_args = {"torch_dtype": torch.float16, "trust_remote_code": True}
pipe = NOVAPipeline.from_pretrained(model_id, **model_args)
pipe = pipe.to("cuda")

prompt = "a shiba inu wearing a beret and black turtleneck."
image = pipe(prompt).images[0]
image.save("shiba_inu.jpg")
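Sampling is stochastic by default. Diffusers-style pipelines usually accept a generator argument for reproducibility; assuming NOVAPipeline follows that convention, a fixed seed would look like this:

# Assumes a diffusers-style `generator` argument; check the pipeline
# signature if this raises a TypeError.
generator = torch.Generator(device="cuda").manual_seed(42)
image = pipe(prompt, generator=generator).images[0]
image.save("shiba_inu_seed42.jpg")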

2.2 Text to Video

import os  # needed only for the CPU offload routine shown below
import torch
from diffnext.pipelines import NOVAPipeline
from diffnext.utils import export_to_image, export_to_video

model_id = "BAAI/nova-d48w1024-osp480"
model_args = {"torch_dtype": torch.float16, "trust_remote_code": True}
pipe = NOVAPipeline.from_pretrained(model_id, **model_args)

# Standard device routine.
pipe = pipe.to("cuda")
# If you hit out-of-memory errors, use the CPU model offload routine
# and the expandable allocator instead:
# os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
# pipe.enable_model_cpu_offload()

# Text to Video
prompt = "Many spotted jellyfish pulsating under water."
video = pipe(prompt, max_latent_length=9).frames[0]
export_to_video(video, "jellyfish.mp4", fps=12)

# Increase AR and diffusion steps for better video quality.
video = pipe(
  prompt,
  max_latent_length=9,
  num_inference_steps=128,  # default: 64
  num_diffusion_steps=100,  # default: 25
).frames[0]
export_to_video(video, "jellyfish_v2.mp4", fps=12)

# You can also generate an image from text by sampling only the first frame.
prompt = "Many spotted jellyfish pulsating under water."
image = pipe(prompt, max_latent_length=1).frames[0, 0]
export_to_image(image, "jellyfish.jpg")
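The loaded pipeline can be reused across prompts, so rendering several clips in one session pays the model-loading cost only once (the second prompt below is illustrative):

# Reuse the already-loaded pipeline for multiple prompts.
prompts = [
    "Many spotted jellyfish pulsating under water.",
    "A timelapse of clouds drifting over a mountain ridge.",  # illustrative prompt
]
for i, p in enumerate(prompts):
    clip = pipe(p, max_latent_length=9).frames[0]
    export_to_video(clip, f"clip_{i}.mp4", fps=12)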

3. Gradio Demo

# For text-to-image demo
python scripts/app_nova_t2i.py --model "BAAI/nova-d48w1024-sdxl1024" --device 0

# For text-to-video demo
python scripts/app_nova_t2v.py --model "BAAI/nova-d48w1024-osp480" --device 0

4. Train

5. Inference

6. Evaluation

📋Todo List

  • Model zoo
  • Quick Start
  • Gradio Demo
  • Inference guide
  • Finetuning code
  • Training code
  • Evaluation code
  • Prompt Writer
  • Larger model size
  • Additional downstream tasks: Image editing, Video editing, Controllable generation

Citation

If you find this repository useful, please consider giving it a star ⭐ and a citation 🦖:

@article{deng2024nova,
  title={Autoregressive Video Generation without Vector Quantization},
  author={Deng, Haoge and Pan, Ting and Diao, Haiwen and Luo, Zhengxiong and Cui, Yufeng and Lu, Huchuan and Shan, Shiguang and Qi, Yonggang and Wang, Xinlong},
  journal={arXiv preprint arXiv:2412.14169},
  year={2024}
}

Acknowledgement

We thank the authors of the following repositories: MAE, MAR, MaskGIT, DiT, Open-Sora-Plan, CogVideo, and CodeWithGPU.

License

Code and models are licensed under Apache License 2.0.
