Allegro is a powerful text-to-video model that generates high-quality videos up to 6 seconds at 15 FPS and 720p resolution from simple text input.

rhymes-ai/Allegro

Gallery · Hugging Face · Blog · Paper · Discord

Allegro is a powerful text-to-video model that generates high-quality videos up to 6 seconds at 15 FPS and 720p resolution from simple text input. Allegro-TI2V, a variant of Allegro, extends this functionality by generating similar high-quality videos using text inputs along with first-frame and optionally last-frame image inputs.

News 🔥

  • [2024/12/26] 🚀 We release the low-resolution (40x360P) and fewer-frame (40x720P) models of Allegro for research purposes!

  • [2024/12/10] 🚀 We release the training code for further training / fine-tuning!

  • [2024/11/25] 🚀 Allegro-TI2V is open sourced!

  • [2024/10/30] 🚀 We release multi-card inference code and PAB in Allegro-VideoSys. With the VideoSys framework, the inference time can be further reduced to 3 mins (8xH100) and 2 mins (8xH100+PAB). We also opened a PR to the original VideoSys repo.

  • [2024/10/29] 🎉 Allegro has been merged into diffusers! It is currently supported in 0.32.0-dev0 and will be included in the next release. For now, please use pip install git+https://github.com/huggingface/diffusers.git to install the diffusers dev version. See Hugging Face for more details.

  • [2024/10/22] 🚀 Allegro is open sourced!

Model Info

| Model | Allegro | Allegro-TI2V |
| --- | --- | --- |
| Description | Text-to-Video Generation Model | Text-Image-to-Video Generation Model |
| Download | Hugging Face (88x720P), Hugging Face (40x720P), Hugging Face (40x360P) | Hugging Face (88x720P) |

The remaining specifications are listed once for both models:

| Item | Value |
| --- | --- |
| Parameter | VAE: 175M, DiT: 2.8B |
| Inference Precision | VAE: FP32/TF32/BF16/FP16 (best in FP32/TF32); DiT/T5: BF16/FP32/TF32 |
| Context Length | 79.2K |
| Resolution | 720 x 1280 |
| Frames | 88 |
| Video Length | 6 seconds @ 15 FPS |
| Single GPU Memory Usage | 9.3G BF16 (with cpu_offload) |
| Inference Time | 20 mins (single H100) / 3 mins (8xH100) |
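
As a rough sanity check on the figures above, the sketch below recomputes the clip duration from the frame count and frame rate, and shows one way the 79.2K context length could arise. The 4x temporal and 8x spatial VAE compression and the 2x2 spatial patch size are assumptions that are not stated in this README; they are used here only because they reproduce the listed value. The function names are hypothetical.

```python
def video_duration_seconds(num_frames: int, fps: float) -> float:
    """Duration of a clip with `num_frames` frames played at `fps`."""
    return num_frames / fps


def context_length(num_frames: int, height: int, width: int,
                   temporal_down: int = 4,   # ASSUMPTION: VAE temporal compression
                   spatial_down: int = 8,    # ASSUMPTION: VAE spatial compression
                   patch: int = 2) -> int:   # ASSUMPTION: DiT spatial patch size
    """Number of tokens the DiT would attend over under the assumed factors."""
    latent_frames = num_frames // temporal_down
    tokens_h = height // (spatial_down * patch)
    tokens_w = width // (spatial_down * patch)
    return latent_frames * tokens_h * tokens_w


if __name__ == "__main__":
    # 88 frames at 15 FPS is just under the advertised 6 seconds.
    print(round(video_duration_seconds(88, 15), 2))
    # 22 latent frames x 45 x 80 spatial tokens under the assumed factors.
    print(context_length(88, 720, 1280))
```

Under these assumptions the 88-frame, 720 x 1280 setting gives 22 x 45 x 80 tokens, which matches the 79.2K context length in the table.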

Quick Start

Single Inference

Allegro

  1. Download the Allegro GitHub code.

  2. Install the necessary requirements.

    • Ensure Python >= 3.10, PyTorch >= 2.4, CUDA >= 12.4. For details, see requirements.txt.

    • It is recommended to use Anaconda to create a new environment (Python >= 3.10) to run the following example.

  3. Download the Allegro model weights.

  4. Run inference.

    python single_inference.py \
    --user_prompt 'A seaside harbor with bright sunlight and sparkling seawater, with many boats in the water. From an aerial view, the boats vary in size and color, some moving and some stationary. Fishing boats in the water suggest that this location might be a popular spot for docking fishing boats.' \
    --save_path ./output_videos/test_video.mp4 \
    --vae your/path/to/vae \
    --dit your/path/to/transformer \
    --text_encoder your/path/to/text_encoder \
    --tokenizer your/path/to/tokenizer \
    --guidance_scale 7.5 \
    --num_sampling_steps 100 \
    --seed 42

    Use --enable_cpu_offload to offload the model to the CPU and reduce GPU memory usage (about 9.3G, compared to 27.5G without CPU offload). Note that inference time will increase significantly.

  5. (Optional) Interpolate the video to 30 FPS.

    It is recommended to use EMA-VFI to interpolate the video from 15 FPS to 30 FPS.

    For better visual quality, please use imageio to save the video.
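
The command in step 4 can also be assembled from Python, which is convenient when looping over many prompts. The following is a minimal sketch that only builds the argument list from the flags shown above; the helper name `build_allegro_command` is hypothetical and is not part of the repository.

```python
import shlex
import sys


def build_allegro_command(prompt, save_path, vae, dit, text_encoder, tokenizer,
                          guidance_scale=7.5, num_sampling_steps=100, seed=42,
                          enable_cpu_offload=False):
    """Return the argv list for single_inference.py using the README's flags."""
    cmd = [
        sys.executable, "single_inference.py",
        "--user_prompt", prompt,
        "--save_path", save_path,
        "--vae", vae,
        "--dit", dit,
        "--text_encoder", text_encoder,
        "--tokenizer", tokenizer,
        "--guidance_scale", str(guidance_scale),
        "--num_sampling_steps", str(num_sampling_steps),
        "--seed", str(seed),
    ]
    if enable_cpu_offload:
        # Lower GPU memory use at the cost of much slower inference.
        cmd.append("--enable_cpu_offload")
    return cmd


if __name__ == "__main__":
    cmd = build_allegro_command(
        "A seaside harbor with bright sunlight and sparkling seawater.",
        "./output_videos/test_video.mp4",
        "your/path/to/vae", "your/path/to/transformer",
        "your/path/to/text_encoder", "your/path/to/tokenizer",
    )
    # Print the shell-quoted command instead of running it.
    print(shlex.join(cmd))
```

To actually launch inference, the returned list can be passed to `subprocess.run(cmd, check=True)` from the repository root.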

Allegro TI2V

  1. Download the Allegro GitHub code.

  2. Install the necessary requirements.

    • Ensure Python >= 3.10, PyTorch >= 2.4, CUDA >= 12.4. For details, see requirements.txt.

    • It is recommended to use Anaconda to create a new environment (Python >= 3.10) to run the following example.

  3. Download the Allegro-TI2V model weights.

  4. Run inference.

    python single_inference_ti2v.py \
    --user_prompt 'The car drives along the road' \
    --first_frame your/path/to/first_frame_image.png \
    --vae your/path/to/vae \
    --dit your/path/to/transformer \
    --text_encoder your/path/to/text_encoder \
    --tokenizer your/path/to/tokenizer \
    --guidance_scale 8 \
    --num_sampling_steps 100 \
    --seed 1427329220

    The output video resolution is fixed at 720 × 1280. Input images with different resolutions will be automatically cropped and resized to fit.

| Argument | Description |
| --- | --- |
| --user_prompt | [Required] Text input for image-to-video generation. |
| --first_frame | [Required] First-frame image input for image-to-video generation. |
| --last_frame | [Optional] If provided, the model will generate intermediate video content based on the specified first and last frame images. |
| --enable_cpu_offload | [Optional] Offload the model to the CPU to reduce GPU memory usage (about 9.3G, compared to 27.5G without CPU offload). Inference time will increase significantly. |

  5. (Optional) Interpolate the video to 30 FPS.

    It is recommended to use EMA-VFI to interpolate the video from 15 FPS to 30 FPS.

    For better visual quality, please use imageio to save the video.
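
Step 4 notes that input images are cropped and resized to the fixed 720 × 1280 output. The README does not say how the crop region is chosen, so the sketch below assumes a center crop to the target aspect ratio followed by a resize; this is an illustration for preparing your own first-frame images, not the repository's implementation. The function name `center_crop_box` is hypothetical.

```python
def center_crop_box(width: int, height: int,
                    target_height: int = 720, target_width: int = 1280):
    """Return (left, top, right, bottom) of the largest centered region
    that has the target aspect ratio. ASSUMPTION: center crop."""
    # Compare aspect ratios with integer arithmetic to avoid float rounding.
    if width * target_height > height * target_width:
        # Image is too wide: keep the full height, trim the sides.
        new_w = height * target_width // target_height
        left = (width - new_w) // 2
        return (left, 0, left + new_w, height)
    # Image is too tall (or already matches): keep the full width.
    new_h = width * target_height // target_width
    top = (height - new_h) // 2
    return (0, top, width, top + new_h)


if __name__ == "__main__":
    print(center_crop_box(1920, 1080))  # already 16:9, nothing is trimmed
    print(center_crop_box(1000, 1000))  # square input, rows are trimmed
```

The resulting box can be handed to an image library's crop call before resizing the result to 1280 wide by 720 high.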

Multi-Card Inference

For both Allegro and Allegro-TI2V, multi-card inference code and PAB are available in Allegro-VideoSys.

Training / Fine-tuning

  1. Download the Allegro GitHub code and the Allegro model weights, and prepare the environment according to requirements.txt.

  2. Our training code loads the dataset from .parquet files. We recommend first constructing a .jsonl file to store all data cases in a list. Each case should be stored as a dict, like this:

    [
        {"path": "foo/bar.mp4", "num_frames": 123, "height": 1080, "width": 1920, "cap": "This is a fake caption."}
        ...
    ]

    After that, run dataset_utils.py to convert .jsonl into .parquet.

    The absolute path to each video is constructed by joining args.data_dir in train.py with the path value from the dataset. Therefore, you may define path as a relative path within your dataset and set args.data_dir to the root dir when running training.

  3. Run Training / Fine-tuning:

    export OMP_NUM_THREADS=1
    export MKL_NUM_THREADS=1
    
    export WANDB_API_KEY=YOUR_WANDB_KEY
    
    accelerate launch \
        --num_machines 1 \
        --num_processes 8 \
        --machine_rank 0 \
        --config_file config/accelerate_config.yaml \
        train.py \
        --project_name Allegro_Finetune_88x720p \
        --dit_config /huggingface/rhymes-ai/Allegro/transformer/config.json \
        --dit /huggingface/rhymes-ai/Allegro/transformer/ \
        --tokenizer /huggingface/rhymes-ai/Allegro/tokenizer \
        --text_encoder /huggingface/rhymes-ai/Allegro/text_encoder \
        --vae /huggingface/rhymes-ai/Allegro/vae \
        --vae_load_mode encoder_only \
        --enable_ae_compile \
        --dataset t2v \
        --data_dir /data_root/ \
        --meta_file data.parquet \
        --sample_rate 2 \
        --num_frames 88 \
        --max_height 720 \
        --max_width 1280 \
        --hw_thr 1.0 \
        --hw_aspect_thr 1.5 \
        --dataloader_num_workers 10 \
        --gradient_checkpointing \
        --train_batch_size 1 \
        --gradient_accumulation_steps 1 \
        --max_train_steps 1000000 \
        --learning_rate 1e-4 \
        --lr_scheduler constant \
        --lr_warmup_steps 0 \
        --mixed_precision bf16 \
        --report_to wandb \
        --allow_tf32 \
        --enable_stable_fp32 \
        --model_max_length 512 \
        --cfg 0.1 \
        --checkpointing_steps 100 \
        --resume_from_checkpoint latest \
        --output_dir ./output/Allegro_Finetune_88x720p
  4. (Optional) To customize the model training arguments, you may create a .json file following config.json. Feel free to use our training code to train a video diffusion model from scratch.
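
The dataset preparation in step 2 can be sketched as follows. This is a minimal illustration that serializes cases in the same list-of-dicts layout as the example in step 2 and shows how a relative path combines with the data root; the helper names are hypothetical, and whether dataset_utils.py expects exactly this serialization is an assumption to verify against the script.

```python
import json
import os

# Keys taken from the example case in step 2.
REQUIRED_KEYS = ("path", "num_frames", "height", "width", "cap")


def make_case(path, num_frames, height, width, cap):
    """Build one data case as a dict with the keys shown in step 2."""
    return {"path": path, "num_frames": num_frames,
            "height": height, "width": width, "cap": cap}


def cases_to_json(cases):
    """Serialize cases as a JSON list, mirroring the README example."""
    for case in cases:
        missing = [k for k in REQUIRED_KEYS if k not in case]
        if missing:
            raise ValueError(f"case {case!r} is missing keys: {missing}")
    return json.dumps(cases, indent=4)


def resolve_video_path(data_dir, rel_path):
    """Join the data root with a case's relative path, as step 2 describes."""
    return os.path.join(data_dir, rel_path)


if __name__ == "__main__":
    cases = [make_case("foo/bar.mp4", 123, 1080, 1920, "This is a fake caption.")]
    print(cases_to_json(cases))
    print(resolve_video_path("/data_root/", cases[0]["path"]))
```

Keeping `path` relative means the same metadata file can be reused when the dataset moves; only `--data_dir` needs to change. Note that `os.path.join` discards the data root if `path` is absolute.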

Limitation

  • The model cannot render celebrities, legible text, specific locations, streets or buildings.

Future Plan

  • Multiple GPU inference and further speed up (PAB)
  • Text & Image-To-Video (TI2V) video generation
  • Training for T2V&TI2V
  • Motion-controlled video generation
  • Visual quality enhancement

Support

If you encounter any problems or have any suggestions, feel free to open an issue or send an email to huanyang@rhymes.ai.

Citation

Please consider citing our technical report if you find the code and pre-trained models useful for your project.

@article{allegro2024,
  title={Allegro: Open the Black Box of Commercial-Level Video Generation Model},
  author={Yuan Zhou and Qiuyue Wang and Yuxuan Cai and Huan Yang},
  journal={arXiv preprint arXiv:2410.15458},
  year={2024}
}

License

This repo is released under the Apache 2.0 License.

Disclaimer

The Allegro series models are provided on an "AS IS" basis, and we disclaim any liability for consequences or damages arising from your use. Users are kindly advised to ensure compliance with all applicable laws and regulations. This includes, but is not limited to, prohibitions against illegal activities and the generation of content that is violent, pornographic, obscene, or otherwise deemed non-safe, inappropriate, or illegal. By using these models, you agree that we shall not be held accountable for any consequences resulting from your use.

Acknowledgment

We extend our heartfelt appreciation for the great contributions of the open-source community, especially Open-Sora-Plan, as we built our diffusion transformer (DiT) on Open-Sora-Plan v1.2.

  • Open-Sora-Plan: A project that aims to create a simple and scalable repo to reproduce Sora.
  • Open-Sora: An initiative dedicated to efficiently producing high-quality video.
  • ColossalAI: A powerful large model parallel acceleration and optimization system.
  • VideoSys: An open-source project that provides a user-friendly and high-performance infrastructure for video generation.
  • DiT: Scalable Diffusion Models with Transformers.
  • PixArt: An open-source DiT-based text-to-image model.
  • StabilityAI VAE: A powerful image VAE model.
  • CLIP: A powerful text-image embedding model.
  • T5: A powerful text encoder.
  • Playground: A state-of-the-art open-source model in text-to-image generation.
  • EMA-VFI: A video frame interpolation model.
