MMAudio: Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis [arXiv 2024]

Ho Kei Cheng, Masato Ishii, Akio Hayakawa, Takashi Shibuya, Alexander Schwing, Yuki Mitsufuji

University of Illinois Urbana-Champaign, Sony AI, and Sony Group Corporation

Note: This repository is still under construction. Single-example inference should work as expected. The training code will be added. Code is subject to non-backward-compatible changes.

Highlight

MMAudio generates synchronized audio given video and/or text inputs. Our key innovation is multimodal joint training which allows training on a wide range of audio-visual and audio-text datasets. Moreover, a synchronization module aligns the generated audio with the video frames.

Results

(All audio is generated by our algorithm, MMAudio.)

Videos from Sora:

sora_v2_comp.mp4

Videos from Veo 2:

veo_results_lower_bitrate.mp4

Videos from MovieGen/Hunyuan Video/VGGSound:

results_concat.mp4

For more results, visit https://hkchengrex.com/MMAudio/video_main.html.

Update Logs

  • 2024-12-14: Removed the ffmpeg<7 requirement for the demos by replacing torio.io.StreamingMediaDecoder with PyAV for reading frames. The decoded frames are also cached, so the same frames are not read again during reconstruction (see the sketch after this list). This speeds things up and makes installation less of a hassle.
  • 2024-12-13: Improved the for-loop processing in CLIP/Sync feature extraction by introducing a batch-size multiplier. We can use roughly a 40x larger batch size for CLIP/Sync without using more memory, which speeds up processing. Also removed the VAE encoder during inference -- we don't need it.
  • 2024-12-11: Replaced torio.io.StreamingMediaDecoder with PyAV for reading the frame rate when reconstructing the input video. torio.io.StreamingMediaDecoder does not work reliably in Hugging Face ZeroGPU's environment, and it might not work in some other environments either.
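
The PyAV-based frame reading and caching can be sketched roughly as follows. This is an illustration, not the repository's actual implementation; the function names and the cache are made up for the example.

import av
import numpy as np

def read_frames(video_path: str):
    # Decode all frames of the first video stream as RGB arrays,
    # and keep the stream's frame rate for re-muxing the output video.
    with av.open(video_path) as container:
        stream = container.streams.video[0]
        fps = float(stream.average_rate)
        frames = [f.to_ndarray(format='rgb24') for f in container.decode(stream)]
    return np.stack(frames), fps

# Cache decoded frames so the same video is not decoded twice
# (once for feature extraction, once for reconstructing the output).
_cache: dict = {}

def read_frames_cached(video_path: str):
    if video_path not in _cache:
        _cache[video_path] = read_frames(video_path)
    return _cache[video_path]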

Installation

We have only tested this on Ubuntu.

Prerequisites

We recommend using a miniforge environment.

  • Python 3.9+
  • PyTorch 2.5.1+ and the corresponding torchvision/torchaudio (pick your CUDA version at https://pytorch.org/; pip install recommended)

1. Install the prerequisites if not yet met:

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118 --upgrade

(Or any other CUDA version that your GPU/driver supports)
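
Optionally, a quick sanity check (plain PyTorch, nothing MMAudio-specific) that the installed build can see your GPU:

python -c "import torch; print(torch.__version__, torch.cuda.is_available())"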

2. Clone our repository:

git clone https://github.com/hkchengrex/MMAudio.git

3. Install with pip (install pytorch first before attempting this!):

cd MMAudio
pip install -e .

(If you encounter the File "setup.py" not found error, upgrade your pip with pip install --upgrade pip)

Pretrained models:

The models will be downloaded automatically when you run the demo script. MD5 checksums are provided in mmaudio/utils/download_utils.py. The models are also available at https://huggingface.co/hkchengrex/MMAudio/tree/main.
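
If you download weights manually, you can check them against those MD5 checksums with a short sketch like the one below; the expected value is a placeholder, and the file path assumes the directory layout shown later in this section.

import hashlib
from pathlib import Path

def md5sum(path: Path, chunk_size: int = 1 << 20) -> str:
    # Stream the file in 1 MiB chunks so large checkpoints do not need to fit in memory.
    h = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            h.update(chunk)
    return h.hexdigest()

expected = '<md5 listed in mmaudio/utils/download_utils.py>'  # placeholder
weight = Path('weights/mmaudio_large_44k_v2.pth')
assert md5sum(weight) == expected, f'Checksum mismatch for {weight}'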

Model | Download link | File size
Flow prediction network, small 16kHz | mmaudio_small_16k.pth | 601M
Flow prediction network, small 44.1kHz | mmaudio_small_44k.pth | 601M
Flow prediction network, medium 44.1kHz | mmaudio_medium_44k.pth | 2.4G
Flow prediction network, large 44.1kHz | mmaudio_large_44k.pth | 3.9G
Flow prediction network, large 44.1kHz, v2 (recommended) | mmaudio_large_44k_v2.pth | 3.9G
16kHz VAE | v1-16.pth | 655M
16kHz BigVGAN vocoder (from Make-An-Audio 2) | best_netG.pt | 429M
44.1kHz VAE | v1-44.pth | 1.2G
Synchformer visual encoder | synchformer_state_dict.pth | 907M

To run the model, you need four components: a flow prediction network, visual feature extractors (Synchformer and CLIP; CLIP is downloaded automatically), a VAE, and a vocoder. VAEs and vocoders are specific to the sampling rate (16kHz or 44.1kHz), not to the model size. The 44.1kHz vocoder will be downloaded automatically.

The expected directory structure (full):

MMAudio
├── ext_weights
│   ├── best_netG.pt
│   ├── synchformer_state_dict.pth
│   ├── v1-16.pth
│   └── v1-44.pth
├── weights
│   ├── mmaudio_small_16k.pth
│   ├── mmaudio_small_44k.pth
│   ├── mmaudio_medium_44k.pth
│   ├── mmaudio_large_44k.pth
│   └── mmaudio_large_44k_v2.pth
└── ...

The expected directory structure (minimal, for the recommended model only):

MMAudio
├── ext_weights
│   ├── synchformer_state_dict.pth
│   └── v1-44.pth
├── weights
│   └── mmaudio_large_44k_v2.pth
└── ...
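
To populate the minimal layout above by hand, a sketch using huggingface_hub is given below. The filenames are assumed to match the table above and the layout of the Hugging Face repository; double-check them against https://huggingface.co/hkchengrex/MMAudio/tree/main before relying on this.

from huggingface_hub import hf_hub_download

# (filename, local directory) pairs for the recommended 44.1kHz v2 setup;
# the filenames are assumptions based on the table above.
files = [
    ('mmaudio_large_44k_v2.pth', 'weights'),
    ('v1-44.pth', 'ext_weights'),
    ('synchformer_state_dict.pth', 'ext_weights'),
]
for filename, local_dir in files:
    hf_hub_download(repo_id='hkchengrex/MMAudio', filename=filename, local_dir=local_dir)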

Demo

By default, these scripts use the large_44k_v2 model. In our experiments, inference takes only around 6GB of GPU memory (in 16-bit mode), which should fit on most modern GPUs.

Command-line interface

With demo.py

python demo.py --duration=8 --video=<path to video> --prompt "your prompt" 

The output (audio in .flac format and video in .mp4 format) will be saved in ./output. See the file for more options. Simply omit the --video option for text-to-audio synthesis; an example is shown below. The default output (and training) duration is 8 seconds. Longer or shorter durations can also work, but large deviations from the training duration may lower the quality.
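
For example, a text-to-audio call looks like this (the prompt text is just an illustration):

python demo.py --duration=8 --prompt "heavy rain hitting a tin roof"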

Gradio interface

Supports video-to-audio and text-to-audio synthesis. Use port forwarding (e.g., ssh -L 7860:localhost:7860 server) if necessary. The default port is 7860, which you can change in gradio_demo.py.

python gradio_demo.py

FAQ

  1. Video processing
    • Processing higher-resolution videos takes longer due to encoding and decoding, but it does not improve the quality of results.
    • The CLIP encoder resizes input frames to 384×384 pixels.
    • Synchformer resizes the shorter edge to 224 pixels and applies a center crop, focusing only on the central square of each frame.
  2. Frame rates
    • The CLIP model operates at 8 FPS, while Synchformer works at 25 FPS.
    • Frame rate conversion happens on the fly in the video reader.
    • For input videos with a frame rate below 25 FPS, frames are duplicated to match the required rate.
    • A rough sketch of these two input pipelines (resizing and frame rates) is given after this list.
  3. Failure cases: As with most models of this type, failures can occur, and the reasons are not always clear. Some known failure modes are listed under Known limitations below. If you notice a new failure mode or believe there's a bug, feel free to open an issue in the repository.
  4. Performance variations: There can be subtle performance variations across hardware and software environments. Possible causes include using or not using torch.compile, the video reader library/backend, inference precision, batch sizes, random seeds, etc. We (will) provide pre-computed results on standard benchmarks for reference. Results obtained from this codebase should be similar but might not be exactly the same.
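
To make the video preprocessing in items 1 and 2 concrete, here is a rough torchvision sketch of the two input pipelines. The parameters follow the description above, but the repository's own transforms may differ in details such as interpolation mode or normalization.

import torch
from torchvision.transforms import v2

# CLIP branch: frames sampled at 8 FPS, resized to 384x384.
clip_transform = v2.Compose([
    v2.Resize((384, 384), antialias=True),
    v2.ToDtype(torch.float32, scale=True),  # uint8 [0, 255] -> float [0, 1]
])

# Synchformer branch: frames sampled at 25 FPS, shorter edge to 224, then a center crop.
sync_transform = v2.Compose([
    v2.Resize(224, antialias=True),  # resize the shorter edge to 224
    v2.CenterCrop(224),              # keep only the central square
    v2.ToDtype(torch.float32, scale=True),
])

frames = torch.randint(0, 256, (16, 3, 720, 1280), dtype=torch.uint8)  # (T, C, H, W) dummy clip
clip_in = clip_transform(frames)   # -> (16, 3, 384, 384)
sync_in = sync_transform(frames)   # -> (16, 3, 224, 224)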

Known limitations

  1. The model sometimes generates unintelligible human-speech-like sounds.
  2. The model sometimes generates background music (which, without explicit training, will not be of high quality).
  3. The model struggles with unfamiliar concepts, e.g., it can generate "gunfire" but not "RPG firing".

We believe all three of these limitations can be addressed with more high-quality training data.

Training

Work in progress.

Evaluation

You can access the precomputed results on VGGSound, AudioCaps, and MovieGen here: https://huggingface.co/datasets/hkchengrex/MMAudio-precomputed-results

We have shared our evaluation code here: https://github.com/hkchengrex/av-benchmark

Training Datasets

MMAudio was trained on several datasets, including AudioSet, Freesound, VGGSound, AudioCaps, and WavCaps. These datasets are subject to specific licenses, which can be accessed on their respective websites. We do not guarantee that the pre-trained models are suitable for commercial use. Please use them at your own risk.

Citation

@inproceedings{cheng2024taming,
  title={Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis},
  author={Cheng, Ho Kei and Ishii, Masato and Hayakawa, Akio and Shibuya, Takashi and Schwing, Alexander and Mitsufuji, Yuki},
  booktitle={arXiv},
  year={2024}
}

Acknowledgement

Many thanks to: