Hallo3: Highly Dynamic and Realistic Portrait Image Animation with Diffusion Transformer Networks

Jiahao Cui¹  Hui Li¹  Yun Zhan¹  Hanlin Shang¹  Kaihui Cheng¹  Yuqi Ma¹  Shan Mu¹
Hang Zhou²  Jingdong Wang²  Siyu Zhu¹✉️
¹Fudan University  ²Baidu Inc


Hallo3-main.mp4

📸 Showcase

0001.mp4
0003.mp4
0008.mp4
0018.mp4
0022.mp4
0009.mp4

Visit our project page to view more cases.

📰 News

  • 2025/01/27: 🎉🎉🎉 Released the training data on HuggingFace. It includes over 70 hours of pure talking-head videos and more than 50 wild-scene video clips.

โš™๏ธ Installation

  • System requirement: Ubuntu 20.04/Ubuntu 22.04, CUDA 12.1
  • Tested GPUs: H100

Download the codes:

  git clone https://github.com/fudan-generative-vision/hallo3
  cd hallo3

Create conda environment:

  conda create -n hallo python=3.10
  conda activate hallo

Install packages with pip:

  pip install -r requirements.txt

In addition, ffmpeg is required:

  apt-get install ffmpeg

📥 Download Pretrained Models

You can easily get all pretrained models required for inference from our HuggingFace repo.

Use huggingface-cli to download the models:

cd $ProjectRootDir
pip install "huggingface_hub[cli]"
huggingface-cli download fudan-generative-ai/hallo3 --local-dir ./pretrained_models
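
If you prefer to stay in Python, the same download can be done with the snapshot_download helper from huggingface_hub. This is a minimal sketch; the repo id and target directory simply mirror the command above:

# Sketch: download the full model repo into ./pretrained_models
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="fudan-generative-ai/hallo3",
    local_dir="./pretrained_models",
)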

Or you can download them separately from their source repo:

Finally, these pretrained models should be organized as follows:

./pretrained_models/
|-- audio_separator/
|   |-- download_checks.json
|   |-- mdx_model_data.json
|   |-- vr_model_data.json
|   `-- Kim_Vocal_2.onnx
|-- cogvideox-5b-i2v-sat/
|   |-- transformer/
|   |   |-- 1/
|   |   |   `-- mp_rank_00_model_states.pt
|   |   `-- latest
|   `-- vae/
|       `-- 3d-vae.pt
|-- face_analysis/
|   `-- models/
|       |-- face_landmarker_v2_with_blendshapes.task  # face landmarker model from mediapipe
|       |-- 1k3d68.onnx
|       |-- 2d106det.onnx
|       |-- genderage.onnx
|       |-- glintr100.onnx
|       `-- scrfd_10g_bnkps.onnx
|-- hallo3/
|   |-- 1/
|   |   `-- mp_rank_00_model_states.pt
|   `-- latest
|-- t5-v1_1-xxl/
|   |-- added_tokens.json
|   |-- config.json
|   |-- model-00001-of-00002.safetensors
|   |-- model-00002-of-00002.safetensors
|   |-- model.safetensors.index.json
|   |-- special_tokens_map.json
|   |-- spiece.model
|   `-- tokenizer_config.json
`-- wav2vec/
    `-- wav2vec2-base-960h/
        |-- config.json
        |-- feature_extractor_config.json
        |-- model.safetensors
        |-- preprocessor_config.json
        |-- special_tokens_map.json
        |-- tokenizer_config.json
        `-- vocab.json
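
After downloading, a quick check can confirm the layout matches the tree above. This is only an illustrative sketch; the file list is a small subset of the tree, and the root path assumes the default ./pretrained_models location:

# Sketch: verify a few key checkpoints exist under ./pretrained_models
from pathlib import Path

root = Path("./pretrained_models")
expected = [
    "audio_separator/Kim_Vocal_2.onnx",
    "cogvideox-5b-i2v-sat/vae/3d-vae.pt",
    "face_analysis/models/scrfd_10g_bnkps.onnx",
    "hallo3/1/mp_rank_00_model_states.pt",
    "t5-v1_1-xxl/model.safetensors.index.json",
    "wav2vec/wav2vec2-base-960h/model.safetensors",
]
missing = [p for p in expected if not (root / p).exists()]
print("All key checkpoints found." if not missing else f"Missing: {missing}")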

๐Ÿ› ๏ธ Prepare Inference Data

Hallo3 has a few simple requirements for its inference input data (a minimal pre-flight check is sketched below):

  1. The reference image must have a 1:1 or 3:2 aspect ratio.
  2. The driving audio must be in WAV format.
  3. The audio must be in English, since our training datasets contain only this language.
  4. Ensure the vocals in the audio are clear; background music is acceptable.
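
The sketch below is one way to pre-check these requirements before launching inference. It assumes Pillow is available (check requirements.txt), uses placeholder file paths, and whether the portrait orientation (2:3) is also accepted is an assumption here:

# Sketch: pre-flight check of the reference image and driving audio
import wave
from PIL import Image

def check_inputs(image_path: str, audio_path: str) -> None:
    width, height = Image.open(image_path).size
    ratio = width / height
    # Accept 1:1 and 3:2 (and, as an assumption, 2:3) with a small tolerance.
    if not any(abs(ratio - r) < 0.02 for r in (1.0, 3 / 2, 2 / 3)):
        raise ValueError(f"Reference image aspect ratio {ratio:.2f} is not 1:1 or 3:2.")
    with wave.open(audio_path, "rb") as wav:  # raises if the file is not a valid WAV
        print(f"WAV ok: {wav.getframerate()} Hz, {wav.getnchannels()} channel(s)")

check_inputs("examples/reference.png", "examples/driving_audio.wav")  # placeholder paths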

🎮 Run Inference

Gradio UI

To run the Gradio UI, simply run hallo3/app.py:

python hallo3/app.py

Gradio Demo

Batch

Simply run scripts/inference_long_batch.sh:

bash scripts/inference_long_batch.sh ./examples/inference/input.txt ./output

Animation results will be saved to ./output. You can find more inference examples in the examples folder.

Training

Prepare data for training

Begin by downloading the training dataset from the HuggingFace Dataset Repo. This dataset contains over 70 hours of talking-head videos focused on the speaker's face and speech, as well as more than 50 wild-scene clips from various real-world settings. After downloading, unzip all the .tgz files and organize the data into the following directory structure:

dataset_name/
|-- videos/
|   |-- 0001.mp4
|   |-- 0002.mp4
|   `-- 0003.mp4
|-- caption/
|   |-- 0001.txt
|   |-- 0002.txt
|   `-- 0003.txt

You can use any dataset_name, but ensure the videos directory and caption directory are named as shown above.
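
Before preprocessing, it can help to confirm that every video has a matching caption file. A minimal sketch, with dataset_name as a placeholder for your dataset directory:

# Sketch: check that each video in videos/ has a caption in caption/
from pathlib import Path

dataset = Path("dataset_name")  # replace with your dataset directory
videos = sorted((dataset / "videos").glob("*.mp4"))
missing = [v.name for v in videos if not (dataset / "caption" / f"{v.stem}.txt").exists()]
print("All videos have captions." if not missing else f"Videos without captions: {missing}")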

Next, process the videos with the following command:

bash scripts/data_preprocess.sh {dataset_name} {parallelism} {rank} {output_name}
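
For example, assuming parallelism is the number of parallel shards and rank is the zero-based index of the shard handled by this invocation (an interpretation of the arguments, not spelled out here), a single run over a dataset named my_dataset might look like:

bash scripts/data_preprocess.sh my_dataset 1 0 my_dataset_stage1

The resulting metadata file is then presumably what the train_data entries (./data/output_name.json) refer to in the training configs below.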

Training

Update the data meta path settings in the configuration YAML files, configs/sft_s1.yaml and configs/sft_s2.yaml:

#sft_s1.yaml
train_data: [
    "./data/output_name.json"
]

#sft_s2.yaml
train_data: [
    "./data/output_name.json"
]

Start training with the following command:

# stage1
bash scripts/finetune_multi_gpus_s1.sh

# stage2
bash scripts/finetune_multi_gpus_s2.sh

๐Ÿ“ Citation

If you find our work useful for your research, please consider citing the paper:

@misc{cui2024hallo3,
	title={Hallo3: Highly Dynamic and Realistic Portrait Image Animation with Diffusion Transformer Networks}, 
	author={Jiahao Cui and Hui Li and Yun Zhan and Hanlin Shang and Kaihui Cheng and Yuqi Ma and Shan Mu and Hang Zhou and Jingdong Wang and Siyu Zhu},
	year={2024},
	eprint={2412.00733},
	archivePrefix={arXiv},
	primaryClass={cs.CV}
}

โš ๏ธ Social Risks and Mitigations

The development of portrait image animation technologies driven by audio inputs poses social risks, such as the ethical implications of creating realistic portraits that could be misused for deepfakes. To mitigate these risks, it is crucial to establish ethical guidelines and responsible use practices. Privacy and consent concerns also arise from using individuals' images and voices. Addressing these involves transparent data usage policies, informed consent, and safeguarding privacy rights. By addressing these risks and implementing mitigations, the research aims to ensure the responsible and ethical development of this technology.

🤗 Acknowledgements

This model is a fine-tuned derivative of the CogVideoX-5B I2V model. CogVideoX-5B is an open-source text-to-video generation model developed by the CogVideoX team. Its original code and model parameters are governed by the CogVideoX-5B LICENSE.

As a derivative work of CogVideoX-5B, the use, distribution, and modification of this model must comply with the CogVideoX-5B license terms.

๐Ÿ‘ Community Contributors

Thank you to all the contributors who have helped to make this project better!
