
LLaVA-Unified

This repo is a multimodal LLM factory for image and video understanding. It supports the training and deployment of multimodal LLMs built on the latest open-source LLMs, such as Llama-3.1/3.2 and Qwen2.5.

Models

Image Understanding Models

🤝 [LLaVA-Llama-3-8B]

Video Understanding Models

🤝 [LLaVA-Video-Qwen2.5-7B]

🤝 [LLaVA-Video-Llama-3.1-8B]

Updates

  • [12/17/2024] A new video-based MLLM, LLaVA-Video-Qwen2.5-7B, is released, with SigLIP-g-384px as the vision encoder and an average-pooling vision-language projector.
  • [8/11/2024] A completely new video-based MLLM, LLaVA-Video-Llama-3.1-8B, is released, with SigLIP-g-384px as the vision encoder and an average-pooling vision-language projector.
  • [6/4/2024] The codebase now supports fine-tuning on video data for video understanding tasks.
  • [5/14/2024] The codebase has been upgraded to llava-next (llava-v1.6). It now supports the latest llama-3, phi-3, and mistral-v0.1-7b models.

Install

If you are using Windows, do NOT proceed; see the instructions here.

  1. Setup
conda create -n llava python=3.10 -y
conda activate llava
pip install --upgrade pip  # enable PEP 660 support
pip install -e .
  2. Install additional packages for training
pip install -e ".[train]"
pip install flash-attn --no-build-isolation
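
A quick sanity check that the install worked. This is a minimal sketch that assumes the package installed by pip install -e . is named llava and that flash-attn compiled against your local CUDA toolkit:

python -c "import llava; print('llava imported OK')"
python -c "import flash_attn; print('flash-attn imported OK')"  # only needed for training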

Fine-Tune Your Own LLaVA-Video-Llama-3 Model

Please follow the updated fine-tuning script with DeepSpeed ZeRO-3: finetune.sh. The following parameter is updated to accommodate Llama-3:

  • --version: v3, which adopts the tokenization and preprocessing functions for the Llama-3 tokenizer.

Please download the pre-trained vision-language projector weights from Projector_MODEL.

For image data preparation, please follow DATA.md. The mixed SFT data with video instructions is available at video_data.
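
For orientation, here is an abridged sketch of what the launch inside finetune.sh typically looks like, assuming the standard LLaVA training entry point and argument names; every path, model name, and the ZeRO-3 config file below are placeholders, and finetune.sh remains the authoritative argument list:

# Hypothetical, abridged launch; replace all paths and names with your own.
deepspeed llava/train/train_mem.py \
    --deepspeed ./scripts/zero3.json \
    --model_name_or_path meta-llama/Meta-Llama-3.1-8B-Instruct \
    --version v3 \
    --data_path ./data/video_sft_mix.json \
    --image_folder ./data/images \
    --vision_tower <siglip-vision-encoder> \
    --pretrain_mm_mlp_adapter ./checkpoints/projector/mm_projector.bin \
    --output_dir ./checkpoints/llava-video-llama-3.1-8b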

Demo with Gradio

CUDA_VISIBLE_DEVICES=0 python -m llava.serve.gradio_web_server --controller http://localhost:10000 --model-list-mode reload --share
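
The Gradio app above expects a running controller and at least one model worker. A minimal sketch, assuming the standard LLaVA serving modules (llava.serve.controller and llava.serve.model_worker); the model path is a placeholder for your own checkpoint:

# Controller (the web server above points at port 10000)
python -m llava.serve.controller --host 0.0.0.0 --port 10000

# Model worker, registered with the controller (model path is a placeholder)
CUDA_VISIBLE_DEVICES=0 python -m llava.serve.model_worker --host 0.0.0.0 \
    --controller http://localhost:10000 --port 40000 \
    --worker http://localhost:40000 \
    --model-path ./checkpoints/llava-video-llama-3.1-8b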

Evaluation

TODO

Credits

This is a reproduction project; all research credit should be attributed to the original authors of LLaVA. Please cite the papers listed below as well.

@misc{wang2024llavaunified,
  title={LLaVA-Unified: A Multimodal LLM Factory for Image and Video Understanding},
  author={Wang, Weizhi},
  year={2024}
}

@misc{wang2024llavavideollama3,
  title={LLaVA-Video-Llama-3: A Video Understanding Multimodal LLM based on Llama-3-8B LLM backbone},
  author={Wang, Weizhi},
  year={2024}
}