
LLaVA-Unified

This repo is a multimodal LLM factory for image and video understanding. It supports the training and deployment of multimodal LLMs built on the latest open-source LLMs, such as Llama-3.1/3.2 and Qwen2.5.

Models

Image Understanding Models

🤝 [LLaVA-Llama-3-8B]

Video Understanding Models

🤝 [LLaVA-Video-Qwen2.5-7B]

🤝 [LLaVA-Video-Llama-3.1-8B]

Updates

  • [12/17/2024] A new video-based MLLM, LLaVA-Video-Qwen2.5-7B, is released, with SigLIP-g-384px as the vision encoder and an average-pooling vision-language projector.
  • [8/11/2024] A completely new video-based MLLM, LLaVA-Video-Llama-3.1-8B, is released, with SigLIP-g-384px as the vision encoder and an average-pooling vision-language projector.
  • [6/4/2024] The codebase now supports fine-tuning on video data for video understanding tasks.
  • [5/14/2024] The codebase has been upgraded to llava-next (llava-v1.6). It now supports the latest llama-3, phi-3, and mistral-v0.1-7b models.

Install

If you are using Windows, do NOT proceed; see the instructions here.

  1. Setup
conda create -n llava python=3.10 -y
conda activate llava
pip install --upgrade pip  # enable PEP 660 support
pip install -e .
  2. Install additional packages for training
pip install -e ".[train]"
pip install flash-attn --no-build-isolation
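
A quick sanity check that the install worked. This is a minimal sketch that assumes the package installed by pip install -e . is named llava and that flash-attn compiled against your local CUDA toolkit:

python -c "import llava; print('llava imported OK')"
python -c "import flash_attn; print('flash-attn imported OK')"  # only needed for training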

Fine-Tune Your Own LLaVA-Video-Llama-3 Model

Please follow the updated fine-tuning script with DeepSpeed ZeRO-3: finetune.sh. The following parameter is updated to accommodate Llama-3:

  • --version: v3, which adopts the tokenization and preprocessing functions for the Llama-3 tokenizer.

Please download the pre-trained vision-language projector weights from Projector_MODEL.

For image data preparation, please follow DATA.md. The mixed SFT data with video instructions is available at video_data.
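
For orientation, here is an abridged sketch of what the launch inside finetune.sh typically looks like, assuming the standard LLaVA training entry point and argument names; every path, model name, and the ZeRO-3 config file below are placeholders, and finetune.sh remains the authoritative argument list:

# Hypothetical, abridged launch; replace all paths and names with your own.
deepspeed llava/train/train_mem.py \
    --deepspeed ./scripts/zero3.json \
    --model_name_or_path meta-llama/Meta-Llama-3.1-8B-Instruct \
    --version v3 \
    --data_path ./data/video_sft_mix.json \
    --image_folder ./data/images \
    --vision_tower <siglip-vision-encoder> \
    --pretrain_mm_mlp_adapter ./checkpoints/projector/mm_projector.bin \
    --output_dir ./checkpoints/llava-video-llama-3.1-8b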

Demo with Gradio

CUDA_VISIBLE_DEVICES=0 python -m llava.serve.gradio_web_server --controller http://localhost:10000 --model-list-mode reload --share
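
The Gradio app above expects a running controller and at least one model worker. A minimal sketch, assuming the standard LLaVA serving modules (llava.serve.controller and llava.serve.model_worker); the model path is a placeholder for your own checkpoint:

# Controller (the web server above points at port 10000)
python -m llava.serve.controller --host 0.0.0.0 --port 10000

# Model worker, registered with the controller (model path is a placeholder)
CUDA_VISIBLE_DEVICES=0 python -m llava.serve.model_worker --host 0.0.0.0 \
    --controller http://localhost:10000 --port 40000 \
    --worker http://localhost:40000 \
    --model-path ./checkpoints/llava-video-llama-3.1-8b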

Evaluation

TODO

Credits

This is a reproduction project; all research credit should be attributed to the original authors of LLaVA. Please cite the papers listed below as well.

@misc{wang2024llavaunified,
  title={LLaVA-Unified: A Multimodal LLM Factory for Image and Video Understanding},
  author={Wang, Weizhi},
  year={2024}
}

@misc{wang2024llavavideollama3,
  title={LLaVA-Video-Llama-3: A Video Understanding Multimodal LLM based on Llama-3-8B LLM backbone},
  author={Wang, Weizhi},
  year={2024}
}