This repo is a Multimodal LLM Factory for Image and Video Understanding. It supports the training and deployment of multimodal LLMs built on the latest open-source LLMs, such as Llama-3.1/3.2 and Qwen2.5.
🤝 [LLaVA-Llama-3-8B]
- [12/17/2024] A new video-based MLLM, LLaVA-Video-Qwen2.5-7B, is released, with SigLIP-g-384px as the vision encoder and an average-pooling vision-language projector.
- [8/11/2024] A completely new video-based MLLM, LLaVA-Video-Llama-3.1-8B, is released, with SigLIP-g-384px as the vision encoder and an average-pooling vision-language projector.
- [6/4/2024] The codebase supports fine-tuning on video data for video understanding tasks.
- [5/14/2024] The codebase has been upgraded to LLaVA-NeXT (llava-v1.6). It now supports the latest Llama-3, Phi-3, and Mistral-v0.1-7B models.
If you are using Windows, do NOT proceed with these steps; see the instructions here.
- Setup
```bash
conda create -n llava python=3.10 -y
conda activate llava
pip install --upgrade pip  # enable PEP 660 support
pip install -e .
```
- Install additional packages for training
pip install -e ".[train]"
pip install flash-attn --no-build-isolation
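To confirm the environment is set up correctly, a quick sanity check such as the one below can help; it assumes the editable install registers the package under the name `llava` and that a CUDA-capable GPU is visible to PyTorch.

```bash
# Sanity check: verify that PyTorch sees the GPU and that the llava package imports.
# Assumes the editable install above exposes the package as "llava".
python -c "import torch; print('CUDA available:', torch.cuda.is_available())"
python -c "import llava; print('llava imported from:', llava.__file__)"
```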
Please follow the updated fine-tuning script with DeepSpeed ZeRO-3: finetune.sh. The following parameter is updated to accommodate Llama-3:
- `--version`: v3, which adopts the tokenization and preprocessing functions of the Llama-3 tokenizer.
Please download the pre-trained vision-language projector weights from Projector_MODEL.
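For reference, below is a minimal sketch of what a Llama-3 fine-tuning launch with ZeRO-3 might look like. It is an illustration, not the repo's actual script: the model name, data paths, vision tower identifier, and the `--pretrain_mm_mlp_adapter` argument are assumptions, and the authoritative flags live in finetune.sh.

```bash
# Illustrative sketch only -- defer to finetune.sh for the authoritative flags and hyperparameters.
# Model identifiers and paths below are placeholders, not shipped defaults.
deepspeed llava/train/train_mem.py \
    --deepspeed ./scripts/zero3.json \
    --version v3 \
    --model_name_or_path meta-llama/Meta-Llama-3-8B-Instruct \
    --vision_tower google/siglip-so400m-patch14-384 \
    --pretrain_mm_mlp_adapter ./checkpoints/projector/mm_projector.bin \
    --data_path ./playground/data/sft_mix.json \
    --image_folder ./playground/data \
    --bf16 True \
    --num_train_epochs 1 \
    --per_device_train_batch_size 16 \
    --output_dir ./checkpoints/llava-llama-3-8b-finetune
```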
For image data preparation, please follow DATA.md. The mixed SFT data with video instructions is available at video_data.
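As a rough illustration, LLaVA-style SFT data is typically a JSON list of conversation records. The snippet below writes one such record; the field names follow the common LLaVA convention and are assumptions here, so the exact schema in DATA.md takes precedence.

```bash
# Write a single illustrative record in the common LLaVA conversation format.
# The exact schema (field names, media keys) is defined in DATA.md; treat this as a sketch.
cat > sample_sft_record.json <<'EOF'
[
  {
    "id": "000001",
    "image": "images/000001.jpg",
    "conversations": [
      {"from": "human", "value": "<image>\nWhat is shown in this image?"},
      {"from": "gpt", "value": "A short description of the image goes here."}
    ]
  }
]
EOF
```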
```bash
CUDA_VISIBLE_DEVICES=0 python -m llava.serve.gradio_web_server --controller http://localhost:10000 --model-list-mode reload --share
```
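Note that the Gradio web server connects to a controller at the `--controller` address and serves whichever model workers have registered with it. In upstream LLaVA the full launch sequence looks roughly like the sketch below; the module names follow upstream LLaVA and the checkpoint path is a placeholder, so adjust to this repo as needed.

```bash
# Launch the controller first (the gradio server above connects to it on port 10000).
python -m llava.serve.controller --host 0.0.0.0 --port 10000

# In a second terminal, start a model worker that registers with the controller.
# The checkpoint path below is a placeholder.
CUDA_VISIBLE_DEVICES=0 python -m llava.serve.model_worker \
    --host 0.0.0.0 \
    --controller http://localhost:10000 \
    --port 40000 \
    --worker http://localhost:40000 \
    --model-path ./checkpoints/llava-llama-3-8b-finetune

# Finally, start the gradio web server as shown above.
```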
TODO
This is a reproduction project; all research credit should be attributed to the original authors of LLaVA. Please also cite their papers alongside the entries below.
```bibtex
@misc{wang2024llavaunified,
  title={LLaVA-Unified: A Multimodal LLM Factory for Image and Video Understanding},
  author={Wang, Weizhi},
  year={2024}
}

@misc{wang2024llavavideollama3,
  title={LLaVA-Video-Llama-3: A Video Understanding Multimodal LLM based on Llama-3-8B LLM backbone},
  author={Wang, Weizhi},
  year={2024}
}
```