Model Composition for Multimodal Large Language Models

This repo contains the codes for our paper Model Composition for Multimodal Large Language Models (ACL 2024).

Install

Clone this repository and navigate to ModelCompose folder

git clone https://github.com/THUNLP-MT/ModelCompose.git
cd ModelCompose

Install Package

conda create -n modelcompose python=3.10 -y
conda activate modelcompose
pip install -r requirements.txt

Preparation

Data

Before training or evaluation, please prepare the datasets based on your need. Json files can be downloaded from Hugging Face.

You can organize them under ./data as follows:

data
├── test
│   └── [json files]
├── train
│   └── [json files]
├── evaluation_datasets
├── audiocaps
│   ├── train
|   └── test
├── clotho
│   └── audio
├── coco
│   └── train2017
├── gqa
│   └── images
├── ocr_vqa
│   └── images
├── textvqa
│   └── train_images
├── activitynet
├── PointCloud
│   └── 8192_npy
├── WavCaps
│   └── audios
|       ├── AudioSet_SL_flac
|       ├── BBC_Sound_Effects_flac
|       ├── FreeSound_flac
|       └── SoundBible_flac
└── vg
    ├── VG_100K
    └── VG_100K_2

Base model

We use vicuna-7b-v1.5 as our base model. Download

Train

We apply a two-stage training paradigm. Find pretrain and finetune scripts under ./scripts/model_composition/train. Please modify the following parameters in the scripts: BASE_PATH, ROOT, and MODEL_BASE.

Note that we use Video-LLaVA-Pretrain-7B as pretrained checkpoint for text-video modalities. Download pretrained checkpoints for Video-LLaVA-Pretrain-7B if needed.

We have released our trained checkpoints at Hugging Face.

Evaluation

Merge checkpoints

Seperately trained checkpoints should be merged before evaluation. Specify parameter adjustment coefficient in --strategy param starts with online-merge-reset-default-. Use vision, video, audio, point for each modality.

python scripts/model_composition/merge_unimodal_modelcompose.py \
                checkpoints/multimodal-vicuna-7b-v1.5-video-damc \
                checkpoints/multimodal-vicuna-7b-v1.5-audio-damc \
                checkpoints/multimodal-vicuna-7b-v1.5-vision-damc \
                -o checkpoints/multimodal-pdt-video-image-audio \
                --strategy online-merge-reset-default-video=0.333,default-audio=0.333,default-vision=0.333

Run evaluation

Note that the basename of the checkpoint should contain "multimodal" to load correctly. Replace "multimodal-checkpoint-name" in the following command with your merged checkpoint.

AVQA

bash scripts/model_composition/test/avqa.sh \
    0,1,2,3,4,5,6,7 \
    multimodal-checkpoint-name \
    [modal] \
    path/to/vicuna-7b-v1.5

Choose [modal] from [audio, image, video, image+audio, image+video, video+audio, video+image+audio].

Replace "0,1,2,3,4,5,6,7" with actual available gpu.

MUSIC-AVQA

bash scripts/model_composition/test/music_avqa_video+image+audio.sh \
    0,1,2,3,4,5,6,7 \
    multimodal-checkpoint-name \
    path/to/vicuna-7b-v1.5

Find scripts for other modalities combinations under scripts/model_composition/test.

MCUB

bash scripts/model_composition/test/MCUB-4.sh \
    0,1,2,3,4,5,6,7 \
    multimodal-checkpoint-name \
    path/to/vicuna-7b-v1.5

bash scripts/model_composition/test/MCUB-3.sh \
    0,1,2,3,4,5,6,7 \
    multimodal-checkpoint-name \
    [modal] \
    path/to/vicuna-7b-v1.5

Choose [modal] from [image-audio-video, audio-video-pointcloud, image-audio-pointcloud, image-video-pointcloud].

Citation

If you find our work useful, please consider giving this repository a star and citing our paper.

@inproceedings{chen-etal-2024-model,
    title = "Model Composition for Multimodal Large Language Models",
    author = "Chen, Chi  and
      Du, Yiyang  and
      Fang, Zheng  and
      Wang, Ziyue  and
      Luo, Fuwen  and
      Li, Peng  and
      Yan, Ming  and
      Zhang, Ji  and
      Huang, Fei  and
      Sun, Maosong  and
      Liu, Yang",
    editor = "Ku, Lun-Wei  and
      Martins, Andre  and
      Srikumar, Vivek",
    booktitle = "Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = aug,
    year = "2024",
    address = "Bangkok, Thailand",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.acl-long.606/",
    doi = "10.18653/v1/2024.acl-long.606",
    pages = "11246--11262",
}

Acknowledgement

LLaVA: the codebase we built upon, and it offers strong language & vision abilities.

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
modelcompose		modelcompose
playground/data/prompts		playground/data/prompts
scripts		scripts
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
demo_app.py		demo_app.py
pyproject.toml		pyproject.toml
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Model Composition for Multimodal Large Language Models

Contents

Install

Preparation

Train

Evaluation

Citation

Acknowledgement

About

Releases

Packages

Languages

License

THUNLP-MT/ModelCompose

Folders and files

Latest commit

History

Repository files navigation

Model Composition for Multimodal Large Language Models

Contents

Install

Preparation

Train

Evaluation

Citation

Acknowledgement

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages