
Accelerating Multimodal Large Language Models via Dynamic Visual-Token Exit and the Empirical Findings

Qiong Wu, Wenhao Lin, Weihao Ye, Yiyi Zhou, Xiaoshuai Sun, Rongrong Ji

Media Analytics and Computing Lab, Department of Artificial Intelligence, School of Informatics, Xiamen University

Abstract

The excessive use of visual tokens in existing Multimodal Large Language Models (MLLMs) often exhibits obvious redundancy and brings in prohibitively expensive computation. To gain insights into this problem, we first conduct extensive empirical studies on the attention behaviors of MLLMs, and summarize three main inference stages in MLLMs: (i) Early fusion between tokens is first accomplished quickly. (ii) Intra-modality modeling then comes into play. (iii) Multimodal reasoning resumes and lasts until the end of inference. In particular, we reveal that visual tokens stop contributing to reasoning once the text tokens have received enough image information, yielding obvious visual redundancy. Based on these generalized observations, we propose a simple yet effective method to improve the efficiency of MLLMs, termed dynamic visual-token exit (DyVTE). DyVTE uses lightweight hyper-networks to perceive the text token status and decide the removal of all visual tokens after a certain layer, thereby addressing the observed visual redundancy. To validate DyVTE, we apply it to a set of MLLMs, including LLaVA, VILA, Eagle and InternVL, and conduct extensive experiments on a wide range of benchmarks. The experiment results not only show the effectiveness of DyVTE in improving MLLMs' efficiency, but also reveal the general modeling patterns of MLLMs, facilitating an in-depth understanding of MLLMs.
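The exit gate itself is not described in this README, so below is a minimal PyTorch sketch of the idea from the abstract: a lightweight hyper-network scores the text-token hidden states at each decoder layer and, once it judges that the text tokens have absorbed enough visual information, all visual tokens are dropped for the remaining layers. All names here (VTEGate, forward_with_dyvte, visual_mask) and the pooling/threshold choices are illustrative assumptions, not the repository's actual implementation.

import torch
import torch.nn as nn

class VTEGate(nn.Module):
    # Lightweight hyper-network: reads the text-token states at a layer and
    # predicts whether the visual tokens can be removed from this point on.
    def __init__(self, hidden_size, threshold=0.5):
        super().__init__()
        self.scorer = nn.Sequential(
            nn.Linear(hidden_size, hidden_size // 4),
            nn.GELU(),
            nn.Linear(hidden_size // 4, 1),
        )
        self.threshold = threshold

    def forward(self, text_states):
        # text_states: (batch, num_text_tokens, hidden_size)
        pooled = text_states.mean(dim=1)           # summarize the text-token status
        return torch.sigmoid(self.scorer(pooled))  # exit probability per sample

def forward_with_dyvte(layers, gates, hidden_states, visual_mask):
    # Run the decoder layers; once a gate fires, remove all visual tokens so
    # the remaining layers compute over text tokens only (batch size 1 assumed).
    exited = False
    for layer, gate in zip(layers, gates):
        hidden_states = layer(hidden_states)
        if not exited and gate(hidden_states[:, ~visual_mask, :]).item() > gate.threshold:
            hidden_states = hidden_states[:, ~visual_mask, :]  # visual-token exit
            exited = True
    return hidden_states

In the repository the gate is trained (see scripts/v1_5/finetune_visual_gate.sh in the Training section); the fixed threshold above is only for illustration.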

Install

  1. Install the package
conda create -n llava python=3.10 -y
conda activate llava
pip install --upgrade pip  # enable PEP 660 support
pip install -e .
  2. Install additional packages for training
pip install -e ".[train]"
pip install flash-attn --no-build-isolation

If you run into any problems, please refer to the installation guide of LLaVA.

Training

  1. Initialize the weights
python init_weight.py
  2. Start training
bash scripts/v1_5/finetune_visual_gate.sh

Evaluation

Please refer to the evaluation of LLaVA.

For example, to evaluate on the GQA dataset:

bash scripts/v1_5/eval/gqa.sh
