Efficient Inference of Vision Instruction-Following Models with Elastic Cache

This repository contains the PyTorch implementation of Elastic Cache (ECCV 2024).

Elastic Cache

Instruction encoding accounts for most of the theoretical computation cost, yet its actual latency is negligible. This underscores that not only the model weights but also the KV cache used during output generation can become a significant bottleneck.
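
To make the memory pressure concrete, here is a rough back-of-the-envelope sketch of KV cache size. The function name `kv_cache_bytes` is hypothetical, and the configuration values (32 layers, hidden size 4096, fp16 storage, standard multi-head attention without grouped-query sharing) are assumptions about a LLaMA-7B-style backbone rather than numbers taken from this repository.

```python
def kv_cache_bytes(num_layers, hidden_size, num_tokens,
                   bytes_per_value=2, batch_size=1):
    """Bytes needed to store keys and values for every layer and token."""
    # Factor 2: one key vector and one value vector per token per layer.
    return 2 * num_layers * hidden_size * num_tokens * bytes_per_value * batch_size


# Assumed LLaMA-7B-style configuration (illustrative, not read from this repo).
per_token = kv_cache_bytes(num_layers=32, hidden_size=4096, num_tokens=1)
total = kv_cache_bytes(num_layers=32, hidden_size=4096, num_tokens=2048)

print(per_token / 2**20, "MiB per token")
print(total / 2**30, "GiB for 2048 tokens")
```

Under these assumptions the cache grows linearly with sequence length, so long visual instructions and long generations both add to the memory that must be read at every decoding step.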

We propose Elastic Cache, which combines cache merging based on the importance scores of instruction tokens with a fixed-point elimination strategy in the output generation phase. Our design yields significant inference acceleration while maintaining generation quality.
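
The two ideas can be illustrated with a minimal, framework-free sketch. This is not the code in `kv_cache.py`: the function names `merge_cache` and `evict_fixed_point` are hypothetical, and the sketch assumes that importance scores are given per token, that pruned entries are averaged into the nearest retained position, and that the eviction point is a fixed index chosen by the caller.

```python
def merge_cache(vectors, scores, keep):
    """Keep the `keep` highest-scoring positions as anchors and average every
    other position into its nearest anchor (by index distance)."""
    if not 0 < keep <= len(vectors):
        raise ValueError("keep must be in 1..len(vectors)")
    # Indices of the top-`keep` scores, restored to sequence order.
    by_score = sorted(range(len(scores)), key=lambda i: -scores[i])
    anchors = sorted(by_score[:keep])
    # Assign each position to its nearest anchor; ties go to the earlier anchor.
    groups = {a: [] for a in anchors}
    for i, vec in enumerate(vectors):
        nearest = min(anchors, key=lambda a: (abs(a - i), a))
        groups[nearest].append(vec)
    # Each merged entry is the element-wise mean of its group.
    merged = []
    for a in anchors:
        members = groups[a]
        dim = len(members[0])
        merged.append([sum(m[d] for m in members) / len(members)
                       for d in range(dim)])
    return anchors, merged


def evict_fixed_point(cache, budget, fixed_index):
    """While the cache exceeds `budget`, drop the entry at `fixed_index`."""
    if not 0 <= fixed_index < budget:
        raise ValueError("fixed_index must lie inside the budget")
    while len(cache) > budget:
        del cache[fixed_index]
    return cache


# Toy instruction cache: four 1-d "vectors" with assumed importance scores.
anchors, merged = merge_cache([[0.0], [2.0], [4.0], [6.0]],
                              [0.1, 0.9, 0.2, 0.8], keep=2)
print(anchors, merged)

# Toy generation cache: entries are dropped at index 2 until the budget is met.
print(evict_fixed_point(list(range(6)), budget=4, fixed_index=2))
```

Merging rather than discarding keeps some information from low-importance tokens, while evicting at one fixed position keeps the per-step bookkeeping during generation trivial.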

Get Started

Environment Setup:

We choose LLaVA-1.5 and Qwen-VL as our base models. You can install the following dependencies for Elastic Cache evaluation:
```
pip install -r requirements.txt
```
Initial Weights:

We use LLaVA-1.5-7B, LLaVA-1.5-13B, and Qwen-VL in our experiments. Download these models and place them at /path/to/model.
Download Eval Data:

You can download our pre-processed MM-Vet dataset here and put it at ./playground/data/mm-vet. Our selected LLaVA-Description dataset will be released soon.

You can also prepare your own conversations for testing by following the format in the JSON file.
Eval

Please refer to EVAL.md for detailed instructions on evaluation, including generation, PPL evaluation, ROUGE evaluation, and the latency test.
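
For orientation, a latency test generally follows the pattern below. This is a generic sketch, not the logic of `eval_latency.py`; the helper name `measure_latency` and its parameters are hypothetical.

```python
import statistics
import time


def measure_latency(fn, warmup=2, repeats=5):
    """Return the median wall-clock time of `fn()` in seconds."""
    # Warm-up runs absorb one-time costs such as lazy initialization.
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(repeats):
        # When timing GPU work, synchronize the device before reading the
        # clock (e.g. torch.cuda.synchronize()), since kernels run asynchronously.
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


# Stand-in workload; replace with a call that runs generation.
print(measure_latency(lambda: sum(range(100_000))))
```

The median is used here because it is less sensitive than the mean to occasional slow runs.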

Quantitative and Qualitative Results

We evaluate Elastic Cache together with baselines (H2O and StreamingLLM) on the PPL (lower is better) and ROUGE (higher is better) metrics. We run experiments with LLaVA-1.5 at different sizes (a), (b) and Qwen-VL-7B (c) on visual tasks. Elastic Cache consistently outperforms the baselines.
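
As a reminder of what the two metrics measure, here is a small self-contained sketch. It is not the repository's `eval_ppl.py` or `eval_rouge.py`: the function names are hypothetical, whitespace tokenization is a simplification, and ROUGE-L is chosen as an assumed variant since the text above does not say which ROUGE score is reported.

```python
import math


def perplexity(token_log_probs):
    """exp of the negative mean natural-log probability of the reference tokens."""
    if not token_log_probs:
        raise ValueError("need at least one token")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """ROUGE-L F1 over whitespace tokens."""
    cand, ref = candidate.split(), reference.split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


# A model that assigns probability 0.25 to every reference token.
print(perplexity([math.log(0.25)] * 4))
print(rouge_l_f1("the cat sat on the mat", "the cat lay on the mat"))
```

Lower perplexity means the compressed cache still assigns high probability to the reference text, and higher ROUGE means the generated text overlaps more with the output produced using the full cache or with a reference answer, depending on the evaluation setup.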

Citation

If you find this repository useful, please consider citing:

@article{liu2024elastic,
title={Efficient Inference of Vision Instruction-Following Models with Elastic Cache},
author={Liu, Zuyan and Liu, Benlin and Wang, Jiahui and Dong, Yuhao and Chen, Guangyi and Rao, Yongming and Krishna, Ranjay and Lu, Jiwen},
journal={arXiv preprint arXiv:2407.18121},
year={2024}
}
