Zuyan Liu*,1 Yuhao Dong*,1 Yongming Rao2,✉ Jie Zhou1 Jiwen Lu1,✉
1Tsinghua University 2Tencent * Equal Contribution ✉ Corresponding Author
Project Page | Arxiv Paper | Huggingface Model
Chain-of-Spot (CoS) encourages Large Vision-Language Models to identify the key region of interest (ROI) in the image conditioned on the posed question or instruction, and to reason in an interactive manner.
This technique allows VLMs to access more detailed visual information without altering the original image resolution, thereby offering multi-granularity image features and improving visual understanding.
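For intuition, the sketch below outlines the two-step interactive procedure described above. It is only an illustrative outline, not the repository's inference code: the `ask` helper, the prompt wording, and the normalized box format are assumptions for illustration.

```python
# Illustrative sketch of Chain-of-Spot's two-step interactive reasoning.
# The `ask` helper, prompt wording, and box format are assumptions for
# illustration; see the paper and training code for the actual recipe.
from PIL import Image


def ask(model, images: list[Image.Image], prompt: str) -> str:
    """Placeholder for a single LVLM generation call over one or more images."""
    raise NotImplementedError("hook up your LVLM inference here")


def chain_of_spot(model, image: Image.Image, question: str) -> str:
    # Step 1: ask the model where the question-relevant region of interest is.
    roi_prompt = (
        f"{question}\nTo answer the question, which region of the image "
        "should we focus on? Reply with a normalized box [x0, y0, x1, y1]."
    )
    box = [float(v) for v in ask(model, [image], roi_prompt).strip("[] ").split(",")]

    # Step 2: crop the ROI from the full-resolution image and answer the
    # question conditioned on both the global view and the zoomed-in crop.
    w, h = image.size
    x0, y0, x1, y1 = box
    roi = image.crop((int(x0 * w), int(y0 * h), int(x1 * w), int(y1 * h)))
    answer_prompt = f"{question}\n(The cropped region of interest is attached as a second image.)"
    return ask(model, [image, roi], answer_prompt)
```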
[2024-03]
- 🤗 Introducing our project homepage: https://sites.google.com/view/chain-of-spot
- 🤗 Check out our paper introducing Chain-of-Spot in detail.
- 🤗 Check out our model on Hugging Face.
- Environmental Setup: We choose LLaVA-1.5 as our base model. Run the following commands to set up your environment for Chain-of-Spot evaluation:

  ```bash
  git clone https://github.com/dongyh20/Chain-of-Spot.git
  cd Chain-of-Spot
  conda create -n cos python=3.10 -y
  conda activate cos
  pip install -e .
  ```
  For Chain-of-Spot fine-tuning from LLaVA-1.5, additionally install the training dependencies (a quick import check follows):

  ```bash
  pip install -e ".[train]"
  pip install flash-attn --no-build-isolation
  ```
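  As an optional sanity check, the snippet below simply verifies that the editable install and flash-attn can be imported; the package names `llava` and `flash_attn` follow from the commands above.

  ```python
  # Quick check that the editable install (`llava`) and flash-attn import correctly.
  import llava        # installed via `pip install -e .`
  import flash_attn   # installed via `pip install flash-attn --no-build-isolation`

  print("llava from:", llava.__file__)
  print("flash-attn version:", flash_attn.__version__)
  ```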
- Initial Weights: We use LLaVA-1.5-7B and LLaVA-1.5-13B for fine-tuning. Download these models and put them in the `./checkpoint` folder (a minimal download sketch is given below).
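  One minimal way to fetch the base checkpoints is via `huggingface_hub`; the sketch below assumes the public `liuhaotian/llava-v1.5-7b` and `liuhaotian/llava-v1.5-13b` repositories and the `./checkpoint` layout described above.

  ```python
  # Sketch: download the LLaVA-1.5 base weights into ./checkpoint.
  # Assumes the public liuhaotian/llava-v1.5-* repositories on Hugging Face.
  from huggingface_hub import snapshot_download

  for repo_id in ["liuhaotian/llava-v1.5-7b", "liuhaotian/llava-v1.5-13b"]:
      local_dir = f"./checkpoint/{repo_id.split('/')[-1]}"
      snapshot_download(repo_id=repo_id, local_dir=local_dir)
      print(f"Downloaded {repo_id} -> {local_dir}")
  ```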
- Download Data: The dataset structure is the same as that used in LLaVA, and we provide JSON files to convert the original LLaVA training dataset into ours (see the next step). To download the data correctly, please check the instructions. After downloading everything, organize the data in `./playground/data` as follows (a quick layout check is sketched after the tree):

  ```
  ├── coco
  │   └── train2017
  ├── gqa
  │   └── images
  ├── ocr_vqa
  │   └── images
  ├── textvqa
  │   └── train_images
  └── vg
      ├── VG_100K
      └── VG_100K_2
  ```
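  The short script below is a quick sanity check for the tree above; the folder names are taken directly from it.

  ```python
  # Sketch: verify that ./playground/data matches the expected layout above.
  from pathlib import Path

  EXPECTED = [
      "coco/train2017",
      "gqa/images",
      "ocr_vqa/images",
      "textvqa/train_images",
      "vg/VG_100K",
      "vg/VG_100K_2",
  ]

  root = Path("./playground/data")
  missing = [p for p in EXPECTED if not (root / p).is_dir()]
  print("All image folders found." if not missing else f"Missing: {missing}")
  ```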
- Training Data Preparations: We build on the excellent LRP++ method to detect the ROI corresponding to each question or instruction. You can directly download our generated dataset from Google Drive to reproduce our results, or follow the Notebook to prepare your own data (an illustrative record sketch follows below).
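  For intuition only, the sketch below shows one way a detected ROI box could be attached to an LLaVA-style conversation record to form a two-turn Chain-of-Spot sample. The field names and prompt wording are illustrative assumptions; the released JSON files and Notebook define the actual format.

  ```python
  # Sketch: turn an LLaVA-style record plus a detected ROI box into a
  # two-turn Chain-of-Spot sample. Field names and prompt wording are
  # illustrative assumptions, not the released data format.
  def to_cos_record(record: dict, roi_box: list[float]) -> dict:
      """record: {"id", "image", "conversations": [{"from", "value"}, ...]}
      roi_box: normalized [x0, y0, x1, y1] produced by the ROI detector."""
      question = record["conversations"][0]["value"]
      answer = record["conversations"][1]["value"]
      box_str = "[" + ", ".join(f"{v:.2f}" for v in roi_box) + "]"
      return {
          "id": record["id"],
          "image": record["image"],
          "conversations": [
              # Turn 1: ask for the region of interest.
              {"from": "human",
               "value": f"{question}\nTo answer the question, where is the region of interest in the image?"},
              {"from": "gpt", "value": box_str},
              # Turn 2: answer conditioned on the full image and the ROI crop.
              {"from": "human",
               "value": f"{question}\n(The cropped region of interest is provided as an additional image.)"},
              {"from": "gpt", "value": answer},
          ],
      }
  ```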
- Evaluations on Various Benchmarks: We follow the Evaluation Docs in LLaVA to conduct our experiments. If you find that process laborious, check LMMs-Eval for faster evaluation.
- Start Training! The fine-tuning process takes around 20 hours on 8×A100 (80G) GPUs for LLaVA-1.5-13B. We fine-tune LLaVA-1.5 using DeepSpeed ZeRO-3; launch training directly with:

  ```bash
  bash ./scripts/v1_5/finetune_CoS_13b.sh
  ```
Contact: Leave an issue or contact liuzuyan19@gmail.com and dongyh20@mails.tsinghua.edu.cn. We will respond as soon as possible.
Our Chain-of-Spot (CoS) consistently improves vanilla LLaVA-1.5 on all benchmarks across different language model sizes. The best results are highlighted in bold.
Method | Language | VQA-v2 | GQA | VizWiz | SQA | Text-VQA | OKVQA |
---|---|---|---|---|---|---|---|
LLaVA-1.5-7B | Vicuna-7B | 78.5 | 62.0 | 50.0 | 66.8 | 58.2 | 57.9 |
LLaVA-1.5-7B + CoS | Vicuna-7B | 80.7 | 63.7 | 50.8 | 68.2 | 60.9 | 58.4 |
LLaVA-1.5-13B | Vicuna-13B | 80.0 | 63.3 | 53.6 | 71.6 | 61.3 | 60.9 |
LLaVA-1.5-13B + CoS | Vicuna-13B | **81.8** | **64.8** | **58.0** | **71.9** | **62.4** | **62.9** |
LLaVA-1.5 with Chain-of-Spot (CoS) achieves state-of-the-art performance on all the multimodal benchmarks, surpassing other LVLMs by a large margin. The best results are highlighted in bold.
Method | Language | SEED | SEED_Img | MME | MMB | POPE | MM-Vet |
---|---|---|---|---|---|---|---|
LLaVA-1.5-7B | Vicuna-7B | 58.6 | 66.1 | 1510.7 | 64.3 | 85.9 | 30.5 |
LLaVA-1.5-7B + CoS | Vicuna-7B | 59.7 | 67.1 | 1501.1 | 64.4 | **86.4** | 30.8 |
LLaVA-1.5-13B | Vicuna-13B | 61.6 | 68.2 | 1531.3 | 67.7 | 85.9 | 35.4 |
LLaVA-1.5-13B + CoS | Vicuna-13B | **62.3** | **69.6** | **1546.1** | **68.2** | 86.1 | **37.6** |
Visualizations of Chain-of-Spot. Chain-of-Spot identifies reasonable regions of interest conditioned on the given questions.
Generation comparisons after applying Chain-of-Spot. Chain-of-Spot corrects both the focus and the answers of the LLaVA model on complex visual question cases.
If you find this repository useful, please consider citing:
@article{liu2024chain,
title={Chain-of-Spot: Interactive Reasoning Improves Large Vision-Language Models},
author={Liu, Zuyan and Dong, Yuhao and Rao, Yongming and Zhou, Jie and Lu, Jiwen},
journal={arXiv preprint arXiv:2403.12966},
year={2024}
}
We thank the LLaVA team for their great contribution to the open-source VLM community.