School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen
*Corresponding author
IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2024
[Paper] [Project Page] [Video(YouTube)] [Video(bilibili)]
🔥 Details will be released. Stay tuned 🍻 👍
- [07/2024] Code and checkpoints are released.
- [02/2024] LION has been accepted by CVPR 2024.
- [11/2023] arXiv paper released.
- [11/2023] Project page released.
This is the GitHub repository of LION: Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge. In this work, we enhance MLLMs by integrating fine-grained spatial-aware visual knowledge and high-level semantic visual evidence, which boosts their capabilities and alleviates hallucinations.
The framework of the proposed LION model:
```bash
git clone https://github.com/JiuTian-VL/JiuTian-LION.git
cd JiuTian-LION
conda create -n LION python=3.12
conda activate LION
conda install pip
pip install -r requirements.txt
```
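If the installation succeeded, a quick sanity check can confirm the environment is usable. This assumes the requirements include PyTorch with CUDA support, which is typical for MLLM repos but not stated explicitly here:

```python
# Sanity check for the freshly created LION environment.
# Assumption: requirements.txt installs PyTorch (typical for MLLM repos).
import torch

print(torch.__version__)           # installed PyTorch version
print(torch.cuda.is_available())   # True if a CUDA GPU is visible
```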
| Version | Checkpoint |
| --- | --- |
| LION-FlanT5-XL | daybreaksly/LION-FlanT5-XL |
| LION-FlanT5-XXL | daybreaksly/LION-FlanT5-XXL |
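The checkpoint identifiers above look like Hugging Face Hub repository IDs. If that assumption holds, they can be fetched programmatically with `huggingface_hub`; otherwise, follow whatever download instructions accompany the release:

```python
# Hypothetical download sketch: assumes the checkpoints in the table above
# are hosted on the Hugging Face Hub under the listed repository IDs.
from huggingface_hub import snapshot_download

# Download every file in the repo into the local HF cache; returns the path.
local_dir = snapshot_download(repo_id="daybreaksly/LION-FlanT5-XL")
print(local_dir)
```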
- Download the pre-trained ViT model eva_vit_g.
- Download the pre-trained RAM model ram_swin_large_14m.
- Download the pre-trained FlanT5 model FlanT5-XL.
- Download the pre-trained BERT model bert-base-uncased.
- Fill in the paths to these models at the corresponding locations in the config file `configs/models/lion_flant5xl.yaml` (a quick path check is sketched below).
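Before launching anything, the filled-in config can be checked for missing files. The snippet below is a hedged sketch: it does not assume any particular key names in `lion_flant5xl.yaml`; it simply walks the YAML tree and flags string values that look like nonexistent local paths (requires PyYAML):

```python
# Hypothetical path check for configs/models/lion_flant5xl.yaml.
# It makes no assumption about key names: it walks the whole YAML tree
# and reports string values that look like file paths but do not exist.
import os
import yaml  # PyYAML

with open("configs/models/lion_flant5xl.yaml") as f:
    cfg = yaml.safe_load(f)

def check_paths(node, prefix=""):
    """Recursively flag string values that resemble missing local paths."""
    if isinstance(node, dict):
        for key, value in node.items():
            check_paths(value, f"{prefix}{key}.")
    elif isinstance(node, str) and ("/" in node or node.endswith((".pth", ".bin"))):
        if not os.path.exists(node):
            print(f"missing: {prefix[:-1]} -> {node}")

check_paths(cfg)
```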
We provide inference examples for Image-Level and Region-Level tasks in `playground.ipynb`.
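For readers who want the shape of an inference call without opening the notebook, here is a minimal captioning sketch. It is modeled on the LAVIS-style `load_model_and_preprocess` convention; every identifier (the import path, the model name `lion_flant5xl`, the prompt) is a placeholder, and `playground.ipynb` remains the authoritative reference:

```python
# Hypothetical image-captioning sketch in the LAVIS API style.
# All names below are placeholders; see playground.ipynb for the real API.
import torch
from PIL import Image
from lavis.models import load_model_and_preprocess  # placeholder import path

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the model together with its matching image processors.
model, vis_processors, _ = load_model_and_preprocess(
    name="lion_flant5xl", model_type="default", is_eval=True, device=device
)

# Image-level task: caption a single image.
raw_image = Image.open("demo.jpg").convert("RGB")
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)
print(model.generate({"image": image, "prompt": "Describe the image."}))
```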
For image-level tasks, we focus on image captioning and visual question answering (VQA). For region-level tasks, we evaluate LION on three referring expression comprehension (REC) datasets: RefCOCO, RefCOCO+, and RefCOCOg. The results, detailed in Tables 1 and 2, highlight LION's superior performance compared to baseline models.

We further evaluate LION on an object hallucination benchmark (POPE) and a widely used MLLM benchmark (MMBench). The results in Tables 1 and 2 show that LION performs strongly across a range of skills and is notably resistant to hallucination, particularly in the popular and adversarial settings of POPE.
If you find this work useful for your research, please cite our paper:
```bibtex
@inproceedings{chen2024lion,
  title={LION: Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge},
  author={Chen, Gongwei and Shen, Leyang and Shao, Rui and Deng, Xiang and Nie, Liqiang},
  booktitle={IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2024}
}
```