Chen Li¹, Yixiao Ge¹, Dian Li², and Ying Shan¹
¹ARC Lab, Tencent PCG
²Foundation Technology Center, Tencent PCG
This paper reviews works related to vision-language instruction tuning (VLIT). We will periodically update the list of recent public VLIT datasets and the VLIT data constructed with the pipeline proposed in this paper.
- Release New Vision-Language Instruction Data (periodically) ...
- Update Public VLIT Datasets and Related Work (periodically) ...
- Release Construction Tools
- [2023.11.16] Release Instruction Data
- [2023.11.15] Paper Released (ArXiv)
Existing VLIT generation schemes can currently be divided into two categories. Annotation Adaption mainly relies on directly adjusting and rewriting existing annotation data to fit the VLIT data template. Self-Instruct relies on a Large Language Model (LLM) to synthesize annotation data from more sources and reorganize it, generating VLIT data with greater diversity and complexity (although this also introduces more noise and hallucination).
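The sketch below contrasts the two schemes in spirit. It is a minimal illustration, not the paper's actual pipeline: the template list, the `annotation_adaption` and `self_instruct` helpers, and the `llm` callable are all assumptions made for the example.

```python
# Minimal sketch contrasting the two construction schemes described above.
# Every name here (templates, helpers, the `llm` callable) is illustrative,
# not part of any released codebase.
import random

CAPTION_TEMPLATES = [
    "Describe the image briefly.",
    "What is shown in this picture?",
]

def annotation_adaption(caption: str) -> dict:
    """Rewrite an existing caption annotation into a VLIT instruction-response pair."""
    return {"instruction": random.choice(CAPTION_TEMPLATES), "response": caption}

def self_instruct(annotations: list[str], llm) -> dict:
    """Ask an LLM to synthesize a multi-round conversation from several annotation sources."""
    prompt = (
        "Given the following image annotations, write a multi-round "
        "question-answer conversation about the image:\n"
        + "\n".join(f"- {a}" for a in annotations)
    )
    # `llm` stands for any text-generation callable; its output can contain
    # noise and hallucination, as noted above.
    return {"conversations": llm(prompt)}

if __name__ == "__main__":
    print(annotation_adaption("A dog running on the beach."))
    print(self_instruct(
        ["A dog running on the beach.", "bbox: dog [12, 40, 200, 180]"],
        llm=lambda p: [{"question": "What animal is shown?", "answer": "A dog."}],
    ))
```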
VLIT Data
├── General Instruction
│   ├── Annotation Adaption
│   └── Self-Instruct
├── Specific Instruction
│   ├── Object/Task-Specific
│   │   ├── Region
│   │   ├── Video
│   │   └── Text
│   └── Domain-Specific
│       ├── Medicine
│       ├── Document
│       └── PointCloud
├── Construction Tools
└── Data Mixing
If anything is missing, please notify us by email (palchenli@tencent.com), and we will update it as soon as possible.
In this paper, we propose a vision-language instruction construction pipeline and use it to generate a corresponding dataset. Specifically, the generated instruction data consists of multi-round question answering about a given image. Here are some examples of the generated instruction data:
There are three types of instruction data (Global, Negative, and Region). The data statistics and download links are as follows.
| Data Type | Baidu Cloud | Google Drive | Huggingface |
|---|---|---|---|
| COCO_2014_Images | url | url | url |
| Global | url | url | url |
| Negative | url | url | url |
| Region | url | url | url |
| Region_Images | url | url | url |
Each data file is organized in the following JSON structure:
{
"image_source": "",
"construction_time": "",
"annotations": [
{
"img_ids": "",
"instruction_type": "",
"conversations": []
},
{
"img_ids": "",
"instruction_type": "",
"conversations": []
}
]
}
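As a quick orientation, the snippet below loads one of the released files and walks the structure above. It is a minimal sketch: the file name is a placeholder, and it assumes each file holds a single record shaped like the schema shown (if a file instead holds a list of such records, iterate over it).

```python
# Minimal sketch of reading a released data file, assuming the JSON
# structure shown above. "global_instructions.json" is a placeholder name.
import json

with open("global_instructions.json", "r", encoding="utf-8") as f:
    data = json.load(f)

print(data["image_source"], data["construction_time"])
for ann in data["annotations"]:
    # Each annotation links image ids and an instruction type to a
    # multi-round conversation.
    print(ann["img_ids"], ann["instruction_type"], len(ann["conversations"]))
```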
If you find this repository useful, please consider citing:
@article{li2023visionlanguage,
title={Vision-Language Instruction Tuning: A Review and Analysis},
author={Chen Li and Yixiao Ge and Dian Li and Ying Shan},
year={2023},
eprint={2311.08172},
archivePrefix={arXiv},
primaryClass={cs.MM}
}
We would like to thank LLaVA, LAVIS, and OpenFlamingo for their well-architected multi-modal LLMs. Thanks to SEED-Bench for providing an open-source and convenient benchmark for evaluating MLLMs.