This is the repository of Visual Instruction Tuning towards General-Purpose Multimodal Model: A Survey, a systematic review of visual instruction tuning. For details, please refer to:
Visual Instruction Tuning towards General-Purpose Multimodal Model: A Survey
[Paper]
Traditional computer vision generally solves each task independently with a dedicated model, with the task instruction implicitly designed into the model architecture. This gives rise to two limitations: (1) it leads to task-specific models, which require multiple models for different tasks and restrict the potential synergies across diverse tasks; (2) it leads to a pre-defined and fixed model interface that has limited interactivity and adaptability in following users' task instructions. To address these limitations, Visual Instruction Tuning (VIT), which finetunes a large vision model with language as task instructions, has been intensively studied recently. It aims to learn, from a wide range of vision tasks described by language instructions, a general-purpose multimodal model that can follow arbitrary instructions and thus solve arbitrary tasks specified by the user. This work aims to provide a systematic review of visual instruction tuning, covering (1) the background, which presents computer vision task paradigms and the development of VIT; (2) the foundations of VIT, which introduce commonly used network architectures, visual instruction tuning frameworks and objectives, and evaluation setups and tasks; (3) the commonly used datasets in visual instruction tuning and evaluation; (4) a review of existing VIT methods that categorizes them with a taxonomy according to both the studied vision task and the method design, and highlights their major contributions, strengths, and shortcomings; (5) a comparison and discussion of VIT methods over various instruction-following benchmarks; (6) several challenges, open directions, and possible future work in visual instruction tuning research.
If you find our work useful in your research, please consider citing:
@article{huang2023visual,
title={Visual Instruction Tuning towards General-Purpose Multimodal Model: A Survey},
author={Huang, Jiaxing and Zhang, Jingyi and Jiang, Kai and Qiu, Han and Lu, Shijian},
journal={arXiv preprint arXiv:2312.16602},
year={2023}
}