GitHub - jingyi0000/Awesome-Visual-Instruction-Tuning: Visual Instruction Tuning towards General-Purpose Multimodal Model: A Survey

Visual Instruction Tuning towards General-Purpose Multimodal Model: A Survey

This is the repository of Visual Instruction Tuning towards General-Purpose Multimodal Model: A Survey, a systematic review of visual instruction tuning. For details, please refer to:

Visual Instruction Tuning towards General-Purpose Multimodal Model: A Survey
[Paper]

Abstract

Traditional computer vision generally solves each single task independently by a dedicated model with the task instruction implicitly designed in the model architecture, giving rise to two limitations: (1) it leads to task-specific models, which require multiple models for different tasks and restrict the potential synergies from diverse tasks; (2) it leads to a pre-defined and fixed model interface that has limited interactivity and adaptability in following users' task instructions. To address them, Visual Instruction Tuning (VIT) has been intensively studied recently. VIT finetunes a large vision model with language as task instructions, aiming to learn, from a wide range of vision tasks described by language instructions, a general-purpose multimodal model that can follow arbitrary instructions and thus solve arbitrary tasks specified by the user. This work aims to provide a systematic review of visual instruction tuning, covering (1) the background that presents computer vision task paradigms and the development of VIT; (2) the foundations of VIT that introduce commonly used network architectures, visual instruction tuning frameworks and objectives, and evaluation setups and tasks; (3) the commonly used datasets in visual instruction tuning and evaluation; (4) the review of existing VIT methods that categorizes them with a taxonomy according to both the studied vision task and the method design, and highlights their major contributions, strengths, and shortcomings; (5) the comparison and discussion of VIT methods over various instruction-following benchmarks; (6) several challenges, open directions, and possible future work in visual instruction tuning research.

Citation

If you find our work useful in your research, please consider citing:

@article{huang2023visual,
  title={Visual Instruction Tuning towards General-Purpose Multimodal Model: A Survey},
  author={Huang, Jiaxing and Zhang, Jingyi and Jiang, Kai and Qiu, Han and Lu, Shijian},
  journal={arXiv preprint arXiv:2312.16602},
  year={2023}
}

Datasets

Datasets for Visual Instruction Tuning

Menu

Datasets

Datasets for Visual Instruction Tuning

Datasets for Instruction-tuned Model Evaluation

Visual Instruction Tuning Methods

Instruction-based Image Learning

Instruction-based Image Learning for Discriminative Tasks

Instruction-based Image Learning for Generative Tasks

Instruction-based Image Learning for Complex Reasoning Tasks

Instruction-based Video Learning

Instruction-based 3D Vision Learning

Instruction-based Medical Vision Learning

Instruction-based Document Vision Learning
