Skip to content

palchenli/VL-Instruction-Tuning

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

30 Commits
Β 
Β 
Β 
Β 

Repository files navigation

Vision-Language Instruction Tuning: A Review and Analysis


Chen Li1, Yixiao Ge1, Dian Li2, and Ying Shan1.

1ARC Lab, Tencent PCG
2Foundation Technology Center, Tencent PCG

Oryx Video-ChatGPT

This paper is a review of all the works related to vision-language instruction tuning (VLIT). We will periodically update the recent public VLIT dataset and the VLIT data constructed by the pipeline in this paper.


πŸ“† Schedule

  • Release New Vision-Language Instruction Data (periodically) ...
  • Update Public VLIT Datasets and Related Work (periodically) ...
  • Release Construction Tools
  • [2023.11.16] Release Instruction Data
  • [2023.11.15] Paper Released (ArXiv)

🏷️ Catalogue

  1. Existing VLIT Data
  2. VLIT Data Constructed in This Paper

πŸ—’οΈ Existing VLIT Dataset

Currently, the existing VLIT generation schemes can be divided into two categories, among which Annotation Adaption mainly relies on directly adjusting and rewriting the existing annotation data to adapt to the VLIT data template. Self-Instruct relies on the Large Language Model (LLM) to synthesize annotation data from more sources and reorganize it to generate VLIT data with more diversity and complexity (of course, it also brings more noise and hallucination).

VLIT Data
β”œβ”€ General Instruction
β”‚   β”œβ”€ Annotation Adaption
β”‚   └─ Self-Instruct
β”œβ”€ Specific Instruction
β”‚   β”œβ”€ Object/Task-Specific
β”‚   β”‚   β”œβ”€ Region
β”‚   β”‚   β”œβ”€ Video
β”‚   β”‚   └─ Text
β”‚   └─ Domain-Specific
β”‚       β”œβ”€ Medicine
β”‚       β”œβ”€ Document
β”‚       └─ PointCloud
β”œβ”€ Construction Tools
└─ Data Mixing

Dataset

If there is any missing, please notify us by email(palchenli@tencent.com) and we will update as soon as possible.

Dataset MLLM Paper
... ... ...
ShareGPT4V - ShareGPT4V: Improving Large Multi-Modal Models with Better Captions
LVLM_NLF - DRESS: Instructing Large Vision-Language Models to Align and Interact with Humans via Natural Language Feedback
LVIS-INSTRUCT4V - To See is to Believe: Prompting GPT-4V for Better Visual Instruction Tuning
GranD GLaMM GLaMM: Pixel Grounding Large Multimodal Model
ComVint - What Makes for Good Visual Instructions? Synthesizing Complex Visual Reasoning Instructions for Visual Instruction Tuning
MiniGPT-v2 MiniGPT-v2 MiniGPT-v2: Large Language Model As a Unified Interface for Vision-Language Multi-task Learning
GRIT Ferret FERRET REFER AND GROUND ANYTHING ANYWHERE AT ANY GRANULARITY
SparklesDialogue-VG SparklesChat Sparkles: Unlocking Chats Across Multiple Images for Multimodal Instruction-Following Models
SparklesDialogue-CC SparklesChat Sparkles: Unlocking Chats Across Multiple Images for Multimodal Instruction-Following Models
MMC-Instruction - MMC: Advancing Multimodal Chart Understanding with Large-scale Instruction Tuning
InternLM-XComposer InternLM-XComposer InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition
AnyMAL AnyMAL AnyMAL: An Efficient and Scalable Any-Modality Augmented Language Model
DreamLLM DreamLLM DREAMLLM: SYNERGISTIC MULTIMODAL COMPREHENSION AND CREATION
TextBind TextBind TEXTBIND: Multi-turn Interleaved Multimodal Instruction-following in the Wild
PVIT PVIT Position-Enhanced Visual Instruction Tuning for Multimodal Large Language Models
T2M NExT-GPT NExT-GPT: Any-to-Any Multimodal LLM
MosIT NExT-GPT NExT-GPT: Any-to-Any Multimodal LLM
GPTVQA MLLM-DataEngine MLLM-DataEngine: An Iterative Refinement Approach for MLLM
LrvInstruction - Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning
CIEM - CIEM: Contrastive Instruction Evaluation Method for Better Instruction Tuning
PointLLM PointLLM PointLLM: Empowering Large Language Models to Understand Point Clouds
VIGC VIGC VIGC: Visual Instruction Generation and Correction
M-HalDetec - Detecting and Preventing Hallucinations in Large Vision Language Models
StableLLaVA StableLLaVA StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue Data
I4 Cheetor EMPOWERING VISION-LANGUAGE MODELS TO FOLLOW INTERLEAVED VISION-LANGUAGE INSTRUCTIONS
AS-1B ASM The All-Seeing Project: Towards Panoptic Visual Recognition and Understanding of the Open World
Multimodal_id_v1 LMEye(IPN) LMEye: An Interactive Perception Network for Large Language Models
Lynx Lynx What Matters in Training a GPT4-Style Language Model with Multimodal Inputs?
MGVLID ChatSpot ChatSpot: Bootstrapping Multimodal LLMs via Precise Referring Instruction Tuning
BuboGPT BuboGPT BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs
GRIT-20M KOSMOS-2 KOSMOS-2: Grounding Multimodal Large Language Models to the World
SVIT SVIT(MMLLM) SVIT: Scaling up Visual Instruction Tuning
GPT4RoI GPT4RoI GPT4RoI: Instruction Tuning Large Language Model on Region-of-Interest
PF-1M Clever Flamingo Visual Instruction Tuning with Polite Flamingo
Shikra-RD Shikra Shikra: Unleashing Multimodal LLM’s Referential Dialogue Magic
LLaVAR LLaVAR LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding
OphGLM OphGLM OphGLM: Training an Ophthalmology Large Language-and-Vision Assistant based on Instructions and Dialogue
LAMM LAMM LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark
MACAW-LLM MACAW-LLM Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration
InstructBLIP InstructBLIP InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning
MultiModal-GPT MultiModal-GPT MultiModal-GPT: A Vision and Language Model for Dialogue with Humans
Valley-Instruct-73 VALLEY VALLEY: VIDEO ASSISTANT WITH LARGE LANGUAGE MODEL ENHANCED ABILITY
Video-LLaMA Video-LLaMA Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding
MULTIINSTRUCT OFA(multiinstruct) MULTIINSTRUCT: Improving Multi-Modal Zero-Shot Learning via Instruction Tuning
Video-ChatGPT Video-ChatGPT Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models
MIMIC-IT Otter MIMIC-IT: Multi-Modal In-Context Instruction Tuning
M3IT Ying-VLM M3IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning
GPT4Tools GPT4Tools GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction
PMC-VQA MedVInT-TE/TD PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering
pandagpt_vid PandaGPT PandaGPT: One Model To Instruction-Follow Them All
MULTIS ChatBridge ChatBridge: Bridging Modalities with Large Language Model as a Language Catalyst
DetGPT DetGPT DetGPT: Detect What You Need via Reasoning
X-LLM X-LLM X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages
VideoChat VideoChat Videochat: Chat-centric video understanding
cc_sbu_align Mini-GPT4 Minigpt-4: Enhancing vision-language understanding with advanced large language models
LLaVA_instruct_150K LLaVA Visual Instruction Tuning

πŸ“‚ Multi-Modal Instruction Data Constructed in This Paper

In this paper, we propose a vision-language instruction construction pipeline and generate a corresponding dataset. Specifically, the generated instruction data is a multi-round question answering about a given image. Here are some examples of the generated instruction data:

Examples

Image Conversation
User: Can you describe the scene in the image?
AI: Sure, the image captures a large grizzly bear wading through a moving river. The bear has a brown coat and appears to be quite large and furry. It's currently not holding anything in its mouth and it doesn't seem to have caught any fish from the river.
User: Are grizzly bears typically found in such environments?
AI: Yes, indeed. Grizzly bears are often found near rivers or streams. They are excellent swimmers and often go into the water in search of food. During salmon spawning seasons, they are known to frequent river areas to catch the fish.
User: Is the bear possibly looking for food?
AI: That's a strong possibility. While there are no fish present in the river in the image, grizzly bears are known to forage for various food sources in rivers, including fish and other aquatic organisms. However, without further context, it's hard to say definitively.
User: What other characteristics does a grizzly bear have?
AI: Grizzly bears, which are a subspecies of brown bear, are known for their size, strength, and excellent swimming ability. They have a distinct hump over their shoulders which is a muscle mass used for digging and clawing. Their fur color can vary from very light tan to dark brown, with the fur on their back often being darker than the rest of their body. Also, despite their large size, grizzlies are quite fast and can run at speeds of up to 30 miles per hour.
User: What is the main animal in the picture?
AI: The main animal in the picture is a giraffe.
User: What is the giraffe doing?
AI: The giraffe is standing next to a tree and some rocks. It's not doing much, just hanging around.
User: Is the giraffe standing on the rocks?
AI: No, the giraffe is not standing on the rocks. It is standing next to them.
User: Can you describe the setting in more detail?
AI: The picture mainly focuses on the giraffe, a tree, and some rocks. The giraffe is standing right next to the tree. There are rocks scattered around them.
User: Are there any elephants in the picture?
AI: No, there are no elephants in this picture. Only a single giraffe is present.
User: Are there any unique features of the giraffe?
AI: As per the details provided in the image, the giraffe is standing and has a spotted pattern on its skin, which is a characteristic feature of giraffes.
User: Is the giraffe sitting on the rocks?
AI: No, the giraffe is not sitting on the rocks. It's standing by them.
User: Is the giraffe eating the tree leaves?
AI: There's no indication in the image that the giraffe is eating the leaves of the tree. It's simply standing beside it.
User: Where do giraffes usually live?
AI: Giraffes are native to Africa and they usually inhabit savannas, grasslands, and open woodlands. However, in this particular image, the exact location isn't specified.
User: Is there a car in the picture?
AI: Yes, there is a car in the picture and it appears to be parked.
User: What can you tell me about the car?
AI: Based on the image, the car is parked, which suggests that it is not currently in use. However, I can't provide specifics about its make, model, or color without more visual information.
User: Do you think the car is related to the boy on the skateboard?
AI: It's hard to definitively state a connection between the car and the boy on the skateboard based on the image alone. The car may belong to someone who is watching the boy, or it may just be parked in the vicinity where the boy is skateboarding.
User: Is the car moving?
AI: No, the car isn't moving. It's parked.
User: Could the car potentially pose a danger to the skateboarder?
AI: Since the car is parked, it's unlikely to pose a direct threat to the skateboarder. However, any vehicle, parked or moving, in the vicinity of skateboarders could potentially become a hazard, especially if the skateboarder loses control. This emphasizes the importance of skateboarding in safe, designated areas.
User: Is the car possibly there for a skateboarding event?
AI: Based on the information available, I can't definitely say if the car is there for a skateboarding event. It could belong to a spectator or participant, or it could be unrelated to the skateboarding activity. More context would be needed to make a clear conclusion.

There are three different types of instruction data. The data statistics and download links are as follows.

Download Links

Data Type Baidu CLoud Google Drive Huggingface
COCO_2014_Images url url url
Global url url url
Negative url url url
Region url url url
Region_Images url url url

Data Format

{
    "image_source": "",
    "construction_time": "",
    "annotations": [
      {
        "img_ids": "",
        "instruction_type": "",
        "conversations": []
      },
      
      {
        "img_ids": "",
        "instruction_type": "",
        "conversations": []
      }
    ]
}

πŸ“Ž Citation

If you found this repository useful, please consider citing:

@article{li2023visionlanguage,
      title={Vision-Language Instruction Tuning: A Review and Analysis}, 
      author={Chen Li and Yixiao Ge and Dian Li and Ying Shan},
      year={2023},
      eprint={2311.08172},
      archivePrefix={arXiv},
      primaryClass={cs.MM}
}

πŸ‘πŸ» Acknowledgement

We would like to thank LLaVA, LAVIS and OpenFlamingo for their well-architcated multi-modal LLMs. Thanks to SEED-Bench for being an open source and convenient benchmark for evaluating MLLMs.

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published