Instruction-Guided Visual Masking

[📚paper] [project page] [🤗Dataset] [🤗model]

🔥News [2024.09.26] IVM has been accepted to NeurIPS 2024.

🔥News [2024.07.21] IVM has been selected as an outstanding paper at the MFM-EAI workshop @ ICML 2024.

Introduction

We introduce Instruction-guided Visual Masking (IVM), a versatile visual grounding model that is compatible with diverse multimodal models, such as LMMs and robot models. By constructing visual masks over instruction-irrelevant regions, IVM-enhanced multimodal models can focus on task-relevant image regions and better align with complex instructions. Specifically, we design a visual masking data generation pipeline and create the IVM-Mix-1M dataset with 1 million image-instruction pairs. We further introduce a new learning technique, Discriminator Weighted Supervised Learning (DWSL), for preferential IVM training that prioritizes high-quality data samples. Experimental results on generic multimodal tasks such as VQA and embodied robotic control demonstrate the versatility of IVM: as a plug-and-play tool, it significantly boosts the performance of diverse multimodal models.

Figure: Duck on green plate | Red cup on red plate | Red cup on red plate | Red cup on silver pan | Red cup on silver pan
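
The core idea is to suppress instruction-irrelevant pixels so a downstream model only sees task-relevant content. The snippet below is purely illustrative (the function name, threshold semantics, and gray fill are assumptions for exposition, not the model's actual post-processing); the real grounding is performed by the IVM model shown in the Usage section.

    # Illustrative sketch only: turning a soft relevance map into a visual mask that
    # grays out instruction-irrelevant regions. The function name, threshold semantics,
    # and gray fill are assumptions, not IVM's actual post-processing.
    import numpy as np

    def apply_soft_mask(image: np.ndarray, relevance: np.ndarray, threshold: float = 0.99) -> np.ndarray:
        """image: (H, W, 3) uint8 array; relevance: (H, W) floats in [0, 1]."""
        keep = (relevance >= threshold)[..., None]      # True where the pixel is task-relevant
        background = np.full_like(image, 128)           # neutral gray for irrelevant regions
        return np.where(keep, image, background).astype(np.uint8)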


Quick Start

Install

1. Clone this repository and navigate to the IVM folder:

        git clone https://github.com/2toinf/IVM.git
        cd IVM

2. Install the package:

        conda create -n IVM python=3.10 -y
        conda activate IVM
        pip install -e .

Usage

    from PIL import Image
    from matplotlib import pyplot as plt
    import numpy as np

    from IVM import load, forward_batch

    ckpt_path = "IVM-V1.0.bin"  # your model path here
    model = load(ckpt_path, low_gpu_memory=False)  # set low_gpu_memory=True if you don't have enough GPU memory

    image = Image.open("image/demo/robot.jpg")  # your image path
    instruction = "pick up the red cup and place it on the green pan"

    # forward_batch takes lists of images and paired instructions
    result = forward_batch(model, [image], [instruction], threshold=0.99)

    plt.imshow(result[0].astype(np.uint8))  # result[0] is the masked image for the first pair
    plt.show()
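
The masked output can also be written back to disk so it can be fed to a downstream multimodal model. A small sketch, continuing from the snippet above (the output filename is arbitrary; the array layout follows the plotting code):

    # Sketch: persist the masked image for downstream use. Assumes `result` from the
    # snippet above, where result[0] is an (H, W, 3) array.
    from PIL import Image
    import numpy as np

    Image.fromarray(result[0].astype(np.uint8)).save("robot_masked.jpg")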

For more interesting cases, please refer to demo.ipynb.

Model Zoo

| Models | Base model | Params (M) | Iters | ckpt |
| --- | --- | --- | --- | --- |
| IVM-V1.0 | LLaVA-1.5-7B + SAM-H | 64 | 1M | HF-link |

We welcome everyone to explore more IVM training methods and to scale it up further!

Evaluation

Please first preprocess the test images using our IVM model, then follow the official instructions for evaluation.
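
For example, a minimal preprocessing loop might look like the sketch below (not an official script: the directory layout, the output location, and the per-image instruction source are assumptions that depend on the benchmark):

    # Sketch of benchmark preprocessing: run IVM over a folder of test images and save
    # masked copies, which then replace the originals in the official evaluation pipeline.
    # The paths and the placeholder instruction are assumptions.
    from pathlib import Path

    import numpy as np
    from PIL import Image

    from IVM import load, forward_batch

    model = load("IVM-V1.0.bin", low_gpu_memory=False)

    src_dir = Path("benchmark/images")          # original test images (assumed layout)
    dst_dir = Path("benchmark/images_masked")   # masked copies used for evaluation
    dst_dir.mkdir(parents=True, exist_ok=True)

    for path in sorted(src_dir.glob("*.jpg")):
        image = Image.open(path).convert("RGB")
        instruction = "the benchmark question for this image"  # placeholder; use the real per-image instruction
        result = forward_batch(model, [image], [instruction], threshold=0.99)
        Image.fromarray(result[0].astype(np.uint8)).save(dst_dir / path.name)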

VQA-type benchmarks

V* Bench: https://github.com/penghao-wu/vstar?tab=readme-ov-file#evaluation

Traditional VQA benchmark: https://github.com/haotian-liu/LLaVA?tab=readme-ov-file#evaluation

Real-Robot

Policy Learning: https://github.com/Facebear-ljx/BearRobot

Robot Infrastructure: https://github.com/rail-berkeley/bridge_data_robot

IVM-Mix-1M Dataset

Please download the annotations of our IVM-Mix-1M dataset. We provide over 1M image-instruction pairs with corresponding mask labels. The IVM-Mix-1M dataset consists of three parts: HumanLabelData, RobotMachineData, and VQAMachineData. For HumanLabelData and RobotMachineData, we provide well-organized images, mask labels, and language instructions. For VQAMachineData, we only provide mask labels and language instructions; please download the images from the constituent datasets.

After downloading all of them, organize the data as follows:

    ├── coco
    │   ├── train2017
    │   └── train2014
    ├── gqa
    │   └── images
    ├── textvqa
    │   └── train_images
    ├── vg
    │   ├── VG_100K
    │   └── VG_100K_2
    ├── flickr30k
    │   └── images
    ├── vsr
    └── openimages

We provide sample code for reading the data as a reference.
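
As a rough illustration only (the annotation filename, field names, and mask encoding below are assumptions; the provided sample code is the authoritative reference for the real format), reading one image-instruction-mask triple might look like:

    # Rough illustration only: the annotation file, field names ("image", "instruction",
    # "mask"), and grayscale-PNG mask encoding are assumptions; refer to the provided
    # sample code for the actual IVM-Mix-1M format.
    import json

    import numpy as np
    from PIL import Image

    with open("HumanLabelData/annotations.json") as f:   # hypothetical annotation file
        annotations = json.load(f)

    entry = annotations[0]
    image = np.array(Image.open(entry["image"]).convert("RGB"))
    mask = np.array(Image.open(entry["mask"]).convert("L")) / 255.0  # soft mask in [0, 1]
    instruction = entry["instruction"]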

Acknowledgement

This work is built upon LLaVA, SAM, and LISA.

Citation

    @misc{zheng2024instructionguided,
        title={Instruction-Guided Visual Masking},
        author={Jinliang Zheng and Jianxiong Li and Sijie Cheng and Yinan Zheng and Jiaming Li and Jihao Liu and Yu Liu and Jingjing Liu and Xianyuan Zhan},
        year={2024},
        eprint={2405.19783},
        archivePrefix={arXiv},
        primaryClass={cs.CV}
    }