
Contextual Object Detection with Multimodal Large Language Models

S-Lab, Nanyang Technological University

🆕 Updates in Oct 2024

  • HuggingFace demo and model checkpoint, see here
  • The CODE dataset for evaluation, see here
  • ContextDET training scripts, see here (waiting to be cleaned up)

🌟 Contextual Object Detection

Recent Multimodal Large Language Models (MLLMs) excel at vision-language tasks such as image captioning and question answering, but they lack an essential perception ability: object detection. In this work, we address this limitation by introducing a new research problem, contextual object detection, i.e., understanding visible objects within different human-AI interactive contexts. We investigate three representative scenarios: the language cloze test, visual captioning, and visual question answering.

Comparison with Related Works

| Task | Language Input | Output(s) | Remark |
| --- | --- | --- | --- |
| Object Detection | / | box, class label | pre-defined class labels |
| Open-Vocabulary Object Detection | (optional) class names for CLIP | box, class label | pre-defined class labels |
| Referring Expression Comprehension | complete referring expression | box that the expression refers to | / |
| Contextual Cloze Test (ours) | incomplete expression with object names masked | {box, name} to complete the mask | name can be any valid English word |
| Image Captioning | / | language caption | / |
| Contextual Captioning (ours) | / | language caption, box | / |
| Visual Question Answering | language question | language answer | / |
| Contextual QA (ours) | language question | language answer, box | / |
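To make the cloze-test formulation concrete, the snippet below shows what a single input/output pair might look like. It is a hypothetical illustration; the field names and values are not the actual CODE dataset schema.

```python
# Hypothetical contextual cloze-test sample (illustrative field names,
# not the actual CODE dataset schema).
cloze_sample = {
    "image": "beach.jpg",
    # The object name in the expression is masked; the model must recover
    # the word together with the bounding box it grounds to.
    "masked_expression": "a [MASK] is running along the shoreline",
    # Expected prediction: the completed name plus its box ([x1, y1, x2, y2]).
    "target": {"name": "dog", "box": [120.0, 85.0, 260.0, 210.0]},
}
```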

😎 Method

We present ContextDET, a novel generate-then-detect framework specialized for contextual object detection. ContextDET is end-to-end and consists of three key architectural components:

  1. a visual encoder that extracts high-level image representations and computes visual tokens,
  2. a pre-trained LLM that decodes multimodal contextual tokens with a task-related multimodal prefix, and
  3. a visual decoder that predicts matching scores and bounding boxes for conditional queries linked to contextual object words.

This generate-then-detect design enables ContextDET to detect objects named by any word in the human vocabulary.
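A minimal sketch of how these three components might be wired together at inference time is shown below. The class and argument names are assumptions for illustration, not the repository's actual modules or API.

```python
import torch
from torch import nn


class ContextDETSketch(nn.Module):
    """Generate-then-detect sketch (illustrative only; not the repo's real classes)."""

    def __init__(self, visual_encoder: nn.Module, llm: nn.Module, visual_decoder: nn.Module):
        super().__init__()
        self.visual_encoder = visual_encoder   # extracts image features / visual tokens
        self.llm = llm                         # pre-trained language model
        self.visual_decoder = visual_decoder   # DETR-style head for scores and boxes

    def forward(self, image: torch.Tensor, prefix_tokens: torch.Tensor):
        # 1) Encode: high-level image representations become visual tokens.
        visual_tokens = self.visual_encoder(image)

        # 2) Generate: the LLM decodes contextual tokens conditioned on a
        #    task-related multimodal prefix (visual tokens + text prompt).
        contextual_tokens = self.llm(torch.cat([visual_tokens, prefix_tokens], dim=1))

        # 3) Detect: the generated object-word embeddings condition the decoder
        #    queries, which predict matching scores and bounding boxes.
        scores, boxes = self.visual_decoder(visual_tokens, contextual_tokens)
        return contextual_tokens, scores, boxes
```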

🥰 Qualitative Examples

💻 Try Demo

🤗 You can try our demo on HuggingFace Spaces. To avoid waiting in the queue and to speed up inference, consider duplicating the space and using GPU resources.

🤗 If you want to try the demo on your own computer with a GPU, follow these steps:

  1. Install the required Python packages:
     pip install -r requirements.txt
  2. Download the checkpoint file from the following URL and save it in a local directory.
  3. Run the demo:
     python app.py

You should then see the demo web page in your browser.

📝 Citation

If you find our work useful, please consider citing:

@article{zang2023contextual,
  author = {Zang, Yuhang and Li, Wei and Han, Jun and Zhou, Kaiyang and Loy, Chen Change},
  title = {Contextual Object Detection with Multimodal Large Language Models},
  journal = {arXiv preprint arXiv:2305.18279},
  year = {2023}
}

📋 License

This project is licensed under S-Lab License 1.0. Redistribution and use for non-commercial purposes should follow this license.

😃 Acknowledgement

We acknowledge the use of the following public code in this project: DETR, Deformable DETR, DETA, OV-DETR, and BLIP-2.

📧 Contact

If you have any questions, please feel free to contact Yuhang Zang (zang0012 AT ntu.edu.sg).
