This repo contains the code and data for our benchmark paper:
Multimodal Needle in a Haystack: Benchmarking Long-Context Capability of Multimodal LLMs
H. Wang, H. Shi, S. Tan, W. Qin, W. Wang, T. Zhang, A. Nambi, T. Ganu, H. Wang
Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2025.
[Paper] [MMNeedle Dataset]
To use this benchmark, please download the MMNeedle dataset at this link. Alternatively, you can construct your own version of MMNeedle by following the instructions below.
[2025-01-22] MMNeedle is accepted to NAACL 2025.
[2024-06-27] New project page set up for MMNeedle.
[2024-06-24] We released the leaderboard for Multimodal Long Context Understanding on Papers with Code!
[2024-06-17] We released the paper, code, and data for Multimodal Needle in a Haystack (MMNeedle) benchmark!
MMNeedle Evaluation Overview. Correct answers are marked with a checkmark (✓).
MMNeedle Evaluation Performance Comparison (Claude-3 refers to Claude 3 Opus, and Gemini-1.0/1.5 refers to Gemini Pro 1.0/1.5). The x-axis shows the different models, and the y-axis shows the settings with varying numbers of input images M and stitching sizes N. For each row, i.e., setting (M, N), we show the average accuracy (%) of each model. For each stitched image, the color of the cell at row r, column c indicates the accuracy of predicting the exact position for samples whose "needle" sub-image lies at position (r, c) of the stitched image. For the M=10 setting, we show the average accuracy of each location (r, c) over 10 images. Redder cells indicate lower accuracy, and greener cells indicate higher accuracy. The best result in each row is underlined.
conda env create -f context.yml
Download MS COCO and place the val2014 and annotations_trainval directories in the current directory.
python ./annotations_trainval/file_to_caption.py
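file_to_caption.py builds the mapping from each MS COCO image file name to its captions, which later serve as the textual descriptions of the "needle" images that models must locate. The sketch below shows the rough idea, assuming the standard captions_val2014.json layout; the actual paths and output format of file_to_caption.py may differ.

```python
# Hedged sketch: map COCO image file names to their captions.
# The annotation path and output file name are illustrative assumptions.
import json
from collections import defaultdict

with open("annotations_trainval/annotations/captions_val2014.json") as f:
    coco = json.load(f)

# image id -> file name (e.g., "COCO_val2014_000000391895.jpg")
id_to_file = {img["id"]: img["file_name"] for img in coco["images"]}

# file name -> list of human-written captions for that image
file_to_captions = defaultdict(list)
for ann in coco["annotations"]:
    file_to_captions[id_to_file[ann["image_id"]]].append(ann["caption"])

with open("file_to_caption.json", "w") as f:
    json.dump(file_to_captions, f, indent=2)
```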
python sample_images.py
python sample_stitched_images.py
python sample_single_needles.py
python sample_multiple_needles.py
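These scripts sample images from val2014, stitch them into N x N grids to form the long-context "haystack", and record the stitched image index and (row, column) of each "needle" sub-image for both the single-needle and multiple-needle settings. Below is a minimal sketch of the stitching step with an illustrative tile size and file handling; see sample_stitched_images.py for the settings MMNeedle actually uses.

```python
# Hedged sketch: paste n*n sampled images into an n x n grid (row-major order).
# The 256-pixel tile size and the example file names are assumptions.
from PIL import Image

def stitch(image_paths, n, cell_size=256):
    canvas = Image.new("RGB", (n * cell_size, n * cell_size))
    for idx, path in enumerate(image_paths[: n * n]):
        r, c = divmod(idx, n)  # grid position of this sub-image
        tile = Image.open(path).convert("RGB").resize((cell_size, cell_size))
        canvas.paste(tile, (c * cell_size, r * cell_size))
    return canvas

# Example: build a 4 x 4 stitched image from 16 sampled COCO images.
# stitch(sampled_paths, n=4).save("stitched_0.jpg")
```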
export BEGIN=0
export N_SEQ=1000
export N_NEEDLES=1
export MODEL_PROVIDER='Gemini'
bash test.sh
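Roughly, BEGIN and N_SEQ select which slice of the benchmark samples to run, N_NEEDLES switches between the single-needle (1) and multiple-needle (>1) settings, and MODEL_PROVIDER picks the API to call (e.g., 'Gemini'); see test.sh for the exact semantics. A small sketch of how these variables might be read on the Python side; the variable names match the exports above, but the surrounding logic is illustrative.

```python
# Hedged sketch: read the test configuration exported before running test.sh.
import os

begin = int(os.environ.get("BEGIN", 0))                # index of the first sample
n_seq = int(os.environ.get("N_SEQ", 1000))             # number of samples to test
n_needles = int(os.environ.get("N_NEEDLES", 1))        # 1 = single needle, >1 = multiple needles
provider = os.environ.get("MODEL_PROVIDER", "Gemini")  # which model API to call

print(f"Testing samples [{begin}, {begin + n_seq}) with {n_needles} needle(s) via {provider}")
```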
export BEGIN=0
export N_SEQ=1000
python evaluate.py
python evaluate_multi.py
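evaluate.py and evaluate_multi.py compute accuracies for the single-needle and multiple-needle settings, respectively; the headline metric is exact-position accuracy, i.e., a prediction counts as correct only when the image index, row, and column of the needle are all right. Below is a hedged sketch of that metric, with an assumed (image_index, row, column) tuple format for predictions and ground truth.

```python
# Hedged sketch of exact-position accuracy; the tuple format is an illustrative assumption.

def exact_accuracy(predictions, ground_truth):
    """predictions / ground_truth: lists of (image_index, row, column) tuples."""
    correct = sum(p == g for p, g in zip(predictions, ground_truth))
    return 100.0 * correct / len(ground_truth)

# Example: 2 of 3 needles located exactly -> 66.7% accuracy.
preds = [(3, 0, 1), (7, 2, 2), (5, 1, 0)]
gts   = [(3, 0, 1), (7, 2, 2), (5, 1, 1)]
print(f"{exact_accuracy(preds, gts):.1f}%")
```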
@misc{wang2024multimodal,
title={Multimodal Needle in a Haystack: Benchmarking Long-Context Capability of Multimodal Large Language Models},
author={Hengyi Wang and
Haizhou Shi and
Shiwei Tan and
Weiyi Qin and
Wenyuan Wang and
Tunyu Zhang and
Akshay Nambi and
Tanuja Ganu and
Hao Wang},
year={2024},
eprint={2406.11230},
archivePrefix={arXiv},
primaryClass={cs.LG}
}