
🔥 Updates

  • 2024/06/29: The instruction data for VideoChat2_HD is updated in VideoChat2-IT, which is helpful for more detailed and accurate responses.
  • 2024/06/25: We release a vllm branch of videochat2, which speeds up VideoChat2 inference.
  • 2024/06/19: 🎉🎉 Our VideoChat2 achieves the best performance among open-sourced VideoLLMs on MLVU, a multi-task long video understanding benchmark.
  • 2024/06/13: Fix some bugs and add testing scripts.
  • 2024/06/07: 🔥🔥🔥 We release VideoChat2_HD, which is fine-tuned with high-resolution data and is capable of handling more diverse tasks. It showcases better performance on different benchmarks, especially for detailed captioning. Furthermore, it achieves 54.8% on Video-MME, the best score among 7B MLLMs. Have a try! 🏃🏻‍♀️🏃🏻
  • 2024/06/06: We release VideoChat2_phi3, a faster model with robust performance.
  • 2024/05/22: We release VideoChat2_mistral, which shows better capacity on diverse tasks (60.4% on MVBench, 78.6% on NExT-QA, 63.8% on STAR, 46.4% on TVQA, 54.4% on EgoSchema-full and 80.5% on IntentQA). More details have been updated in the paper.
  • 2024/04/05: MVBench is selected as Poster (Highlight)! 🎉🎉
  • 2024/02/27: MVBench is accepted by CVPR2024! 🎉🎉
  • 2023/12/17: Online leaderboard.
  • 2023/12/04: Brief introduction.
  • 2023/11/29: Release VideoChat2 and MVBench.

🦜 VideoChat2

Progressive Training

Stage1 aligns UMT-L, the visual encoder, with QFormer to efficiently compress extensive visual inputs. Stage2 extends this connection to incorporate the LLM, while Stage3 focuses on effective instruction tuning to enhance model performance.

We build a diverse instruction dataset with 2M samples from 34 distinct sources. Check DATA for more details.
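The stage-wise freezing schedule can be read as the minimal sketch below. This is an illustration, not the repository's actual training code; the `vision_encoder`, `qformer`, and `llm` attribute names and the `lora` parameter naming are assumptions.

```python
import torch.nn as nn

def set_trainable(module: nn.Module, trainable: bool) -> None:
    """Freeze (❄️) or unfreeze (🔥) every parameter of a module."""
    for p in module.parameters():
        p.requires_grad = trainable

def configure_stage(model: nn.Module, stage: int) -> None:
    """Apply the per-stage schedule summarized in the Model table below."""
    if stage == 1:
        set_trainable(model.vision_encoder, False)   # ViT frozen
        set_trainable(model.qformer, True)           # QFormer trained
        # no LLM or LoRA in Stage1
    elif stage == 2:
        set_trainable(model.vision_encoder, True)    # ViT unfrozen
        set_trainable(model.qformer, True)
        set_trainable(model.llm, False)              # LLM frozen
    else:                                            # Stage3 and the HD stage
        set_trainable(model.vision_encoder, True)
        set_trainable(model.qformer, True)
        set_trainable(model.llm, False)              # base LLM stays frozen
        for name, p in model.llm.named_parameters():
            if "lora" in name:                       # only LoRA adapters train
                p.requires_grad = True
```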

Model

|           | ViT | QFormer | LLM | LoRA | Shell (Vicuna) | Model (Vicuna) | Shell (Mistral) | Model (Mistral) | Shell (Phi3) | Model (Phi3) |
|-----------|-----|---------|-----|------|----------------|----------------|-----------------|-----------------|--------------|--------------|
| Stage1    | ❄️  | 🔥      | 🚫  | 🚫   | config & run   | 🤗ckpt         | SAME            | SAME            | SAME         | SAME         |
| Stage2    | 🔥  | 🔥      | ❄️  | 🚫   | config & run   | 🤗ckpt         | config & run    | 🤗ckpt          | config & run | 🤗ckpt       |
| Stage3    | 🔥  | 🔥      | ❄️  | 🔥   | config & run   | 🤗ckpt         | config & run    | 🤗ckpt          | config & run | 🤗ckpt       |
| Stage4_HD | 🔥  | 🔥      | ❄️  | 🔥   | -              | -              | config & run    | 🤗ckpt          | -            | -            |
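The 🤗ckpt entries above link to checkpoints on the Hugging Face Hub. Below is a minimal, hedged sketch of downloading one with `huggingface_hub`; the repo id is a placeholder to be replaced with the id behind the specific 🤗ckpt link you need.

```python
from huggingface_hub import snapshot_download

# Placeholder repo id (assumption): substitute the actual checkpoint repo
# from the 🤗ckpt link in the table above.
local_dir = snapshot_download(
    repo_id="<org>/<videochat2-checkpoint>",
    local_dir="./checkpoints/videochat2",
)
print(f"Checkpoint downloaded to {local_dir}")
```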

Results

| Model | MVBench | Video-MME | Video-MME<br>w/ subtitles | VideoChatGPT | NExT-QA<br>(in-domain) | STAR<br>(zero-shot) | TVQA<br>(zero-shot) | EgoSchema<br>(full) | EgoSchema<br>(subset) | IntentQA<br>(in-domain Val) | IntentQA<br>(in-domain Test) |
|-------|---------|-----------|---------------------------|--------------|------------------------|---------------------|---------------------|---------------------|-----------------------|-----------------------------|------------------------------|
| VideoChat2<br>(Vicuna)     | 51.1 | -    | -    | 2.98 | 68.6 | 59.0 | 40.6 | -    | -    | -    | -    |
| VideoChat2<br>(Phi3)       | 55.1 | -    | -    | 2.91 | 73.1 | 63.3 | 40.1 | 56.7 | 59.8 | 69.0 | 71.6 |
| VideoChat2<br>(Mistral)    | 60.4 | 42.3 | 54.6 | 2.95 | 78.6 | 63.8 | 46.4 | 54.4 | 63.6 | 80.5 | 81.9 |
| VideoChat2_HD<br>(Mistral) | 62.3 | 45.3 | 55.7 | 3.10 | 79.5 | 63.9 | 50.6 | 55.8 | 65.6 | 81.1 | 83.4 |
  • (2024/06/07) For Video-MME, our current version has some missing videos and subtitles, see issue
    • Missing videos: Short (2), Medium (3), Long (11)
    • Missing subtitles: Short (93), Medium (52), Long (10)
  • For VideoChatGPT, VideoChat2_mistral and VideoChat2_phi3 are evaluated with gpt-3.5-turbo-0125, while VideoChat2_vicuna was evaluated with gpt-3.5-turbo-1106.
  • For NExT-QA, we report in-domain results since its training set is used as instruction data.
  • For STAR, we input 32 frames; for the other datasets, we input 16 frames (see the sampling sketch after this list).
  • For IntentQA, we report results on both the validation and test splits.
  • For testing EgoSchema and Video-MME, please check the demo_mistral.ipynb and demo_mistral_hd.ipynb.
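As noted above, evaluation uses uniformly sampled frames (32 for STAR, 16 elsewhere). The sketch below illustrates such sampling, assuming `decord` is installed; it is an illustration, not the repository's exact data loader.

```python
import numpy as np
from decord import VideoReader, cpu

def sample_frames(video_path: str, num_frames: int = 16) -> np.ndarray:
    """Uniformly sample `num_frames` RGB frames from a video."""
    vr = VideoReader(video_path, ctx=cpu(0))
    indices = np.linspace(0, len(vr) - 1, num_frames).astype(int)
    return vr.get_batch(indices).asnumpy()  # shape: (num_frames, H, W, 3)
```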

Usage

  • Prepare the environment:

    conda create -n videochat2 python=3.9
    conda activate videochat2
    pip install -r requirements.txt
  • Stage1 training:

    bash scripts/videochat_vicuna/run_7b_stage1.sh
  • Stage2 training:

    # Vicuna
    bash scripts/videochat_vicuna/run_7b_stage2.sh
    # Mistral
    bash scripts/videochat_mistral/run_7b_stage2.sh
  • Stage3 training:

    # Vicuna
    bash scripts/videochat_vicuna/run_7b_stage3.sh
    # Mistral
    bash scripts/videochat_mistral/run_7b_stage3.sh
  • Running the demo:

    # Set the related model path in configs/config.json and demo/demo.py
    python demo/demo.py
  • Evaluation:

    • MVBench: mvbench.ipynb. The script is written for Vicuna; for Mistral, please follow demo_mistral.ipynb to adapt it.
    • For the VideoChatGPT Benchmark, we follow the original repo and use ChatGPT-3.5 to evaluate the performance.
    • For NExT-QA, STAR and TVQA, we follow SeViLA to prepare the data. We simply modify mvbench.ipynb to directly output the options and calculate the accuracy (see the sketch after this list).
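The option-matching accuracy mentioned above can be illustrated with the minimal sketch below; it is an assumption of how the matching might look, not the repository's exact evaluation script.

```python
import re

def extract_option(response: str) -> str:
    """Pull the first option letter (A-E) out of a model response."""
    match = re.search(r"\b([A-E])\b", response.strip().upper())
    return match.group(1) if match else ""

def accuracy(predictions: list[str], answers: list[str]) -> float:
    """Fraction of responses whose extracted option matches the ground truth."""
    correct = sum(extract_option(p) == a.upper() for p, a in zip(predictions, answers))
    return correct / max(len(answers), 1)

# Example: accuracy(["Answer: (B) a dog", "C"], ["B", "A"]) -> 0.5
```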

📊 MVBench

We propose a comprehensive video understanding benchmark with 20 challenging video tasks, where our VideoChat2 secures the top ranking on 15 tasks. More details can be found here.

The online leaderboard is hosted on 🤗 Hugging Face.

📄 Citation

If you find this project useful in your research, please consider citing:

@article{2023videochat,
  title={VideoChat: Chat-Centric Video Understanding},
  author={KunChang Li and Yinan He and Yi Wang and Yizhuo Li and Wenhai Wang and Ping Luo and Yali Wang and Limin Wang and Yu Qiao},
  journal={arXiv preprint arXiv:2305.06355},
  year={2023}
}

@misc{li2023mvbench,
      title={MVBench: A Comprehensive Multi-modal Video Understanding Benchmark}, 
      author={Kunchang Li and Yali Wang and Yinan He and Yizhuo Li and Yi Wang and Yi Liu and Zun Wang and Jilan Xu and Guo Chen and Ping Luo and Limin Wang and Yu Qiao},
      year={2023},
      eprint={2311.17005},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

💫 Acknowledgement

Thanks to the following open-source projects:

InternVid, UMT, MiniGPT-4, LLaVA, BLIP2, VideoChatGPT, Vicuna, M3-IT.