VLM-Eval: A General Evaluation on Video Large Language Models 🎯🔍


Welcome to the official repository of VLM-Eval: A General Evaluation on Video Large Language Models, a comprehensive evaluation framework for video LLMs covering video captioning, question answering, text-video retrieval, and action recognition.

What Sets VLM-Eval Apart?

VLM-Eval combines human-verified GPT-based evaluation with retrieval-based evaluation, covering tasks from video captioning and question answering to action recognition.

The framework distinguishes itself with its dual evaluation approach. The GPT-based evaluation harnesses the capabilities of Generative Pre-trained Transformers to analyze and interpret video captions, ensuring reliable and efficient assessment through human-verified prompts. On the other hand, the retrieval-based evaluation focuses on the practical application of these models in real-world scenarios, emphasizing their utility in downstream applications.
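
As a concrete illustration of the GPT-based side, the sketch below shows how such a judge is typically prompted: the model's prediction and the ground-truth answer are handed to GPT, which returns a correctness verdict and a score. The prompt wording, model choice, and field names here are illustrative assumptions, not VLM-Eval's actual human-verified prompts.

# Illustrative sketch of a GPT-based judge; NOT VLM-Eval's exact prompt.
# Assumes the pre-1.0 `openai` Python client and a gpt-3.5-turbo judge.
import os
import openai

openai.api_key = os.environ["OPENAI_API_KEY"]

def judge_answer(question, ground_truth, prediction):
    """Ask GPT whether `prediction` matches `ground_truth`; return its raw verdict."""
    messages = [
        {"role": "system",
         "content": "You evaluate video question answering. "
                    "Reply with 'yes' or 'no' for correctness and a score from 0 to 5."},
        {"role": "user",
         "content": f"Question: {question}\n"
                    f"Correct answer: {ground_truth}\n"
                    f"Predicted answer: {prediction}"},
    ]
    response = openai.ChatCompletion.create(model="gpt-3.5-turbo", messages=messages)
    return response["choices"][0]["message"]["content"]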

Awesome Video Large Language Models 🎥🦙

We welcome additions to the list of papers and models, as well as evaluation results.

  • Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding Paper Github Demo
  • VideoChat: Chat-Centric Video Understanding Paper Github
  • Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models Paper Github
  • Video-LLaVA: Learning United Visual Representation by Alignment Before Projection Paper Github Demo
  • VideoChat2: Chat-Centric Video Understanding Paper Github Demo
  • LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models Paper Github Demo

Datasets 💾🌐

The following datasets are used in our training.

| Dataset | Task Domain | Scale |
| --- | --- | --- |
| WebVid (link) | Video captioning | 10M |
| NExT-QA (link) | Video QA | 5K |
| DiDemo (link) | Video captioning, temporal localization | 10K |
| MSRVTT (link) | Video QA, video captioning | 10K |
| MSVD (link) | Video QA, video captioning | 2K |
| TGIF-QA (link) | Video QA | 72K |
| HMDB51 (link) | Action recognition | 7K |
| UCF101 (link) | Action recognition | 13K |
| VideoInstruct-100K (link) | Chat | 100K |

Other datasets:

| Dataset | Task Domain | Scale |
| --- | --- | --- |
| VideoChatInstruct-11K (link) | Chat | 11K |
| Valley-webvid2M-Pretrain-703K (link) | Video captioning | 703K |
| Valley-Instruct-73k (link) | Chat | 73K |
| Kinetics-400 (link) | Action recognition | 650K |

Results

The detailed benchmark results below compare the capabilities of different video LLMs on question answering, captioning, text-video retrieval, and action recognition.

| Method | QA Acc | QA Match | Cap Prec | Cap Cov | T2V Acc5 | V2T Acc5 | Act Acc1 | Act Acc5 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Video-LLaMA | 32.2 | 2.26 | 1.99 | 2.08 | 23.4 | 23.4 | 28.2 | 47.6 |
| VideoChat | 41.6 | 2.63 | 2.02 | 2.16 | 25.8 | 30.3 | 36.0 | 56.8 |
| Video-ChatGPT | 46.2 | 2.84 | 2.49 | 2.65 | 28.2 | 32.8 | 40.3 | 64.5 |
| Video-LLaVA (VLM-Eval) | 48.8 | 2.98 | 2.75 | 2.86 | 33.5 | 31.9 | 43.0 | 64.3 |
| Video-LLaVA (Lin et al.) | 48.0 | 2.90 | 2.18 | 2.29 | 29.7 | 31.4 | 41.3 | 63.9 |
| VideoChat2 | 44.6 | 2.80 | 2.29 | 2.42 | 28.9 | 28.7 | 35.6 | 55.4 |
| LLaMA-VID | 50.1 | 2.97 | 2.18 | 2.25 | 29.4 | 30.1 | 45.1 | 67.4 |

Get Started with VLM-Eval: Your Journey Begins Here 🚀👨‍💻

For those eager to explore VLM-Eval, this section provides a guide to getting started: instructions for setting up the environment, along with examples for each evaluation task. At the current stage, you must download the videos from the original datasets yourself. We also welcome requests to evaluate your open-source video LLM.

Set OpenAI API Key

The first step is to set your OpenAI API key in the environment, which you can do with the command below.

export OPENAI_API_KEY="your api key here"

Caption GPT

The following command guides GPT-3.5 to assess video captions on the MSVD dataset. The evaluation uses approximately 62K prompt tokens and 14K completion tokens, which costs only about $0.09.

It is normal to encounter some errors during the GPT evaluation process. The evaluation code is designed to retry automatically.

You may download the prediction file msvd_cap_results_final.json from here, and the ground-truth file MMGPT_evalformat_MSVD_Caption_test.json from here.

python -m eval_bench.gpt_eval.evaluate_cap_pr_v2 \
    --pred_path path/to/msvd_cap_results_final.json \
    --gt /data/data/labels/test/MMGPT_evalformat_MSVD_Caption_test.json \
    --output_dir results/ours/msvd_cap_pr/ \
    --output_json results/ours/acc_msvd_cap_pr.json \
    --out_path results/ours/cap_pr_metric_msvd.json \
    --num_tasks 8 \
    --kernel_size 5 \
    --max_try_times 15
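
The automatic retry mentioned above (bounded here by --max_try_times) typically amounts to wrapping each API call in a retry loop with backoff. The sketch below is a minimal illustration under that assumption, again using the pre-1.0 openai client; it is not the repository's exact implementation.

# Minimal retry-with-backoff sketch for GPT calls (illustrative only).
import time
import openai

def call_with_retry(messages, model="gpt-3.5-turbo", max_try_times=15):
    """Retry transient API failures (rate limits, timeouts) with exponential backoff."""
    for attempt in range(max_try_times):
        try:
            response = openai.ChatCompletion.create(model=model, messages=messages)
            return response["choices"][0]["message"]["content"]
        except Exception as exc:  # e.g. openai.error.RateLimitError
            wait = min(2 ** attempt, 60)
            print(f"Attempt {attempt + 1}/{max_try_times} failed ({exc}); retrying in {wait}s")
            time.sleep(wait)
    raise RuntimeError("GPT evaluation failed after all retries")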

Caption Text-Video Retrieval

python -m eval_bench.eval_cap_ret \
    --eval_task retrieval \
    --pred path/to/msvd_cap_results_final.json \
    --gt /data/data/labels/test/MMGPT_evalformat_MSVD_Caption_test.json \
    --clip_model ViT-B/32
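
Retrieval evaluation of this kind embeds the predicted captions and sampled video frames with CLIP and checks whether each caption retrieves its own video among the top-5 most similar videos (and vice versa for V2T). The sketch below illustrates the idea with the openai/CLIP package; the frame sampling and bookkeeping are assumptions, not the repository's exact code.

# Sketch of CLIP-based text-video retrieval (illustrative only).
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def video_embedding(frame_paths):
    """Average normalized CLIP image features over a few sampled frames."""
    frames = torch.stack([preprocess(Image.open(p)) for p in frame_paths]).to(device)
    with torch.no_grad():
        feats = model.encode_image(frames)
    feats = feats / feats.norm(dim=-1, keepdim=True)
    mean = feats.mean(dim=0)
    return mean / mean.norm()

def caption_embedding(caption):
    tokens = clip.tokenize([caption], truncate=True).to(device)
    with torch.no_grad():
        feat = model.encode_text(tokens)[0]
    return feat / feat.norm()

def t2v_acc5(caption_feats, video_feats):
    """caption_feats, video_feats: [N, D] tensors where row i of each describes sample i."""
    sims = caption_feats @ video_feats.T                    # [N, N] similarity matrix
    top5 = sims.topk(5, dim=-1).indices                     # indices of the 5 closest videos
    targets = torch.arange(len(caption_feats), device=sims.device).unsqueeze(-1)
    return (top5 == targets).any(dim=-1).float().mean().item()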

Caption CIDEr

python -m eval_bench.eval_cap_ret \
    --eval_task caption \
    --pred path/to/msvd_cap_results_final.json \
    --gt /data/data/labels/test/MMGPT_evalformat_MSVD_Caption_test.json
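
If you want to sanity-check CIDEr outside the provided script, it can also be computed directly with the pycocoevalcap package; the snippet below assumes that package and simple whitespace-tokenized captions, not the repository's own pipeline.

# Standalone CIDEr computation sketch using pycocoevalcap (illustrative only).
from pycocoevalcap.cider.cider import Cider

# Map each video id to its reference captions and the single predicted caption.
references = {"vid001": ["a man is playing a guitar", "a person plays guitar"]}
predictions = {"vid001": ["a man plays the guitar"]}

corpus_score, per_video_scores = Cider().compute_score(references, predictions)
print(f"CIDEr: {corpus_score:.3f}")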

QA GPT

You may download the prediction file msvd_qa_preds_final.json from here. There is no need for a separate ground truth file since the ground truth is already embedded within the prediction file.

python -m eval_bench.gpt_eval.evaluate_qa_msvd \
    --pred_path path/to/msvd_qa_preds_final.json \
    --output_dir results/ours/msvd_qa/ \
    --output_json results/ours/acc_msvd_qa.json \
    --out_path results/ours/qa_metric_msvd.json \
    --num_tasks 8 \
    --task_name MSVD \
    --kernel_size 10 \
    --max_try_times 15
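
The per-sample GPT judgments are then aggregated into the "QA Acc" and "QA Match" numbers reported in the results table, roughly as sketched below (assuming each judgment has already been parsed into a yes/no verdict and a 0-5 score; the field names are hypothetical).

# Aggregating parsed GPT judgments into dataset-level QA metrics (sketch).
def aggregate_qa_metrics(judgments):
    """judgments: list of dicts such as {"verdict": "yes", "score": 4} (hypothetical schema)."""
    correct = sum(1 for j in judgments if j["verdict"].lower() == "yes")
    accuracy = 100.0 * correct / len(judgments)                        # "QA Acc"
    match_score = sum(j["score"] for j in judgments) / len(judgments)  # "QA Match"
    return accuracy, match_score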

Action Recognition

python -m eval_bench.eval_cap_ret \
    --eval_task action \
    --pred_path mmgpt/0915_act/ucf101_action_preds_final.json \
    --out_path results/mmgpt/0915_act/ucf101_action_preds_final.json \
    --clip_model ViT-B/32
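
Because the model answers in free-form text, action accuracy is also computed retrieval-style: the predicted answer and every class name are embedded with the CLIP text encoder, and a prediction counts as correct if the ground-truth class appears among the top-1 / top-5 most similar labels. The sketch below illustrates that matching step under the same openai/CLIP assumption; it is not the repository's exact implementation.

# Matching a free-form predicted action to class labels with CLIP text features (sketch).
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

def top_k_labels(prediction, class_names, k=5):
    """Return the k class names whose CLIP text embeddings are closest to the prediction."""
    class_names = list(class_names)
    texts = clip.tokenize([prediction] + class_names, truncate=True).to(device)
    with torch.no_grad():
        feats = model.encode_text(texts)
    feats = feats / feats.norm(dim=-1, keepdim=True)
    sims = feats[0] @ feats[1:].T                # similarity of prediction to each class name
    top_idx = sims.topk(min(k, len(class_names))).indices.tolist()
    return [class_names[i] for i in top_idx]

# Acc@1 / Acc@5: the ground-truth class must appear in top_k_labels(pred, classes, 1 or 5).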

Enjoy exploring and implementing VLM-Eval! 💡👩‍💻🌍

License 📝🔐

This project is licensed under the terms of the MIT License.

Join the Awesome Video Large Language Models Community 🎓🤝

VLM-Eval encourages active community participation. Researchers and developers are invited to contribute their papers, models, and evaluation results, fostering a collaborative environment. This section also highlights significant contributions from the community, showcasing a diverse range of insights and advancements in the field.

Contributions from the following sources have shaped this project:

Citing VLM-Eval 📚🖋

If VLM-Eval enhances your research, we appreciate your acknowledgment. Find citation details here, and don't forget to star our repository!

@article{li2023vlm,
  title={VLM-Eval: A General Evaluation on Video Large Language Models},
  author={Li, Shuailin and Zhang, Yuang and Zhao, Yucheng and Wang, Qiuyue and Jia, Fan and Liu, Yingfei and Wang, Tiancai},
  journal={arXiv preprint arXiv:2311.11865},
  year={2023}
}
