[Release] lmms-eval v0.3.0 release (#428)
* [Feat] Add qwen2_audio model support and Automatic speech recognition task with LibriSpeech dataset (#289)

* "add qwen2_audio model, asr librispeech eval task"

* lint update for PR #289

---------

Co-authored-by: Pengyun <u7978909@anu.edu.au>

* add clotho_aqa task

* Apply black formatting

* formatting

* exclude xl due to a downloading issue

* [Feat] add audiobench version of clothoaqa (#302)

* add clothoaqa task

* formatting

* minor fixes

* minor fixes

* Add AIR_bench task (#315)

* add air_bench

* minor changes

* add common_voice_15 and people_speech tasks (#316)

Co-authored-by: Pengyun <u7978909@anu.edu.au>

* add indent to yaml

* Add openhermes task (#323)

* add openhermes task

* formatting

* [Refactor] Fixing doc to audio return type, qwen_audio revise (#329)

* Add downsample function for audio array

* Batch support for qwen2 and use apply chat template

* Return sr for common voice

* Doc to audio to return the whole dict
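The audio-array downsampling added above can be illustrated with a minimal sketch. This `downsample` helper is hypothetical (pure-Python linear-interpolation resampling), not the repository's actual implementation, which may rely on a library resampler:

```python
def downsample(audio, orig_sr, target_sr):
    """Resample a 1-D audio sequence to a lower sampling rate via linear interpolation.

    audio: sequence of samples; orig_sr/target_sr: sampling rates in Hz.
    Returns the input unchanged (as a list) if no downsampling is needed.
    """
    if target_sr >= orig_sr:
        return list(audio)
    ratio = orig_sr / target_sr          # >1: how many input samples per output sample
    n_out = int(len(audio) / ratio)
    out = []
    for i in range(n_out):
        pos = i * ratio                  # fractional position in the input
        lo = int(pos)
        hi = min(lo + 1, len(audio) - 1)
        frac = pos - lo
        # Linear interpolation between the two neighboring input samples.
        out.append(audio[lo] * (1 - frac) + audio[hi] * frac)
    return out
```

With an integer rate ratio this reduces to simple decimation (taking every n-th sample).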

* add muchomusic and vocalsound task (#331)

* add alpaca audio task (#333)

* [feat] added gigaspeech config (#334)

* fix xl yaml

* Fixed config for gigaspeech_xl. gigaspeech_xl_test has intermittent problem.

* add alpaca audio task (#333)

* pre-committed utils.py

---------

Co-authored-by: Cong <101887866+pbcong@users.noreply.github.com>

* add tedlium_long_form and tedlium_dev_test tasks (#345)

Co-authored-by: Pengyun <u7978909@anu.edu.au>

* [Feat] add-wavcaps (#349)

* fix xl yaml

* Fixed config for gigaspeech_xl. gigaspeech_xl_test has intermittent problem.

* add alpaca audio task (#333)

* pre-committed utils.py

* add wavcaps

* add wavcaps

---------

Co-authored-by: Cong <101887866+pbcong@users.noreply.github.com>

* Update dep and fix log samples for audio (#355)

* Update dep

* Fix saved audio OOM error

* Fix typing

* Fix librispeech dataset name

* Add add_generation_prompt as option for Qwen audio

* Add system prompt as an optional setting

* fix vocalsound (#362)

* Add option to use a simple prompt for Qwen2 Audio for alignment (#360)

* Add retry for gpt api call and improve air_bench aggregation function (#376)

* add retry for api calls and change air_bench_foundation aggregation function

* make azure default api

* minor changes
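The retry behavior added for the GPT judge API calls can be sketched as exponential backoff; `with_retry` is an illustrative name and shape, not the function used in the repository:

```python
import time

def with_retry(call, max_retries=5, base_delay=1.0):
    """Invoke a zero-argument callable, retrying on failure with exponential backoff.

    Sleeps base_delay * 2**attempt between attempts; re-raises after the final one.
    """
    for attempt in range(max_retries):
        try:
            return call()
        except Exception:
            if attempt == max_retries - 1:
                raise                    # exhausted all attempts
            time.sleep(base_delay * (2 ** attempt))
```

A transient HTTP error or rate-limit response from the judge endpoint then costs a short delay instead of failing the whole aggregation run.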

* [Feat] Add mix_evals audio2text (#420)

* Add mix_evals audio2text

* Fix task tags in datasets

* Gemini Audio (#421)

* gemini audio

* better variable naming

* Revise prompt

* delete redundant tasks in gigaspeech

* Fix wavcaps bugs

* Add lmms-eval-0.3 docs

Update lmms-eval-0.3.md

fix errors in markdown and add hyperlinks

proofread markdown and fix errors

rewrite some parts to fix errors

rewrite some parts to fix errors

rewrite some parts to fix errors

try optimize the table format using html

try optimize the table format using html

try optimize the table 2 format

final proofread

final proofread

final proofread

add explanation for AIF and ASR

standardize WER to WER(↓)
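For context, WER (word error rate) is the standard ASR metric: the word-level edit distance between hypothesis and reference, divided by the reference length; the (↓) marks it as lower-is-better. A minimal sketch (illustrative only, not the metric implementation used by the tasks):

```python
def wer(reference, hypothesis):
    """Word error rate: word-level Levenshtein distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # DP table: d[i][j] = edits to turn the first i ref words into the first j hyp words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)
```

Production ASR scoring additionally normalizes text (casing, punctuation) before computing the distance.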

final proofread

final proofread

final proofread

final proofread

correct hyperlink errors

modify readme to support lmms-eval0.3.0 release

modify icon

fix typos

Co-Authored-By: KairuiHu <kairuih12@gmail.com>

---------

Co-authored-by: Pengyun Wang <91826032+Prophet-C@users.noreply.github.com>
Co-authored-by: Pengyun <u7978909@anu.edu.au>
Co-authored-by: pbcong <congphamba2005@gmail.com>
Co-authored-by: Li Bo <drluodian@gmail.com>
Co-authored-by: Cong <101887866+pbcong@users.noreply.github.com>
Co-authored-by: Yingluo <liyingluo57@gmail.com>
Co-authored-by: Totoluo <52833580+Yingluo-momo@users.noreply.github.com>
Co-authored-by: Pu Fanyi <FPU001@e.ntu.edu.sg>
Co-authored-by: KairuiHu <kairuih12@gmail.com>
10 people authored Nov 27, 2024
1 parent d2056e6 commit 0cee464
Showing 82 changed files with 9,828 additions and 18 deletions.
17 changes: 13 additions & 4 deletions README.md
@@ -19,18 +19,27 @@
---

## Announcement

- [2024-11] 🔈🔊 The `lmms-eval/v0.3.0` has been upgraded to support audio evaluations for audio models like Qwen2-Audio and Gemini_Audio across tasks such as AIR-Bench, Clotho-AQA, LibriSpeech, and more. Please refer to the [blog](https://github.com/EvolvingLMMs-Lab/lmms-eval/blob/main/docs/lmms-eval-0.3.md) for more details!

- [2024-07] 🎉🎉 We have released the [technical report](https://arxiv.org/abs/2407.12772) and [LiveBench](https://huggingface.co/spaces/lmms-lab/LiveBench)!

- [2024-06] 🎬🎬 The `lmms-eval/v0.2.0` has been upgraded to support video evaluations for video models like LLaVA-NeXT Video and Gemini 1.5 Pro across tasks such as EgoSchema, PerceptionTest, VideoMME, and more. Please refer to the [blog](https://lmms-lab.github.io/posts/lmms-eval-0.2/) for more details!

- [2024-03] 📝📝 We have released the first version of `lmms-eval`; please refer to the [blog](https://lmms-lab.github.io/posts/lmms-eval-0.1/) for more details!

<details>
<summary>We warmly welcome contributions from the open-source community! Below is a chronological list of recent tasks, models, and features added by our amazing contributors. </summary>

- [2024-10] 🎉🎉 We welcome the new task [NaturalBench](https://huggingface.co/datasets/BaiqiL/NaturalBench), a vision-centric VQA benchmark (NeurIPS'24) that challenges vision-language models with simple questions about natural imagery.
- [2024-10] 🎉🎉 We welcome the new task [TemporalBench](https://huggingface.co/datasets/microsoft/TemporalBench) for fine-grained temporal understanding and reasoning for videos, which reveals a huge (>30%) human-AI gap.
- [2024-10] 🎉🎉 We welcome the new tasks [VDC](https://rese1f.github.io/aurora-web/) for video detailed captioning, [MovieChat-1K](https://rese1f.github.io/MovieChat/) for long-form video understanding, and [Vinoground](https://vinoground.github.io/), a temporal counterfactual LMM benchmark composed of 1000 short natural video-caption pairs. We also welcome the new models: [AuroraCap](https://github.com/rese1f/aurora) and [MovieChat](https://github.com/rese1f/MovieChat).
- [2024-09] 🎉🎉 We welcome the new tasks [MMSearch](https://mmsearch.github.io/) and [MME-RealWorld](https://mme-realworld.github.io/).
- [2024-09] ⚙️️⚙️️️️ We upgrade `lmms-eval` to `0.2.3` with more tasks and features. We support a compact set of language task evaluations (code credit to [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness)), and we remove the registration logic at startup (for all models and tasks) to reduce overhead. Now `lmms-eval` only launches the necessary tasks/models. Please check the [release notes](https://github.com/EvolvingLMMs-Lab/lmms-eval/releases/tag/v0.2.3) for more details.
- [2024-08] 🎉🎉 We welcome the new models [LLaVA-OneVision](https://huggingface.co/papers/2408.03326) and [Mantis](https://github.com/EvolvingLMMs-Lab/lmms-eval/pull/162), and the new tasks [MVBench](https://huggingface.co/datasets/OpenGVLab/MVBench), [LongVideoBench](https://github.com/EvolvingLMMs-Lab/lmms-eval/pull/117), and [MMStar](https://github.com/EvolvingLMMs-Lab/lmms-eval/pull/158). We provide a new SGLang Runtime API feature for the llava-onevision model; please refer to the [doc](https://github.com/EvolvingLMMs-Lab/lmms-eval/blob/main/docs/commands.md) for inference acceleration.
- [2024-07] 🎉🎉 We have released the [technical report](https://arxiv.org/abs/2407.12772) and [LiveBench](https://huggingface.co/spaces/lmms-lab/LiveBench)!
- [2024-07] 👨‍💻👨‍💻 The `lmms-eval/v0.2.1` has been upgraded to support more models, including [LongVA](https://github.com/EvolvingLMMs-Lab/LongVA), [InternVL-2](https://github.com/OpenGVLab/InternVL), [VILA](https://github.com/NVlabs/VILA), and many more evaluation tasks, e.g. [Details Captions](https://github.com/EvolvingLMMs-Lab/lmms-eval/pull/136), [MLVU](https://arxiv.org/abs/2406.04264), [WildVision-Bench](https://huggingface.co/datasets/WildVision/wildvision-arena-data), [VITATECS](https://github.com/lscpku/VITATECS) and [LLaVA-Interleave-Bench](https://llava-vl.github.io/blog/2024-06-16-llava-next-interleave/).

- [2024-06] 🎬🎬 The `lmms-eval/v0.2.0` has been upgraded to support video evaluations for video models like LLaVA-NeXT Video and Gemini 1.5 Pro across tasks such as EgoSchema, PerceptionTest, VideoMME, and more. Please refer to the [blog](https://lmms-lab.github.io/posts/lmms-eval-0.2/) for more details!

- [2024-03] 📝📝 We have released the first version of `lmms-eval`; please refer to the [blog](https://lmms-lab.github.io/posts/lmms-eval-0.1/) for more details!
</details>

## Why `lmms-eval`?

