Add MMMU evals and runner (openai#1442)
## Eval details 📑

### Eval name
MMMU

### Eval description
A multi-modal version of MMLU, published here: https://arxiv.org/pdf/2311.16502.pdf

### What makes this a useful eval?
Tests a variety of subjects, along with image recognition and comprehension.

## Criteria for a good eval ✅

Below are some of the criteria we look for in a good eval. In general, we are seeking cases where the model does not do a good job despite being capable of generating a good response (note that there are some things large language models cannot do, so those would not make good evals). Your eval should be:

- [x] Thematically consistent: The eval should be thematically consistent. We'd like to see a number of prompts all demonstrating some particular failure mode. For example, we can create an eval on cases where the model fails to reason about the physical world.
- [x] Contains failures where a human can do the task, but either GPT-4 or GPT-3.5-Turbo could not.
- [x] Includes good signal around what is the right behavior. This means either a correct answer for `Basic` evals or the `Fact` Model-graded eval, or an exhaustive rubric for evaluating answers for the `Criteria` Model-graded eval.
- [x] **Include at least 15 high-quality examples.**

If there is anything else that makes your eval worth including, please document it below.

### Unique eval value
Multimodal, covers many subjects.

## Eval structure 🏗️

Your eval should:

- [x] Check that your YAML is registered at `evals/registry/evals/{name}.yaml`
- [x] Ensure you have the right to use the data you submit via this eval

### Eval JSON data
Dataset defined here: https://huggingface.co/datasets/MMMU/MMMU

### Eval Results
On `gpt-4-vision-preview`:

```
{
  "mmmu-accounting": 0.5333333333333333,
  "mmmu-agriculture": 0.6333333333333333,
  "mmmu-architecture-and-engineering": 0.16666666666666666,
  "mmmu-art": 0.7333333333333333,
  "mmmu-art-theory": 0.8333333333333334,
  "mmmu-basic-medical-science": 0.6,
  "mmmu-biology": 0.43333333333333335,
  "mmmu-chemistry": 0.43333333333333335,
  "mmmu-clinical-medicine": 0.6333333333333333,
  "mmmu-computer-science": 0.6333333333333333,
  "mmmu-design": 0.7666666666666667,
  "mmmu-diagnostics-and-laboratory-medicine": 0.3,
  "mmmu-economics": 0.6333333333333333,
  "mmmu-electronics": 0.4,
  "mmmu-energy-and-power": 0.36666666666666664,
  "mmmu-finance": 0.43333333333333335,
  "mmmu-geography": 0.4,
  "mmmu-history": 0.6666666666666666,
  "mmmu-literature": 0.9,
  "mmmu-manage": 0.6,
  "mmmu-marketing": 0.6333333333333333,
  "mmmu-materials": 0.26666666666666666,
  "mmmu-math": 0.5,
  "mmmu-mechanical-engineering": 0.23333333333333334,
  "mmmu-music": 0.36666666666666664,
  "mmmu-pharmacy": 0.7666666666666667,
  "mmmu-physics": 0.43333333333333335,
  "mmmu-psychology": 0.7,
  "mmmu-public-health": 0.8,
  "mmmu-sociology": 0.5666666666666667
}
Average accuracy: 0.5455555555555556
```

Note that this is slightly lower than the `0.568` reported in the MMMU paper. There is likely prompt engineering that could be done to improve this, but I'll leave that as an exercise for later.
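For readers who want to inspect the source data, here is a minimal sketch of pulling one subject from the Hugging Face dataset linked above. The config name (`"Accounting"`), split name, and field names are assumptions taken from the dataset card, not from this PR.

```python
# Minimal sketch (not part of this PR): load one MMMU subject from the
# Hugging Face Hub. Config, split, and field names are assumed from the
# dataset card at https://huggingface.co/datasets/MMMU/MMMU.
from datasets import load_dataset

ds = load_dataset("MMMU/MMMU", "Accounting", split="validation")

sample = ds[0]
print(sample["question"])  # question text; associated images are stored in separate fields
print(sample["options"])   # multiple-choice options
print(sample["answer"])    # gold answer letter
```

Once the YAML is registered, each subject eval should be runnable with the standard CLI, e.g. `oaieval gpt-4-vision-preview mmmu-accounting`; the eval names here are taken from the results above, and the exact completion-fn name is an assumption.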
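As a sanity check, the overall figure is simply the mean of the per-subject accuracies (each subject appears to contribute 30 validation examples, judging by the thirtieths in the results), so it can be recomputed with a few lines:

```python
# Sanity-check sketch: recompute the reported average from the per-subject
# accuracies above. "results.json" is a hypothetical file holding that mapping.
import json

with open("results.json") as f:  # hypothetical path, not part of the PR
    results = json.load(f)

average = sum(results.values()) / len(results)
print(f"Average accuracy: {average}")  # -> 0.5455555555555556
```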