Add MMMU evals and runner #1442

etr2460 · 2023-12-20T22:17:38Z

Eval details 📑

Eval name

MMMU

Eval description

A multi-modal version of MMLU published here: https://arxiv.org/pdf/2311.16502.pdf

What makes this a useful eval?

Tests a variety of subjects, along with image recognition and comprehension

Criteria for a good eval ✅

Below are some of the criteria we look for in a good eval. In general, we are seeking cases where the model does not do a good job despite being capable of generating a good response (note that there are some things large language models cannot do, so those would not make good evals).

Your eval should be:

Thematically consistent: The eval should be thematically consistent. We'd like to see a number of prompts all demonstrating some particular failure mode. For example, we can create an eval on cases where the model fails to reason about the physical world.
Contains failures where a human can do the task, but either GPT-4 or GPT-3.5-Turbo could not.
Includes good signal around what is the right behavior. This means either a correct answer for Basic evals or the Fact Model-graded eval, or an exhaustive rubric for evaluating answers for the Criteria Model-graded eval.
Include at least 15 high-quality examples.

If there is anything else that makes your eval worth including, please document it below.

Unique eval value

Multimodal, covers many subjects

Eval structure 🏗️

Your eval should

Check that your YAML is registered at evals/registry/evals/{name}.yaml
Ensure you have the right to use the data you submit via this eval

Eval JSON data

Dataset defined here: https://huggingface.co/datasets/MMMU/MMMU

Eval Results

on gpt-4-vision-preview:

{
  "mmmu-accounting": 0.5333333333333333,
  "mmmu-agriculture": 0.6333333333333333,
  "mmmu-architecture-and-engineering": 0.16666666666666666,
  "mmmu-art": 0.7333333333333333,
  "mmmu-art-theory": 0.8333333333333334,
  "mmmu-basic-medical-science": 0.6,
  "mmmu-biology": 0.43333333333333335,
  "mmmu-chemistry": 0.43333333333333335,
  "mmmu-clinical-medicine": 0.6333333333333333,
  "mmmu-computer-science": 0.6333333333333333,
  "mmmu-design": 0.7666666666666667,
  "mmmu-diagnostics-and-laboratory-medicine": 0.3,
  "mmmu-economics": 0.6333333333333333,
  "mmmu-electronics": 0.4,
  "mmmu-energy-and-power": 0.36666666666666664,
  "mmmu-finance": 0.43333333333333335,
  "mmmu-geography": 0.4,
  "mmmu-history": 0.6666666666666666,
  "mmmu-literature": 0.9,
  "mmmu-manage": 0.6,
  "mmmu-marketing": 0.6333333333333333,
  "mmmu-materials": 0.26666666666666666,
  "mmmu-math": 0.5,
  "mmmu-mechanical-engineering": 0.23333333333333334,
  "mmmu-music": 0.36666666666666664,
  "mmmu-pharmacy": 0.7666666666666667,
  "mmmu-physics": 0.43333333333333335,
  "mmmu-psychology": 0.7,
  "mmmu-public-health": 0.8,
  "mmmu-sociology": 0.5666666666666667
}
Average accuracy: 0.5455555555555556

Note that this is slightly lower than the MMMU paper's findings of 0.568. There's likely prompt engineering that could be done to improve this, but I'll leave that as an exercise for later

logankilpatrick

Stamp!

## Eval details 📑 ### Eval name MMMU ### Eval description A multi-modal version of MMLU published here: https://arxiv.org/pdf/2311.16502.pdf ### What makes this a useful eval? Tests a variety of subjects, along with image recognition and comprehension ## Criteria for a good eval ✅ Below are some of the criteria we look for in a good eval. In general, we are seeking cases where the model does not do a good job despite being capable of generating a good response (note that there are some things large language models cannot do, so those would not make good evals). Your eval should be: - [x] Thematically consistent: The eval should be thematically consistent. We'd like to see a number of prompts all demonstrating some particular failure mode. For example, we can create an eval on cases where the model fails to reason about the physical world. - [x] Contains failures where a human can do the task, but either GPT-4 or GPT-3.5-Turbo could not. - [x] Includes good signal around what is the right behavior. This means either a correct answer for `Basic` evals or the `Fact` Model-graded eval, or an exhaustive rubric for evaluating answers for the `Criteria` Model-graded eval. - [x] **Include at least 15 high-quality examples.** If there is anything else that makes your eval worth including, please document it below. ### Unique eval value Multimodal, covers many subjects ## Eval structure 🏗️ Your eval should - [x] Check that your YAML is registered at `evals/registry/evals/{name}.yaml` - [x] Ensure you have the right to use the data you submit via this eval ### Eval JSON data Dataset defined here: https://huggingface.co/datasets/MMMU/MMMU ### Eval Results on `gpt-4-vision-preview`: ``` { "mmmu-accounting": 0.5333333333333333, "mmmu-agriculture": 0.6333333333333333, "mmmu-architecture-and-engineering": 0.16666666666666666, "mmmu-art": 0.7333333333333333, "mmmu-art-theory": 0.8333333333333334, "mmmu-basic-medical-science": 0.6, "mmmu-biology": 0.43333333333333335, "mmmu-chemistry": 0.43333333333333335, "mmmu-clinical-medicine": 0.6333333333333333, "mmmu-computer-science": 0.6333333333333333, "mmmu-design": 0.7666666666666667, "mmmu-diagnostics-and-laboratory-medicine": 0.3, "mmmu-economics": 0.6333333333333333, "mmmu-electronics": 0.4, "mmmu-energy-and-power": 0.36666666666666664, "mmmu-finance": 0.43333333333333335, "mmmu-geography": 0.4, "mmmu-history": 0.6666666666666666, "mmmu-literature": 0.9, "mmmu-manage": 0.6, "mmmu-marketing": 0.6333333333333333, "mmmu-materials": 0.26666666666666666, "mmmu-math": 0.5, "mmmu-mechanical-engineering": 0.23333333333333334, "mmmu-music": 0.36666666666666664, "mmmu-pharmacy": 0.7666666666666667, "mmmu-physics": 0.43333333333333335, "mmmu-psychology": 0.7, "mmmu-public-health": 0.8, "mmmu-sociology": 0.5666666666666667 } Average accuracy: 0.5455555555555556 ``` Note that this is slightly lower than the MMMU paper's findings of `0.568`. There's likely prompt engineering that could be done to improve this, but I'll leave that as an exercise for later

etr2460 requested review from andrew-openai, jwang47, logankilpatrick and katyhshi as code owners December 20, 2023 22:17

etr2460 force-pushed the erik/mmmu branch from 4961924 to d6f30ea Compare December 20, 2023 22:36

logankilpatrick approved these changes Dec 20, 2023

View reviewed changes

Add MMMU evals and runner

0517ac9

etr2460 force-pushed the erik/mmmu branch from d6f30ea to 0517ac9 Compare December 21, 2023 00:19

etr2460 merged commit f20c305 into main Dec 21, 2023
2 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add MMMU evals and runner #1442

Add MMMU evals and runner #1442

etr2460 commented Dec 20, 2023 •

edited

Loading

logankilpatrick left a comment

Add MMMU evals and runner #1442

Add MMMU evals and runner #1442

Conversation

etr2460 commented Dec 20, 2023 • edited Loading

Eval details 📑

Eval name

Eval description

What makes this a useful eval?

Criteria for a good eval ✅

Unique eval value

Eval structure 🏗️

Eval JSON data

Eval Results

logankilpatrick left a comment

Choose a reason for hiding this comment

etr2460 commented Dec 20, 2023 •

edited

Loading