📖 MM-MATH: Advancing Multimodal Math Evaluation with Process Evaluation and Fine-grained Classification
MM-MATH is a benchmark designed to evaluate the mathematical reasoning capabilities of multimodal models, featuring problems that pair images with step-by-step reasoning. All problems are calculation-based and use an open-ended format. Each problem is annotated along three dimensions: difficulty, grade level, and knowledge points, enabling comprehensive evaluation of multimodal models.
We recognize the potentially high costs associated with model evaluation, particularly in scenarios involving diagrammatic math reasoning. Consequently, we adopt two fully automated evaluation methods: outcome evaluation and process evaluation. For outcome evaluation, the final answer to each MM-MATH problem is enclosed in \boxed{} format, and a model's final answer is extracted from the same format. For process evaluation, we provide carefully designed prompts that use GPT-4V to compare the model's problem-solving process with the ground truth, identify the first error, and classify it.
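As a minimal sketch of how the \boxed{} answer might be pulled from a model response (the helper name and regex below are illustrative, not the repository's actual evaluation code):

```python
import re

def extract_boxed_answer(response: str) -> str | None:
    """Return the content of the last \\boxed{...} expression in a response.

    Handles one level of nested braces, e.g. \\boxed{\\frac{9}{2}}.
    """
    matches = re.findall(r"\\boxed\{((?:[^{}]|\{[^{}]*\})*)\}", response)
    return matches[-1].strip() if matches else None

# Example usage: the extracted string is then compared against the ground-truth answer.
print(extract_boxed_answer(r"Therefore, the area is \boxed{\frac{9}{2}}."))  # \frac{9}{2}
```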
Below are the average scores (%) on outcome evaluation in the zero-shot setting. We conducted assessments across three dimensions: difficulty, grade level, and knowledge points. Human evaluation results are derived from exam scores.
Model | Easy | Medium | Hard | Grade 7 | Grade 8 | Grade 9 | Trans | Shape | Func | Average |
---|---|---|---|---|---|---|---|---|---|---|
Human | 90.7 | 81.9 | 47.6 | 85.6 | 73.7 | 77.9 | 81.1 | 83.2 | 77.5 | 80.4 |
Gemini-Pro-V | 10.1 | 5.7 | 1.8 | 10.0 | 5.3 | 6.7 | 6.6 | 5.7 | 6.4 | 6.2 |
Claude-3-Opus | 31.7 | 17.3 | 7.2 | 32.5 | 14.9 | 2.2 | 20.8 | 18.5 | 12.9 | 19.2 |
GPT-4 | 37.0 | 20.3 | 7.2 | 38.7 | 17.1 | 26.2 | 23.3 | 21.4 | 18.1 | 22.5 |
GPT-4V | 35.2 | 18.1 | 7.2 | 31.2 | 17.2 | 22.3 | 18.4 | 21.4 | 13.3 | 20.4 |
GPT-4o | 41.4 | 23.9 | 3.6 | 35.0 | 23.9 | 30.5 | 22.8 | 29.7 | 19.4 | 27.6 |
Model | Easy | Medium | Hard | Grade 7 | Grade 8 | Grade 9 | Trans | Shape | Func | Average |
---|---|---|---|---|---|---|---|---|---|---|
Human | 90.7 | 81.9 | 47.6 | 85.6 | 73.7 | 77.9 | 81.1 | 83.2 | 77.5 | 80.4 |
DeepSeek-VL-7B-Chat | 17.4 | 4.7 | 1.4 | 7.5 | 6.6 | 3.9 | 3.4 | 6.0 | 3.5 | 5.4 |
Yi-34B-Chat | 12.9 | 5.0 | 1.5 | 21.3 | 5.6 | 3.5 | 5.0 | 7.6 | 3.8 | 6.5 |
LLaVA-V1.6-34B | 8.8 | 5.4 | 1.8 | 12.6 | 6.5 | 4.2 | 4.0 | 6.5 | 3.8 | 5.8 |
InternVL-4B-Chat-1.5 | 18.5 | 10.7 | 1.8 | 12.5 | 11.1 | 11.9 | 11.4 | 12.3 | 5.5 | 11.6 |
Qwen-VL-Max | 14.5 | 11.2 | 3.6 | 16.2 | 1.1 | 11.3 | 11.0 | 12.5 | 10.5 | 11.4 |
Gemini-Pro-V | 19.3 | 8.2 | 0.0 | 1.5 | 7.4 | 11.5 | 10.4 | 10.6 | 7.1 | 9.7 |
Claude-3-Opus | 29.5 | 19.3 | 3.6 | 32.5 | 16.4 | 23.0 | 20.6 | 21.7 | 16.9 | 20.3 |
GPT-4V | 37.8 | 21.2 | 1.8 | 28.7 | 17.9 | 28.0 | 22.2 | 24.7 | 19.5 | 23.1 |
GPT-4o | 45.8 | 30.0 | 10.9 | 40.0 | 26.0 | 36.0 | 30.7 | 33.7 | 26.2 | 31.8 |
All data in MM-MATH are standardized to the following format:
{
  "question": "The text of each question; formulas conform to LaTeX code.",
  "file_name": "The names of the question's images in the image folder.",
  "solution": "The text of each question's solution; formulas conform to LaTeX code.",
  "year": "The grade level, annotated from the year of the examination.",
  "difficult": "The difficulty level, annotated based on examination scores.",
  "knowledge": "The knowledge points covered by the question, annotated by middle school teachers."
}
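A short sketch of reading and filtering the annotations, assuming they are stored as a JSON array in a local file; the file name `mm_math.json` and the label strings used in the filter are assumptions, not guaranteed by the dataset:

```python
import json

# Path is an assumption; point it at your local copy of the annotation file.
with open("mm_math.json", encoding="utf-8") as f:
    problems = json.load(f)

# Filter by the annotated dimensions; the label strings ("hard", "nine")
# are illustrative and may differ from the actual annotation values.
hard_grade_nine = [
    p for p in problems
    if p["difficult"] == "hard" and p["year"] == "nine"
]

for p in hard_grade_nine[:3]:
    print(p["file_name"], p["knowledge"])
```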