Evaluating Mathematical Reasoning of Large Language Models: A Focus on Error Identification and Correction
This is the repo for our paper: Evaluating Mathematical Reasoning of Large Language Models: A Focus on Error Identification and Correction.
This repo includes our datasets and our code for generation and evaluation.
- The directory `data` contains the 8 datasets we contributed for Error Identification and Correction.
- The directory `evaluation` contains evaluation results for closed-source models (GPT-3.5, GPT-4, GLM-4, Gemini Pro) and open-source models (LLaMA-2-7B, LLaMA-2-13B, MetaMath-7B, MetaMath-13B).
- The directory `code` contains our generation and evaluation code.
The directory `data` contains 8 datasets for Error Identification and Correction. The suffix of each folder name indicates the source of the generated data ([GSM8K](https://github.com/openai/grade-school-math) or [MathQA](https://math-qa.github.io/)). Specifically:
- `generated_cases_GSM8K`: cases of nine error types generated from GSM8K by GPT-4.
- `generated_cases_MathQA`: cases of nine error types generated from MathQA by GPT-4.
- `EP_robustness_testing_cases_GSM8K`: cases of nine error types for EP robustness testing from GSM8K, containing 50 correct cases and 50 incorrect cases.
- `EP_robustness_testing_cases_MathQA`: cases of nine error types for EP robustness testing from MathQA.
- `incomplete_generated_cases_GSM8K`: incompletely generated cases from GSM8K.
- `incomplete_generated_cases_MathQA`: incompletely generated cases from MathQA.
- `step_number_cases_GSM8K`: cases grouped by the number of solution steps, from GSM8K.
- `step_number_cases_MathQA`: cases grouped by the number of solution steps, from MathQA.
Each data item is a dict, and the keys are:
- `question`: the problem.
- `original_solution`: the correct solution.
- `original_answer`: the correct answer.
- `transformed_solution`: the incorrect solution after transformation.
- `transformed_answer`: the incorrect answer after transformation.
- `wrong_step`: the first wrong step occurring in `transformed_solution`.
- `wrong_type`: the error type of `transformed_solution`.
- `is_single_error`: boolean indicating whether `wrong_type` is a single error.
- `explanation`: explanation of how `transformed_solution` is transformed from `original_solution`.
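For illustration, a record with these fields might look like the following Python dict. The question, solution text, and values here are invented for this sketch and are not taken from the released datasets:

```python
# Hypothetical example record; every value below is invented for illustration.
case = {
    "question": "Tom buys 3 packs of pens. Each pack has 12 pens. How many pens does he have?",
    "original_solution": "Step 1: Tom has 3 * 12 = 36 pens.",
    "original_answer": "36",
    "transformed_solution": "Step 1: Tom has 3 * 12 = 24 pens.",
    "transformed_answer": "24",
    "wrong_step": 1,                    # first step where the injected error occurs
    "wrong_type": "calculation_error",  # one of the nine error types
    "is_single_error": True,            # only a single error was injected
    "explanation": "The product 3 * 12 in step 1 was changed from 36 to 24, "
                   "so the transformed answer becomes 24.",
}
```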
The directory `code` contains all generation and evaluation code.

generate:
- `generate.py`: use this to generate error cases (e.g., `generated_cases_GSM8K`). For example:
  `python generate.py --model_name gpt-3.5-turbo-1106 --dataset GSM8K --selected_type calculation_error --expected_cases 100`
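After generation, a quick consistency check over the output can catch malformed cases. This is a minimal sketch assuming the cases are saved as a JSON list; the file path and the "Step N:" formatting convention are assumptions, not part of the documented interface:

```python
import json

# Assumed output path and format; adjust to what generate.py actually writes.
with open("data/generated_cases_GSM8K/calculation_error.json") as f:
    cases = json.load(f)

for case in cases:
    # The transformation must actually change the answer...
    assert case["transformed_answer"] != case["original_answer"]
    # ...and the labeled first wrong step must exist in the transformed
    # solution (assuming steps are written as "Step 1:", "Step 2:", ...).
    n_steps = case["transformed_solution"].count("Step")
    assert 1 <= case["wrong_step"] <= n_steps
```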
evaluate:
- `evaluate.py`: use this to evaluate the generated error cases with closed-source models; the procedure is similar for open-source models. For example:
  `python evaluate.py --model_name gpt-3.5-turbo-1106 --dataset GSM8K --selected_type calculation_error --selected_test any --expected_cases 100`
Parameter description:
- `model_name`: GPT-3.5, GPT-4, GLM-4, or Gemini Pro.
- `dataset`: GSM8K or MathQA.
- `selected_type`: one of the nine error types: calculation_error, referencing_context_value_error, referencing_previous_step_value_error, confusing_formula_error, counting_error, missing_step, adding_irrelevant_information, operator_error, or unit_conversion_error.
- `selected_test`: `any`, `step`, `type`, or `correction`; the suffix `simple` denotes zero-shot and `complex` denotes few-shot.
- `expected_cases`: the number of cases to be generated or evaluated.
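To aggregate a finished run into accuracy numbers, a computation like the sketch below may help. The results path and the `predicted_*` field names are hypothetical; substitute whatever schema evaluate.py actually writes:

```python
import json

# Hypothetical results file: a JSON list pairing each ground-truth case
# with the model's parsed prediction. Field names are assumptions.
with open("results/gpt-3.5-turbo-1106_GSM8K_calculation_error_any.json") as f:
    results = json.load(f)

# "any" test: did the model correctly judge whether the solution has an error?
any_acc = sum(r["predicted_has_error"] == r["has_error"] for r in results) / len(results)

# "step" test: did the model point to the first wrong step?
step_acc = sum(r["predicted_wrong_step"] == r["wrong_step"] for r in results) / len(results)

print(f"any accuracy:  {any_acc:.2%}")
print(f"step accuracy: {step_acc:.2%}")
```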
The directory `evaluation` contains evaluation results for closed-source and open-source models.