Evaluating Mathematical Reasoning of Large Language Models: A Focus on Error Identification and Correction

This is the repo for our paper: Evaluating Mathematical Reasoning of Large Language Models: A Focus on Error Identification and Correction.

Overview

This repo includes our datasets and code for generation and evaluation.

  • The directory data contains 8 datasets we contributed for Error Identification and Correction.
  • The directory evaluation contains evaluation results for closed-source models (GPT-3.5, GPT-4, GLM-4, Gemini Pro) and open-source models (LLaMA-2-7B, LLaMA-2-13B, MetaMath-7B, MetaMath-13B).
  • The directory code contains our generation and evaluation code.

Data Release

The directory data contains 8 datasets for Error Identification and Correction. The suffix of each folder name indicates the source of the generated data ([GSM8K](https://github.com/openai/grade-school-math) or [MathQA](https://math-qa.github.io/)).

To be specific, each data instance is a dict with the following keys (a loading sketch follows the list):

  • question: the problem statement.
  • original_solution: the correct solution.
  • original_answer: the correct answer.
  • transformed_solution: the incorrect solution after the transformation.
  • transformed_answer: the incorrect answer after the transformation.
  • wrong_step: the first wrong step in transformed_solution.
  • wrong_type: the error type of transformed_solution.
  • is_single_error: a boolean indicating whether only a single error is introduced.
  • explanation: an explanation of how transformed_solution is derived from original_solution.
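
For illustration, here is a minimal loading sketch. It assumes each dataset file stores its records as a JSON list and uses a hypothetical path; adjust both to the actual layout of the data directory (e.g., JSON Lines files would need line-by-line parsing).

    import json

    # Hypothetical path; adjust to the actual file layout under data/.
    path = "data/calculation_error_GSM8K.json"

    with open(path, "r", encoding="utf-8") as f:
        cases = json.load(f)  # assumed: a list of dicts with the keys listed above

    first = cases[0]
    for key in ("question", "original_solution", "original_answer",
                "transformed_solution", "transformed_answer",
                "wrong_step", "wrong_type", "is_single_error", "explanation"):
        print(f"{key}: {first[key]}")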

Data Generation & Evaluation Process

The directory code contains all generation and evaluation code.

generate:

  • generate.py: Use this script to generate error cases (generated_cases_GSM8K). For example:

    python generate.py --model_name gpt-3.5-turbo-1106 --dataset GSM8K --selected_type calculation_error --expected_cases 100

evaluate:

  • evaluate.py: Use this script to evaluate the generated error cases with closed-source models; the process is similar for open-source models. For example:

    python evaluate.py --model_name gpt-3.5-turbo-1106 --dataset GSM8K --selected_type calculation_error --selected_test any --expected_cases 100

Parameter description:

  • model_name: GPT-3.5, GPT-4, GLM-4, or Gemini Pro (the example above uses gpt-3.5-turbo-1106).
  • dataset: GSM8K or MathQA.
  • selected_type: one of the nine error types: calculation_error, referencing_context_value_error, referencing_previous_step_value_error, confusing_formula_error, counting_error, missing_step, adding_irrelevant_information, operator_error, or unit_conversion_error (the sketch after this list loops over all nine).
  • selected_test: any, step, type, or correction. The suffix simple denotes zero-shot prompting; complex denotes few-shot prompting.
  • expected_cases: the number of cases to generate or evaluate.
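
To run the full evaluation in one go, a small driver like the one below can loop evaluate.py over all nine error types. This is a convenience sketch, not part of the released code; it assumes evaluate.py accepts the flags exactly as documented above and is launched from the code directory.

    import subprocess

    ERROR_TYPES = [
        "calculation_error", "referencing_context_value_error",
        "referencing_previous_step_value_error", "confusing_formula_error",
        "counting_error", "missing_step", "adding_irrelevant_information",
        "operator_error", "unit_conversion_error",
    ]

    # Mirror the documented example command for each error type.
    for error_type in ERROR_TYPES:
        subprocess.run(
            ["python", "evaluate.py",
             "--model_name", "gpt-3.5-turbo-1106",
             "--dataset", "GSM8K",
             "--selected_type", error_type,
             "--selected_test", "any",
             "--expected_cases", "100"],
            check=True,
        )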

Evaluation Results

The directory evaluation contains evaluation results for both closed-source and open-source models.
