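This project collects datasets, papers, tools, and other resources for the math word problem (MWP) task.
Reproductions of existing work live in the codes folder. Its preprocess_data folder holds the dataset preprocessing code, and the top of each file records the dataset's original download location, the paper to cite, and the preprocessing logic. The results folder holds the results of different algorithms on public datasets. For the commands that run the code, see codes/README.md.
The datasets can be large, so I have not uploaded them to GitHub; the original download address of each dataset is given in its preprocessing code.
The rest of this document first reports experimental results of different methods on the MWP task, then covers MWP datasets, then MWP papers, and finally MWP tools.
This project originally aimed to cover all work on numerical reasoning and has since narrowed its focus to the MWP task, so its structure is being reorganized. This paragraph will be removed once the content below has been updated.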
[TOC]
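Accuracy on QA-format MWP tasks (only problems whose output is a single numeric answer are considered; other experimental settings are listed after the table). This is equivalent to computing test@1 only; other metrics are not reported:
Some datasets carry extra annotations (such as equations, reasoning steps, or calculator annotations). Where a result uses them, I state it explicitly; where nothing is stated, they were not used.
Decoding hyperparameters and similar settings have not been specifically tuned.
Results may have high variance. The code may contain bugs; I will update the results whenever a bug is fixed.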
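As a rough illustration of the accuracy metric described above, here is a minimal sketch. It is not the code in the codes folder: the function names `parse_number` and `accuracy`, the rule of taking the last number in the model output, and the tolerance values are all assumptions made for this example.

```python
import math
import re


def parse_number(text):
    """Return the last number found in `text` as a float, or None if there is none.

    Taking the last number is an assumption of this sketch, not a rule from the project.
    """
    matches = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return float(matches[-1]) if matches else None


def accuracy(predictions, references, rel_tol=1e-4):
    """Fraction of problems whose single predicted answer matches the reference value."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    correct = 0
    for prediction, reference in zip(predictions, references):
        value = parse_number(prediction)
        # An output with no parsable number counts as wrong.
        if value is not None and math.isclose(value, reference, rel_tol=rel_tol, abs_tol=1e-9):
            correct += 1
    return correct / len(references)


if __name__ == "__main__":
    outputs = ["The answer is 12", "3.5", "no idea"]
    gold = [12.0, 3.0, 7.0]
    print(f"accuracy = {accuracy(outputs, gold):.2%}")
```

Only one prediction per problem is scored, which is what "test@1 only" amounts to here.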
Method | Alg514 | AI2 | Dolphin1878 | Math23K | ASDiv | Ape210K | GSM8K | SVAMP |
---|---|---|---|---|---|---|---|---|
GPT-2 | 0 | |||||||
GPT-2 finetune① | 0 | 0.14% | 1.06%③ | |||||
GPT-2 finetune① + calculator② | - | - | - | - | - | - | 1.13% | - |
GPT-2 verifier①② | - | - | - | - | - | - | 0.91% | |
GPT-3.5-Turbo | 82.86% | 93.15% | 66.67% | 60.3% | 86.19% | 46.94% | 78.92% | 79.78% |
GPT-3.5-Turbo CoT | 85.71% | |||||||
GPT-3.5-Turbo CoT+tip | 80% | |||||||
GPT-3.5-Turbo CoT+SC | ||||||||
GPT-3.5-Turbo PRP | 94.29% | |||||||
ChatGLM3-6B | 65.71% | |||||||
GLM-4 | 77.14% | |||||||
Yi-large | 94.29% | |||||||
DeepSeek-V2 | 91.43% | |||||||
Moonshot | 88.57% | |||||||
LLaMA3-8B-Instruct | 65.71% | |||||||
CPM-2 prompt-based finetune |
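- Datasets without an original split are split randomly in an 8:1:2 ratio: Alg514, AI2, Dolphin1878, SVAMP
- Datasets that use the split provided with the original dataset: Math23K, Ape210K, GSM8K
- Rationale for tip: the article "Tipping ChatGPT really works! A tip of 10 or 100,000 works remarkably well, but a tip of 0.1 makes results worse instead of better" (original title: 给ChatGPT小费真的好使!10块或10万效果拔群,但给1毛不升反降)
- SC (self-consistency): (2023 ICLR) Self-Consistency Improves Chain of Thought Reasoning in Language Models
- PRP: (2024 AAAI) Re61: Paper reading: PRP, Get an A in Math: Progressive Rectification Prompting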
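The SC entry above boils down to sampling several reasoning paths and keeping the most frequent final answer. A minimal sketch of just the voting step follows; it assumes the sampled outputs have already been parsed into numeric answers (with `None` for outputs that could not be parsed), and the function name `self_consistent_answer` is made up for this example.

```python
from collections import Counter


def self_consistent_answer(sampled_answers):
    """Majority vote over the final answers of several sampled reasoning paths.

    Entries that are None (unparsable outputs) are ignored.
    Returns None when no sample produced a usable answer.
    """
    votes = Counter(answer for answer in sampled_answers if answer is not None)
    if not votes:
        return None
    # most_common(1) returns [(answer, count)] for the top-voted answer.
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    samples = [18.0, 20.0, 18.0, None, 18.0]
    print(self_consistent_answer(samples))
```

The sampling itself (calling the model several times with a non-zero temperature) is left out on purpose, since it depends on the API being used.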
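① The equation / reasoning-process / calculator annotations that ship with the dataset are added to the generation targets to assist training. Which one is used can be seen from the answer_with_reasoning key in get_data.py.
② Uses the calculator annotations that ship with GSM8K to assist inference.
③ After some trial and error, decoding with sampling (max_new_tokens=50, do_sample=True, top_k=50, top_p=0.95) gave the best result (2.4%), but as long as the general picture is clear, the exact number does not matter much.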
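For the datasets without an original split, the random 8:1:2 split mentioned in the notes above could look roughly like the sketch below. The function name `split_dataset`, the fixed seed, and the reading of the three parts as train / validation / test in that order are assumptions of this example, not taken from the preprocessing code.

```python
import random


def split_dataset(examples, ratios=(8, 1, 2), seed=42):
    """Shuffle `examples` and cut them into three parts proportional to `ratios`.

    The seed and the train/validation/test ordering are assumptions of this sketch.
    """
    items = list(examples)
    random.Random(seed).shuffle(items)  # local RNG, so the global random state is untouched
    total = sum(ratios)
    first_end = len(items) * ratios[0] // total
    second_end = first_end + len(items) * ratios[1] // total
    # Whatever is left after the first two parts goes to the third part.
    return items[:first_end], items[first_end:second_end], items[second_end:]


if __name__ == "__main__":
    train, valid, test = split_dataset(range(110))
    print(len(train), len(valid), len(test))
```

Fixing the seed keeps the split reproducible across runs, which matters when results from different methods are compared on the same test part.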
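Download addresses take up too much space to list here; they are given in the data preprocessing code files.
Listed in chronological order where possible. I am unsure about the relative order of some entries, so there may be mistakes.
Dataset | Language | Source | Samples | Why it could not be downloaded, and other notes |
---|---|---|---|---|
Dolphin18K | English | (2016 ACL) How well do Computers Solve Math Word Problems? Large-Scale Dataset Construction and Evaluation | 18460 | The data has to be downloaded from Yahoo! Answers via URLs, but Yahoo! Answers has shut down. I have not found a source for downloading the dataset directly; if you know of one, please let me know. |
MAWPS | English | (2016 NAACL) MAWPS: A Math Word Problem Repository | 100K | Maven is not installed on my server; I will download the data when I get the chance. |
SuperCLUE-Math6 | Chinese | (2024) SuperCLUE-Math6: Graded Multi-Step Math Reasoning Benchmark for LLMs in Chinese | | Requires an application, which I have not gone through. |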
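- (AAAI) Re61: Paper reading: PRP, Get an A in Math: Progressive Rectification Prompting: reproduced (PRP). Standard QA-format MWP task that uses an LLM for inference. The idea is to predict an answer, then mask another variable in the problem and have the LLM predict the masked variable from the predicted answer; if that prediction is correct, the answer is accepted as correct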
- (Tsinghua) Augmenting Math Word Problems via Iterative Question Composing
- Scaling the Authoring of AutoTutors with Large Language Models
- Geometry problems
- (ICDE) Enhancing Quantitative Reasoning Skills of Large Language Models through Dimension Perception: focuses on the units (dimensions) of quantities
- BIBench: Benchmarking Data Analysis Knowledge of Large Language Models: this one is data-analysis research from the business intelligence side; it arguably still counts as numerical reasoning
- SuperCLUE-Math6: Graded Multi-Step Math Reasoning Benchmark for LLMs in Chinese
- Numerical reasoning
- (ACL) A Causal Framework to Quantify the Robustness of Mathematical Reasoning with Language Models
- (ICLR) Dynamic Prompt Learning via Policy Gradient for Semi-structured Mathematical Reasoning
- (KDD) Exploiting Relation-aware Attribute Representation Learning in Knowledge Graph Embedding for Numerical Reasoning
- (AAAI) An Independent Evaluation of ChatGPT on Mathematical Word Problems (MWP)
- (EMNLP) MarkQA: A large scale KBQA dataset with numerical reasoning
- (EMNLP) ATHENA: Mathematical Reasoning with Thought Expansion
- (EMNLP) UniMath: A Foundational and Multimodal Mathematical Reasoner
- (ICML) Large Language Models Can Be Easily Distracted by Irrelevant Context
- (EACL) ComSearch: Equation Searching with Combinatorial Strategy for Solving Math Word Problems with Weak Supervision
- (TMLR) Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks
- Scaling Relationship on Learning Mathematical Reasoning with Large Language Models: finds that model performance improves as the amount of fine-tuning data grows, and proposes RFT to sample data automatically
- CREATOR: Tool Creation for Disentangling Abstract and Concrete Reasoning of Large Language Models
- (CUHK + Tencent) StrategyLLM: Large Language Models as Strategy Generators, Executors, Optimizers, and Evaluators for Problem Solving: this is more of a general-purpose approach, but its downstream tasks include numerical reasoning
- (Yale and several other universities) DocMath-Eval: Evaluating Numerical Reasoning Capabilities of LLMs in Understanding Long Documents with Tabular Data: mainly targets numerical reasoning over long documents with tables
- (Renmin University of China, iFLYTEK, and others) Evaluating and Improving Tool-Augmented Computation-Intensive Math Reasoning
- Learning From Mistakes Makes LLM Better Reasoner
- MWP
- (ACL OpenAI) Interpretable Math Word Problem Solution Generation Via Step-by-step Planning: focuses on partial credit for steps (just kidding)
- Code: GSM8K dataset
- (ACL) Solving Math Word Problems via Cooperative Reasoning induced Language Models
- (ACL) Plan-and-Solve Prompting: Improving Zero-Shot Chain-of-Thought Reasoning by Large Language Models
- (ACL Findings) Compositional Mathematical Encoding for Math Word Problems
- (ACL Industry) MathPrompter: Mathematical Reasoning using Large Language Models
- (AAAI) Generalizing Math Word Problem Solvers via Solution Diversification
- (EMNLP) Non-Autoregressive Math Word Problem Solver with Unified Tree Structure
- (EMNLP) Let GPT be a Math Tutor: Teaching Math Word Problem Solvers with Customized Exercise Generation
- (EMNLP Findings) Conic10K: A Challenging Math Problem Understanding and Reasoning Dataset
- (TKDD) Math Word Problem Generation via Disentangled Memory Retrieval
- (ICLR) Self-Consistency Improves Chain of Thought Reasoning in Language Models
- (IJCNN) Improving Math Word Problems Solver with Logical Semantic Similarity
- (IJCNN) Solving Math Word Problems Following Logically Consistent Template
- (NLPCC) Solving Math Word Problem with Problem Type Classification
- (ICANN) Solving Math Word Problem with External Knowledge and Entailment Loss
- (IEEE International Conference on Big Data) Combining Transformers and Tree-based Decoders for Solving Math Word Problems
- (BEA) Scalable and Explainable Automated Scoring for Open-Ended Constructed Response Math Word Problems: focuses on the MPT problem
- (ICLP) Enhancing Math Word Problem Solving Through Salient Clue Prioritization: A Joint Token-Phrase-Level Feature Integration Approach
- (Computación y Sistemas) Math Word Problem Solving: Operator and Template Techniques with Multi-Head Attention
- Solving Math Word Problems by Combining Language Models With Symbolic Solvers
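- Solving Challenging Math Word Problems Using GPT-4 Code Interpreter with Code-based Self-Verification: the gist is to use the GPT-4 Code Interpreter and combine code with text to produce stronger MWP results (I have not read the details yet)
- Exploring Equation as a Better Intermediate Meaning Representation for Numerical Reasoning: uses equations rather than programs as the model's intermediate meaning representation (IMR); the equations are generated by an LLM
- (Tsinghua + Zhipu AI) GPT Can Solve Mathematical Problems Without a Calculator: proposes MathGLM (a modified GLM-10B)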
- Chatbots put to the test in math and logic problems: A preliminary comparison and assessment of ChatGPT-3.5, ChatGPT-4, and Google Bard
- An Empirical Study on Challenging Math Problem Solving with GPT-4
- (Yale & CMU) ProofNet: Autoformalizing and Formally Proving Undergraduate-Level Mathematics: proof problems
- TinyGSM: achieving >80% on GSM8k with small language models
- Progressive-Hint Prompting Improves Reasoning in Large Language Models
- Numerical representation
- Set reasoning
- Finance
- (Nature DeepMind) Mathematical discoveries from program search with large language models: the FunSearch model, which solves mathematical problems by searching over functions
- (ACL) A Survey of Deep Learning for Mathematical Reasoning
- (ACL Findings) World Models for Math Story Problems
- (EMNLP) MAF: Multi-Aspect Feedback for Improving Reasoning in Large Language Models
- (EMNLP Findings) Large Language Models are Better Reasoners with Self-Verification: includes numerical-reasoning downstream tasks
- (EACL) BERT is not The Count: Learning to Match Mathematical Statements with Proofs
- (华师) Math-KG: Construction and Applications of Mathematical Knowledge Graph
- Mathematical Language Models: A Survey
- Bridging the Semantic-Numerical Gap: A Numerical Reasoning Method of Cross-modal Knowledge Graph for Material Property Prediction
2022
- Numerical reasoning
- MWP
- (EMNLP) Automatic Generation of Socratic Subquestions for Teaching Math Word Problems
- (COLING) WARM: A Weakly (+Semi) Supervised Model for Solving Math word Problems
- (NAACL) Practice Makes a Solver Perfect: Data Augmentation for Math Word Problem Solvers
- (NAACL) MWP-BERT: Numeracy-Augmented Pre-training for Math Word Problem Solving
- (AAAI demo) MWPToolkit: An Open-Source Framework for Deep Learning-Based Math Word Problem Solvers
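2021
- MWP
- (EMNLP) Recall and Learn: A Memory-augmented Solver for Math Word Problems: the REAL model, based on analogy and retrieval. REAL consists of four modules: a memory module, a representation module, an analogy module, and a reasoning module. For each problem, it first retrieves similar problems through the memory module, then represents them with the representation and analogy modules, both of which use self-supervised masking, and finally generates the equation with a reasoning module based on a copy mechanism. Official GitHub project: https://github.com/sfeng-m/REAL4MWP
- (EMNLP) Tree-structured Decoding for Solving Math Word Problems: generates the abstract syntax tree of the equation top-down with tree-structured decoding. Decoding can stop automatically, without a stop token.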
- (EMNLP) Graph-to-Tree Neural Networks for Learning Structured Input-Output Translation with Applications to Semantic Parsing and Math Word Problem
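- (EMNLP Findings) Generate & Rank: A Multi-task Framework for Math Word Problems: targets task-specific fine-grained optimization when MWPs are solved with a general-purpose generation framework. It builds a multi-task framework on a generative pretrained language model (BART in the paper) that jointly learns to generate and to rank, and additionally designs tree-based perturbation and an online update mechanism for the ranker. The ranker is trained on a database of historical expressions that is updated in real time.
- (NeurIPS) REAL2: An End-to-end Memory-augmented Solver for Math Word Problems: an evolved version of the REAL model. Official GitHub project: https://github.com/sfeng-m/REAL2
- (NeurIPS) Measuring Mathematical Problem Solving With the MATH Dataset. Official GitHub project: https://github.com/hendrycks/math/
- (ACL) Compositional Generalization and Natural Language Variation: Can a Semantic Parsing Approach Handle Both?: proposes the NQG-T5 model to address out-of-domain compositional generalization, which seq2seq models struggle with. It combines NQG, a high-precision grammar-based approach, with the pretrained seq2seq model T5, and performs well on both real data and standard evaluation data. For in-domain examples it outputs the NQG result directly; for out-of-domain examples it outputs the T5 result.
- (ACL | IJCNLP) Measuring and Improving BERT’s Mathematical Abilities by Predicting the Order of Reasoning: uses a special training procedure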
- (NAACL) Are NLP Models really able to Solve Simple Math Word Problems?
- Training Verifiers to Solve Math Word Problems
- Pretrained Language Models are Symbolic Mathematics Solvers too!
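- Numerical representation
- Numerical reasoning
- MathBERT: A Pre-Trained Model for Mathematical Formula Understanding: the first pretrained model for understanding mathematical formulas. The pretraining task predicts masked formula substructures extracted from the operator tree (OPT, a semantic structural representation of a formula); the downstream tasks are mathematical information retrieval, formula topic classification, and formula headline generation
- (NeurIPS MATHAI4ED Workshop) MathBERT: A Pre-trained Language Model for General NLP Tasks in Mathematics Education: this probably also counts as numerical reasoning. Pretraining data: from preschool to university level. Tasks: knowledge component prediction, auto-grading of open-ended Q&A, and knowledge tracing. The paper also builds mathVocab, a vocabulary specific to the mathematics domain
2020
- Numerical reasoning
- (EMNLP) Question Directed Graph Attention Network for Numerical Reasoning over Text: improves NumNet by using a heterogeneous directed graph that also incorporates type (unit) and entity information for numerical reasoning
- Numerical commonsense
- (EMNLP) Birds have four legs?! NumerSense: Probing Numerical Commonsense Knowledge of Pre-Trained Language Models: through numerical representation learning, an LLM can learn the approximate range a value falls in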
- MWP
- (EMNLP) Semantically-Aligned Universal Tree-Structured Solver for Math Word Problems
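- (EMNLP) A Knowledge-Aware Sequence-to-Tree Network for Math Word Problem Solving: the KA-S2T model, a tree-based representation learning approach that also incorporates external commonsense knowledge. It embeds the problem with an LSTM, builds an entity graph from the entities and types in the problem, uses a GAT with external knowledge to produce representations, and aggregates states with a tree-based decoder to capture long-range dependencies and global expression information. Official GitHub project: https://github.com/qinzhuowu/KA-S2T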
- (EMNLP) Point to the Expression: Solving Algebraic Word Problems using the Expression-Pointer Transformer Model
- (ICLR) Deep Learning For Symbolic Mathematics: this one addresses symbolic integration and differential equations. I am not sure whether it counts as MWP; it is placed here for now.
- (ICML) Mapping Natural-language Problems to Formal-language Solutions Using Structured Neural Representations
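- (COLING) Solving Math Word Problems with Multi-Encoders and Multi-Decoders: solves MWPs with multiple encoders and decoders. It uses both a text representation and a representation obtained by converting the text into graphs of the dependency parse tree and numerical comparison information and encoding them with a graph neural network. The decoders are likewise both sequence-based and tree-based, so different equations are generated, and the losses of the two equations are combined into the optimization objective of the whole model. At inference time, the equation with the higher probability is selected.
- (IEEE Transactions on Pattern Analysis and Machine Intelligence) The Gap of Semantic Parsing: A Survey on Automatic Math Word Problem Solvers: survey
- Numerical representation
- (ACL) Injecting Numerical Reasoning Skills into Language Models: logarithmic difference gives higher weight to small numbers
- (KDD) Self-Supervised Pretraining of Graph Neural Network for the Retrieval of Related Mathematical Expressions in Scientific Articles: retrieve similar formulas → retrieve similar papers
2019
- Numerical reasoning
- (NAACL) DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs: performs operations over numbers such as counting, addition, and subtraction
- (EMNLP) NumNet: Machine Reading Comprehension with Numerical Reasoning: numbers + GNN + comparison relations between numbers → numerical reasoning in context. Chinese version of the code: j30206868/numnet-chinese: Modify numnet+ for chinese
- (EMNLP | IJCNLP) Giving BERT a Calculator: Finding Operations and Arguments with Reading Comprehension: uses BERT to select (predefined) executable programs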
- MWP
- (IJCAI) A Goal-Driven Tree-Structured Neural Model for Math Word Problems
- (AAAI) Template-Based Math Word Problem Solvers with Recursive Neural Networks
- (NAACL) MathQA: Towards Interpretable Math Word Problem Solving with Operation-Based Formalisms
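- (NAACL) Semantically-Aligned Equation Generation for Solving and Reasoning Math Word Problems: integrates the reasoning process into a seq2seq model. The encoder first represents the semantics of the constants in the problem (understanding the problem), and the decoder then decides the numbers and operators of the equation one by one, mimicking human reasoning. In the decoder, numbers extracted from the text or generated are placed on a stack and emitted one by one, with matching operators generated along the way. The results are better than using a seq2seq model directly
2018
- (ACL) Numeracy for Language Models: Evaluating and Improving their Ability to Predict Numbers: uses MAPE (median absolute percentage error) as the loss function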
- MWP
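- (EMNLP) Translating a Math Word Problem to a Expression Tree: the paper argues that a problem can be solved by equations in several forms, which seq2seq methods cannot handle, so it maps an MWP to a tree (a normalization of the equation) for modeling. The model is an encoder-decoder architecture that ensembles several models and generates the postorder traversal of the expression tree. In Solving Math Word Problems with Multi-Encoders and Multi-Decoders this model is called Math-EN
- (AAAI) MathDQN: Solving Arithmetic Word Problems via Deep Reinforcement Learning: reinforcement learning. Extracts the numbers, pairs them up, extracts features of each pair, combines them with context to form the state, feeds the state into a neural network to select an action (the action is the choice of operator), and then checks whether the selected action is correct; a correct action gets a reward of 1, otherwise the network goes back to training. Official GitHub project: uestc-db/DQN_Word_Problem_Solver (Python 2 code, which I would rather not reproduce). (I do not work on reinforcement learning and have not read the paper, so I have not fully understood it; still, isn't the operator choice space quite small? Does this really need reinforcement learning?)
- (COLING) Neural Math Word Problem Solver with Reinforcement Learning: the CASS model. Reinforcement learning. Integrates copy and alignment mechanisms into a seq2seq model to address spurious and misplaced numbers in the generated output. The reinforcement learning training framework unifies the training loss with the evaluation metric. CASS also uses the model output as feature input to a classification model.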
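Several entries above predict an equation as the postorder traversal of an expression tree. To make that output format concrete, here is a small sketch that flattens a binary expression tree into its postorder token sequence and evaluates the sequence with a stack. The `Node` class and both function names are invented for this example, and it does not reproduce any paper's normalization or model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """A node of a binary expression tree: an operator with two children, or a number leaf."""
    token: str
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def postorder(node):
    """Return the tokens of the tree in postorder (left subtree, right subtree, root)."""
    if node is None:
        return []
    return postorder(node.left) + postorder(node.right) + [node.token]


OPERATORS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
    "/": lambda a, b: a / b,
}


def evaluate_postfix(tokens):
    """Evaluate a postorder (postfix) token sequence with a stack."""
    stack = []
    for token in tokens:
        if token in OPERATORS:
            right = stack.pop()
            left = stack.pop()
            stack.append(OPERATORS[token](left, right))
        else:
            stack.append(float(token))
    if len(stack) != 1:
        raise ValueError("malformed postfix expression")
    return stack[0]


if __name__ == "__main__":
    # (3 + 5) * 2
    tree = Node("*", Node("+", Node("3"), Node("5")), Node("2"))
    tokens = postorder(tree)
    print(tokens, evaluate_postfix(tokens))
```

A postorder sequence needs no parentheses, which is one reason it is a convenient decoding target for a sequence model.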
2017
- MWP
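- (EMNLP) Re43: Paper reading: DNS, Deep Neural Solver for Math Word Problems: the first paper to solve MWPs with neural networks, mapping the problem directly to an equation with an RNN. It then combines the RNN with a similarity-based retrieval model: when the retrieval model's similarity score is above a threshold, the equation template of the retrieved problem is used; otherwise the RNN is used.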
- (ACL) Program Induction by Rationale Generation : Learning to Solve and Explain Algebraic Word Problems
- (EACL) Annotating Derivations: A New Evaluation Strategy and Dataset for Algebra Word Problems
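The hybrid strategy described for DNS in this list (use the retrieved equation template when the similarity score clears a threshold, otherwise fall back to the neural model) can be sketched as below. This is only an illustration of the control flow: word-level Jaccard similarity, the threshold value of 0.7, and the names `jaccard` and `hybrid_predict` are assumptions of this example, and `seq2seq_predict` stands in for whatever neural model is available.

```python
def jaccard(question_a, question_b):
    """Word-level Jaccard similarity between two questions (a simplification for this sketch)."""
    words_a = set(question_a.lower().split())
    words_b = set(question_b.lower().split())
    if not words_a and not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


def hybrid_predict(question, train_pairs, seq2seq_predict, threshold=0.7):
    """Pick between a retrieved equation template and a neural model's output.

    train_pairs: list of (question, equation_template) pairs from the training set.
    seq2seq_predict: callable mapping a question to an equation template (hypothetical).
    Returns (template, source), where source is "retrieval" or "seq2seq".
    """
    best_template, best_score = None, -1.0
    for train_question, template in train_pairs:
        score = jaccard(question, train_question)
        if score > best_score:
            best_template, best_score = template, score
    if best_template is not None and best_score > threshold:
        return best_template, "retrieval"
    return seq2seq_predict(question), "seq2seq"


if __name__ == "__main__":
    pairs = [("Tom has n1 apples and buys n2 more how many apples does he have", "x = n1 + n2")]
    fallback = lambda q: "x = n1 * n2"  # stand-in for a trained model
    print(hybrid_predict("Tom has n1 apples and buys n2 more how many apples does he have", pairs, fallback))
    print(hybrid_predict("A train travels n1 km in n2 hours what is its speed", pairs, fallback))
```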
2016
- MWP
- (ACL) How well do Computers Solve Math Word Problems? Large-Scale Dataset Construction and Evaluation: finds that a simple similarity-based method already outperforms most statistical learning models
- (ACL) Learning To Use Formulas To Solve Simple Arithmetic Problems
2015
- MWP
- (EMNLP) Automatically Solving Number Word Problems by Semantic Parsing and Reasoning
- (EMNLP) Learn to Solve Algebra Word Problems Using Quadratic Programming: proposes the ZDC model (an improvement over the KAZB model)
- (EMNLP) Solving General Arithmetic Word Problems
- (TACL) Reasoning about Quantities in Natural Language: quantity detection, Quantity Entailment, and the MWP task
2014
- MWP
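- (EMNLP) Re42: Paper reading: ARIS, Learning to Solve Arithmetic Word Problems with Verb Categorization: the first non-template-based method for solving MWPs, covering addition and subtraction problems. It predicts verb categories to classify the problem, uses some other hand-crafted features, extracts entities, numbers, and other information from the problem, and derives the equation from a state transition table
- (ACL) Learning to Automatically Solve Algebra Word Problems: proposes the KAZB model, a template-based method that maps a problem to an equation template already present in the training set
2011
2009
- (SIGIR) Learning to rank for quantity consensus queries: a retrieval task that ranks by quantity
1963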
- MWP
- Computers and thought
- Pure mathematics websites