Code for the ACL 2023 paper [Tree-Based Representation and Generation of Natural and Mathematical Language](https://aclanthology.org/2023.acl-long.205).

If you use this code in your research, please cite us:
```bibtex
@inproceedings{scarlatos-lan-2023-tree,
    title = "Tree-Based Representation and Generation of Natural and Mathematical Language",
    author = "Scarlatos, Alexander and Lan, Andrew",
    booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = jul,
    year = "2023",
    address = "Toronto, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.acl-long.205",
    pages = "3714--3730",
}
```
Ensure Python 3 is installed (this code was tested on v3.9.1).

Create a virtual environment:

```
python3 -m venv <env_name>
source <env_name>/bin/activate
```

Install the required libraries:

```
python3 -m pip install -r requirements.txt
```

Make TangentCFT available on the Python path:

```
export PYTHONPATH=..:../TangentCFT/
```
The following need to be installed for full functionality (LaTeXML is only required to run pre-processing):

- TangentCFT: download it to the folder above the root of this repo, at this commit: https://github.com/BehroozMansouri/TangentCFT/tree/2b189dff67d6d3b15e323358921bdbca779bfcd9 (see the clone commands after this list). Note that some small fixes were made to TangentCFT, so the file `semantic_symbol.py` is copied from there with some changes and is overloaded automatically.
- LaTeXML: https://math.nist.gov/~BMiller/LaTeXML/get.html
- nlg-eval: https://github.com/Maluuba/nlg-eval (known installation issue: Maluuba/nlg-eval#61)
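For example, TangentCFT can be fetched and pinned to the commit above by running the following from the folder above this repo's root:

```bash
# Clone TangentCFT next to this repo and check out the pinned commit.
git clone https://github.com/BehroozMansouri/TangentCFT
cd TangentCFT
git checkout 2b189dff67d6d3b15e323358921bdbca779bfcd9
```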
Here are links to the datasets required for the following tasks. For each, ensure the dataset's root folder is above the root of this repo (a sketch of the expected layout follows this list).

- Pre-Training: https://ntcir-math.nii.ac.jp/data/
  - Just the Wikipedia Corpus is needed.
- Headline Generation: https://github.com/yuankepku/MathSum
- Equation Extraction: https://ai.tencent.com/ailab/nlp/dialogue/#datasets
  - The dataset will have to be translated to English; Google Translate is acceptable.
  - See `process_mwp_data` in `pre_process.py` for the desired final data format.
- Student Action Prediction: https://pslcdatashop.web.cmu.edu/DatasetInfo?datasetId=660
- GSM8K: https://github.com/openai/grade-school-math
- MATH: https://github.com/hendrycks/math
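For reference, the layout would look roughly like the sketch below. The angle-bracketed folder names are placeholders (the actual expected paths are defined in the pre-processing code); the key point is that TangentCFT and each dataset root sit next to this repo's folder:

```
<parent folder>/
├── <this repo>/
├── TangentCFT/
├── <Wikipedia corpus root>/
└── <downstream dataset roots>/
```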
The datasets for the following tasks cannot be released publicly:
- Answer Scoring
- Feedback Generation
The starting point for all code is `__main__.py`, and you can see a list of command line options by running:

```
python3 __main__.py --help
```

Default values can be found in the `TrainOptions` constructor in `utils.py`.
Here is the typical workflow to replicate our experiments (a sketch of the corresponding commands follows this list):

- Pre-process the Wikipedia dataset (this step also constructs the vocabulary, which is needed for all following steps).
- Pre-train a MathGPT model.
- Pre-process the downstream dataset.
- Run cross-validation on the downstream dataset, which for each fold:
  - Fine-tunes the pre-trained MathGPT model on the downstream dataset.
  - Runs evaluation on the downstream dataset's test set.
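The sketch below is purely illustrative: the mode names and flags are hypothetical stand-ins, not the repo's actual interface. Check `python3 __main__.py --help` and the `TrainOptions` defaults in `utils.py` for the real options.

```bash
# All mode names and flags below are hypothetical; see --help for the real interface.
python3 __main__.py preprocess_wiki                       # build vocabulary + pre-training data
python3 __main__.py pretrain --name mathgpt_pretrained    # pre-train a MathGPT model
python3 __main__.py preprocess_downstream                 # pre-process a downstream dataset
python3 __main__.py crossval --pretrained_name mathgpt_pretrained  # fine-tune + evaluate per fold
```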