This is the official repository for implementing the paper Process-Driven Autoformalization in Lean 4.
The autoformalization dataset Formalization for Lean 4 (forml4) is released and located in data/FormL4
. We will release the code and corresponding finetuned autoformalizer and process-enhanced verifier soon.
An overview of process-driven autoformalization (PDA):
- FORML4 dataset is constructed by prompting Gemini-1.5-Pro to informalize theorems extracted from Mathlib 4;
- An autoformalizer model is trained on FORML4, with output sent to the Lean 4 Compiler for automated feedback;
- The feedback can provide process-level annotations for the autoformalizer's output which are used to train an effective process-supervised verifier (PSV);
- For further enhancement, the autoformalizer is subsequently fine-tuned by the verifier’s feedback, while the verifier can again benefit from the improved autoformalizer’s higher-quality output data.
PDA
├── annotation
│ ├── informalization_autoformalization
│ ├── model_comparison
│ └── README.md
├── code
│ ├── Automatic-Lean4-Compilation
│ ├── Dataset-Informalization
│ ├── google-cloud-sdk
│ └── Training-Scripts
├── data
│ ├── FormL4
│ │ ├── basic_test.json
│ │ ├── random_test.json
│ │ ├── real_test.json
│ │ └── train.json
│ └── raw.zip
├── prompt
└── requirements.txt
The data
directory includes both training and testing data (forml4) for training and testing the autoformalization task in Lean 4. It contains:
- The
raw.zip
contains raw theorems extracted from mathlib4; - The
FormL4
folder contains the final FormL4 dataset, constructed from LLM informalization and manual processing. See the paper for details about data filtering, construction, etc.
The forml4 dataset is organized into the following files, ensuring a comprehensive approach to both training and testing your models:
train.json
: The training data file containing 14,250 examples to train models. It includes both the formal and informal statements necessary for learning the autoformalization.basic_test.json
: The basic test set (970) is specifically designed to evaluate a model's capability to formalize fundamental theorems.random_test.json
: The random test set (950) contains a diverse and randomly selected set of problems.real_test.jsonl
: The real test set represents the out-of-domain test set, featuring 967 natural language mathematics questions and answers distilled from the Arithmo test set.
-
Dataset-Informalization
: This folder contains code used to construct the FormL4 dataset. See theREADME.md
in the folder for details about replication. -
Automatic-Lean4-Compilation
: Please visit our GitHub repository for more details about how to automate your Lean 4 installation and compilation: Automatic Lean 4 Compilation Guide -
Training/Evaluating autoformalizer and verifier based on (forml4)
Human annotations are provided in the annotation
sub-directory. It contains human evaluation results for three stages of PDA:
- Before informalization (stage I): We conduct a human evaluation on cross-model preference comparison between gpt-4o and gemini-pro-1.5 (sample size = 10);
- After informalization (stage II): After informalizating the FormL4 dataset, we conduct a manual quality check on informalization success by randomly sampling from three different test sets (sample size = 60).
- After autoformalization training (stage III): After completing training and enhancing the autoformalizer model, the model's autoformalization output was evaluated by human evaluators (sample size = 80).
See the README.md
in the sub-directory for details.
Model | Random | Basic | Real | ||||||
---|---|---|---|---|---|---|---|---|---|
Greedy | Pass@1 | Pass@5 | Greedy | Pass@1 | Pass@5 | Greedy | Pass@1 | Pass@5 | |
GPT-3.5-Turbo | 0.41 | 0.32 | 0.73 | 0.29 | 0.00 | 0.66 | 5.10 | 3.80 | 17.00 |
GPT-4-Turbo | 0.49 | 0.41 | 3.42 | 1.47 | 1.14 | 4.38 | 10.20 | 8.70 | 25.10 |
DeepSeek-Math-Base-7B | 0.17 | 0.21 | 0.95 | 0.34 | 0.22 | 0.81 | 0.00 | 0.00 | 0.00 |
DeepSeek-Math-Instruct-7B | 0.58 | 0.22 | 1.71 | 1.16 | 0.47 | 3.04 | 0.30 | 1.60 | 5.30 |
LLEMMA-7B | 0.00 | 0.00 | 0.77 | 0.17 | 0.10 | 0.45 | 0.00 | 0.00 | 0.00 |
LLEMMA-34B | 0.00 | 0.00 | 0.18 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
InternLM-Math-7B | 0.00 | 0.00 | 0.18 | 0.19 | 0.14 | 0.26 | 1.10 | 1.00 | 3.70 |
InternLM-Math-20B | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.20 | 0.70 | 2.30 |
Mistral-Instruct-v0.2-7B | 0.28 | 0.21 | 1.86 | 0.45 | 0.77 | 1.82 | 0.30 | 0.50 | 1.90 |
We evaluate and compare the performance of an enhanced autoformalization model that incorporates a verifier in its process. This model is denoted as "RFT + Verifier (Ours)" and is compared against baseline and RFT (Rejective Sampling Fine-tuning) models. The RFT approach selects outputs that successfully compile for further fine-tuning of the baseline model, whereas our method additionally checks for correctness as evaluated by the verifier. The results are summarized in the following table.
Model | Quantity | Quality (Acc.) | Test Sets - Basic | Test Sets - Random | Test Sets - Real |
---|---|---|---|---|---|
Baseline | - | - | 40.92 | 35.88 | 23.90 |
RFT | + 65K | 100% | 44.50 | 38.70 | 26.50 |
Verifier (Ours) | + 74K | 80.50% | 43.80 | 38.04 | 25.70 |
RFT + Verifier (Ours) | + 60K | 100% | 46.28 | 39.38 | 27.90 |
Next, we further evaluated an enhanced verifier by applying it to outputs from the RFT+Verifier model. The evaluation was conducted in relation to a pre-defined experimental setup, comparing process-supervised training (PSV +), which builds upon the previous PSV model, and outcome-supervised training (OSV +), which improves upon OSV. The results can be seen in the table below.
Dataset | OSV MP1 |
OSV Acc |
OSV Recall |
OSV + MP1 |
OSV + Acc |
OSV + Recall |
PSV MP1 |
PSV Acc |
PSV Recall |
PSV + MP1 |
PSV + Acc |
PSV + Recall |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Basic | 40.71 | 81.34 | 80.12 | 45.13 | 84.22 | 83.18 | 41.49 | 84.59 | 82.73 | 47.30 | 89.14 | 94.17 |
Random | 36.14 | 81.16 | 81.07 | 38.21 | 83.13 | 84.45 | 37.52 | 84.68 | 83.47 | 44.32 | 84.20 | 93.70 |
Real | 25.75 | 84.45 | 86.21 | 33.41 | 84.45 | 86.21 | 33.42 | 84.34 | 81.08 | 45.10 | 94.18 | 89.31 |
To contribute to this project, please follow the guidelines provided in CONTRIBUTING.md
. We welcome contributions from the community, including bug fixes, enhancements, or documentation.
This work is licensed under CC BY 4.0, a Creative Commons Attribution 4.0 International License.
If you find the methods and resources provided in this work beneficial, we kindly request that you cite our paper. Below is the BibTeX entry for your convenience:
@article{lu2024process,
title={Process-Driven Autoformalization in Lean 4},
author={Lu, Jianqiao and Liu, Zhengying and Wan, Yingjia and Huang, Yinya and Wang, Haiming and Yang, Zhicheng and Tang, Jing and Guo, Zhijiang},
journal={arXiv preprint arXiv:2406.01940},
year={2024}
}
Thank you for considering our work for your research.
Feel free to reach out with questions or for collaboration.