This repository provides the accompanying code for Large Language Monkeys: Scaling Inference Compute with Repeated Sampling.
Specifically, the code needed to:
- Generate samples from various datasets and models.
- Evaluate the correctness of the samples.
Four datasets are supported:
- GSM8K
- MATH
- CodeContests
- MiniF2F-MATH
We use vLLM for inference, so any model that vLLM supports will work with our generation scripts.
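As a rough illustration of the repeated-sampling setup (a standalone sketch, not the repo's generation code; the model name and prompt are placeholders), vLLM can draw many samples per prompt in a single call:

```python
from vllm import LLM, SamplingParams

# Placeholder model and prompt; any model supported by vLLM should work.
llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")
sampling_params = SamplingParams(n=4, temperature=0.6, max_tokens=512)

outputs = llm.generate(["Question: What is 7 * 8?\nAnswer:"], sampling_params)
for completion in outputs[0].outputs:  # one entry per sample
    print(completion.text)
```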
We use two different conda environments for this project, because the lean-dojo version we use requires Python 3.9.19. For MiniF2F-MATH, create a dedicated environment:
conda create -n llmonk-minif2f python=3.9.19
conda activate llmonk-minif2f
pip install -r requirements_minif2f.txt
To run evaluation on MiniF2F-MATH, you additionally need to install Lean 4; follow the Lean 4 installation instructions for your system.
When prompted with
Current installation options:
default toolchain: stable
modify PATH variable: yes
1) Proceed with installation (default)
2) Customize installation
3) Cancel installation
Choose option 2 and change the default toolchain to 4.3.0-rc2.
Second, create the main environment:
conda create -n llmonk python=3.11.8
conda activate llmonk
pip install -r requirements.txt
The repo is organized as follows:
large-language-monkeys/
├── llmonk/
│   ├── evaluate/
│   │   ├── gsm8k.py
│   │   ├── math.py
│   │   ├── code_contests.py
│   │   └── minif2f.py
│   ├── generate/
│   │   ├── gsm8k.py
│   │   ├── math.py
│   │   ├── code_contests.py
│   │   └── minif2f.py
│   └── tests/
│       ├── math_datasets.py
│       ├── code_contests.py
│       └── minif2f.py
├── README.md
└── requirements.txt
- llmonk/evaluate/: contains the code to evaluate dataset samples
- llmonk/generate/: contains the code to generate samples from a model
- llmonk/tests/: contains code to check the correctness of our evaluation scripts
Within each folder, there is a file for each of the supported datasets (note that the scripts for MATH and GSM8K are combined under "math_datasets" for evaluation and testing).
The generation scripts in llmonk/generate/ are used to generate samples from a model for a dataset.
Each file has two mandatory arguments:
- model: the Hugging Face model used to generate the samples (the same string you would pass to .from_pretrained)
- save_dir: the directory in which to save the samples
For the remaining optional arguments (e.g. temperature, number of samples, batch size, vLLM arguments), see the GenerateScriptConfig class in llmonk/utils.py.
The samples are saved as YAML files (one YAML file per problem). Every dataset's YAML file contains the following keys:
- prompt: the prompt for the problem
- question: the current question for the problem
- samples: a list of samples for the problem
For GSM8K and MATH, there is the additional key:
- gt_answer: the dataset's ground-truth answer for the problem
For CodeContests, there is the additional key:
- test_cases: a dictionary with the following keys:
  - input: a list of strings corresponding to test case inputs
  - output: a list of strings corresponding to test case outputs
For MiniF2F-MATH, there is the additional key:
- theorem_name: the name of the theorem to be proven
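For example, an individual sample file can be inspected with PyYAML (a minimal sketch; the file path below is hypothetical and depends on the save_dir you passed to the generation script):

```python
import yaml  # PyYAML

# Hypothetical path: the generation scripts write one YAML file per problem to save_dir.
with open("samples/gsm8k/problem_0.yaml") as f:
    record = yaml.safe_load(f)

print(record["question"])       # the problem's question
print(len(record["samples"]))   # number of generated samples
print(record.get("gt_answer"))  # present for GSM8K and MATH only
```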
The evaluation scripts in llmonk/evaluate/ check the correctness of the samples produced by the generation scripts.
Each file has two mandatory arguments:
- samples_dir: the directory containing the samples
- save_dir: the directory in which to save the evaluation results
For the remaining optional arguments (e.g. number of workers), see the EvaluateScriptConfig class in llmonk/utils.py.
The evaluation results are saved as YAML files (one YAML file per problem), in the same format as the generated samples, with one additional key:
- is_correct: a list of booleans indicating whether each sample is correct; is_correct[i] is True if and only if samples[i] is correct
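As an example of consuming these files, the sketch below counts how many problems have at least one correct sample (the coverage metric studied in the paper); the results directory is hypothetical and should be whatever save_dir you passed to the evaluation script:

```python
import pathlib
import yaml

# Hypothetical directory: one YAML file per problem, written by an evaluation script.
result_files = sorted(pathlib.Path("eval_results/gsm8k").glob("*.yaml"))

solved = sum(
    1 for path in result_files
    if any(yaml.safe_load(path.read_text())["is_correct"])  # any correct sample solves the problem
)
print(f"{solved}/{len(result_files)} problems have at least one correct sample")
```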
The llmonk/tests/ directory contains unit tests that check the correctness of the evaluation scripts.
See commands.md
for examples of how to run generation, evaluation, and testing.
If you use this code in your research, please cite our paper. You can use the following BibTeX entry:
@misc{brown2024largelanguagemonkeysscaling,
title={Large Language Monkeys: Scaling Inference Compute with Repeated Sampling},
author={Bradley Brown and Jordan Juravsky and Ryan Ehrlich and Ronald Clark and Quoc V. Le and Christopher Ré and Azalia Mirhoseini},
year={2024},
eprint={2407.21787},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2407.21787},
}