This repository provides the accompanying code for Large Language Monkeys: Scaling Inference Compute with Repeated Sampling.
Specifically, the code needed to:
- Generate samples from various datasets and models.
- Evaluate the correctness of the samples.
Four datasets are supported:
- GSM8K
- MATH
- CodeContests
- MiniF2F-MATH
We use vLLM for inference, so any model that vLLM supports will work with our generation scripts.
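As a rough illustration of the repeated-sampling setup (a standalone sketch, not the repo's generation code; the model name and prompt are placeholders), vLLM can draw many samples per prompt in a single call:

```python
from vllm import LLM, SamplingParams

# Placeholder model and prompt; any model supported by vLLM should work.
llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")
sampling_params = SamplingParams(n=4, temperature=0.6, max_tokens=512)

outputs = llm.generate(["Question: What is 7 * 8?\nAnswer:"], sampling_params)
for completion in outputs[0].outputs:  # one entry per sample
    print(completion.text)
```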
We use two different conda environments for this project, because the lean-dojo version we use requires Python 3.9.19. For MiniF2F-MATH, create a dedicated environment:
conda create -n llmonk-minif2f python=3.9.19
conda activate llmonk-minif2f
pip install -r requirements_minif2f.txt
To run evaluation on MiniF2F-MATH, you additionally need to install Lean 4; follow the Lean 4 installation instructions for your system.
When prompted with
Current installation options:
default toolchain: stable
modify PATH variable: yes
1) Proceed with installation (default)
2) Customize installation
3) Cancel installation
Choose option 2 and change the default toolchain to 4.3.0-rc2.
Second, create the main environment:
conda create -n llmonk python=3.11.8
conda activate llmonk
pip install -r requirements.txt
The repo is organized as follows:
large-language-monkeys/
├── llmonk/
│   ├── evaluate/
│   │   ├── gsm8k.py
│   │   ├── math.py
│   │   ├── code_contests.py
│   │   └── minif2f.py
│   ├── generate/
│   │   ├── gsm8k.py
│   │   ├── math.py
│   │   ├── code_contests.py
│   │   └── minif2f.py
│   └── tests/
│       ├── math_datasets.py
│       ├── code_contests.py
│       └── minif2f.py
├── README.md
└── requirements.txt
- llmonk/evaluate/: contains the code to evaluate dataset samples
- llmonk/generate/: contains the code to generate samples from a model
- llmonk/tests/: contains code to check the correctness of our evaluation scripts
Within each folder, there is a file for each of the supported datasets (note that the scripts for MATH and GSM8K are combined under "math_datasets" for evaluation and testing).
The generation scripts in llmonk/generate/ are used to generate samples from a model for a dataset.
Each file has two mandatory arguments:
- model: the Hugging Face model used to generate the samples (the same string you would pass to .from_pretrained)
- save_dir: the directory in which to save the samples
For the remaining optional arguments (e.g. temperature, number of samples, batch size, vLLM arguments), see the GenerateScriptConfig class in llmonk/utils.py.
The samples are saved as YAML files (one YAML file per problem). Every dataset's YAML file contains the following keys:
- prompt: the prompt for the problem
- question: the current question for the problem
- samples: a list of samples for the problem
For GSM8K and MATH, there is the additional key:
- gt_answer: the dataset's ground-truth answer for the problem
For CodeContests, there is the additional key:
- test_cases: a dictionary with the following keys:
  - input: a list of strings corresponding to test case inputs
  - output: a list of strings corresponding to test case outputs
For MiniF2F-MATH, there is the additional key:
- theorem_name: the name of the theorem to be proven
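For example, an individual sample file can be inspected with PyYAML (a minimal sketch; the file path below is hypothetical and depends on the save_dir you passed to the generation script):

```python
import yaml  # PyYAML

# Hypothetical path: the generation scripts write one YAML file per problem to save_dir.
with open("samples/gsm8k/problem_0.yaml") as f:
    record = yaml.safe_load(f)

print(record["question"])       # the problem's question
print(len(record["samples"]))   # number of generated samples
print(record.get("gt_answer"))  # present for GSM8K and MATH only
```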
The evaluation scripts in llmonk/evaluate/ check the correctness of the samples produced by the generation scripts.
Each file has two mandatory arguments:
- samples_dir: the directory containing the samples
- save_dir: the directory in which to save the evaluation results
For the remaining optional arguments (e.g. number of workers), see the EvaluateScriptConfig class in llmonk/utils.py.
The evaluation results are saved as YAML files (one YAML file per problem), in the same format as the generated samples, with one additional key:
- is_correct: a list of booleans indicating whether each sample is correct; is_correct[i] is True if and only if samples[i] is correct
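As an example of consuming these files, the sketch below counts how many problems have at least one correct sample (the coverage metric studied in the paper); the results directory is hypothetical and should be whatever save_dir you passed to the evaluation script:

```python
import pathlib
import yaml

# Hypothetical directory: one YAML file per problem, written by an evaluation script.
result_files = sorted(pathlib.Path("eval_results/gsm8k").glob("*.yaml"))

solved = sum(
    1 for path in result_files
    if any(yaml.safe_load(path.read_text())["is_correct"])  # any correct sample solves the problem
)
print(f"{solved}/{len(result_files)} problems have at least one correct sample")
```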
The llmonk/tests/ directory contains unit tests that check the correctness of the evaluation scripts.
See commands.md
for examples of how to run generation, evaluation, and testing.
If you use this code in your research, please cite our paper. You can use the following BibTeX entry:
@misc{brown2024largelanguagemonkeysscaling,
title={Large Language Monkeys: Scaling Inference Compute with Repeated Sampling},
author={Bradley Brown and Jordan Juravsky and Ryan Ehrlich and Ronald Clark and Quoc V. Le and Christopher Ré and Azalia Mirhoseini},
year={2024},
eprint={2407.21787},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2407.21787},
}