
Introduces AutoGenBench #1048

Merged
merged 48 commits on Jan 26, 2024
Changes from 36 commits
Commits
48 commits
- d68a8f6  Initial commit of AutoGenBench (afourney, Dec 22, 2023)
- 23fd4bf  Fixed merge conflicts (afourney, Dec 22, 2023)
- e2517f2  Merge branch 'main' into autogenbench (qingyun-wu, Dec 25, 2023)
- 617bb5e  wording (qingyun-wu, Dec 25, 2023)
- fcf279b  typo (qingyun-wu, Dec 25, 2023)
- 6e9c3b7  pre-commit reformulation (qingyun-wu, Dec 25, 2023)
- a4a4134  Merge branch 'main' into autogenbench (afourney, Jan 2, 2024)
- f80f40b  Merge main (afourney, Jan 8, 2024)
- 051d0be  Updated README to point to contributor's guide earlier. (afourney, Jan 8, 2024)
- a81334b  Merge branch 'main' into autogenbench (afourney, Jan 11, 2024)
- 5430191  Simplified the description of the JSON format. (afourney, Jan 12, 2024)
- 485c923  Merge branch 'main' into autogenbench (afourney, Jan 17, 2024)
- 53dbe7b  Added print statements to indicate when run.sh and scenario.py are st… (afourney, Jan 17, 2024)
- 7bd7803  Merge branch 'main' into autogenbench (afourney, Jan 18, 2024)
- ec16e52  Added SocietyOfMind scenario to GAIA. (afourney, Jan 18, 2024)
- ef1ff83  Pointing autogenbench clone command to the latest branch. (afourney, Jan 18, 2024)
- 84aaa0d  Temporarily disable subsample option. (afourney, Jan 19, 2024)
- e571bc7  Updated the GAIA readme to specify how to define a BING API key. (afourney, Jan 19, 2024)
- 85ab477  Fixed and re-enabled the subsample option. (afourney, Jan 19, 2024)
- 443c83e  Merge branch 'main' into autogenbench (afourney, Jan 22, 2024)
- d20e5ca  Added a draft of a blog post. (afourney, Jan 22, 2024)
- 91bc8b7  Updated authors. (afourney, Jan 22, 2024)
- 6249a8d  Incorporating Gagan's feedback. (afourney, Jan 22, 2024)
- 07b5698  Fixed code formatting. (afourney, Jan 22, 2024)
- a0bbca5  Updated the help string in the docs. (afourney, Jan 22, 2024)
- b8cda92  Light editing of the AutoGenBench blogpost. (afourney, Jan 22, 2024)
- 2776af3  Support filtering on model tags. (afourney, Jan 22, 2024)
- f331394  Merge branch 'main' into autogenbench (afourney, Jan 23, 2024)
- fdbfbc7  Added websurfer dependencies to Dockerfile. (afourney, Jan 23, 2024)
- a29a581  Renamed testbed -> autogenbench (afourney, Jan 24, 2024)
- d99885b  Merge branch 'main' into autogenbench (afourney, Jan 24, 2024)
- 25be588  Attempting to fix formatting. (afourney, Jan 24, 2024)
- 0abc571  Added more gracefull handling of task timeouts (the script is allowed… (afourney, Jan 24, 2024)
- 3576604  Updated the blogpost based on Saleema's and Julia's feedback. (afourney, Jan 24, 2024)
- e00f442  Fixed formatting... again. (afourney, Jan 24, 2024)
- f91cdd2  Merge branch 'main' into autogenbench (afourney, Jan 25, 2024)
- f8f76e3  Added a main MANIFEST to list available scenarios. (afourney, Jan 25, 2024)
- ae14045  Limit main manifest to directories. (afourney, Jan 25, 2024)
- c71a235  Manifests now use relative paths. (afourney, Jan 25, 2024)
- 768f6d0  All manifests are now relative. (afourney, Jan 25, 2024)
- 41fb911  Updated the contributing guide, and address windows path issues. (afourney, Jan 25, 2024)
- ee49d7d  Updated the version. Fixed formatting. (afourney, Jan 25, 2024)
- 1b428cf  Fixed formatting. (afourney, Jan 25, 2024)
- 6089571  De-listing Examples, since it has no clear tabulate criteria. (afourney, Jan 25, 2024)
- 9e1bf4c  Updated email in pyproject (afourney, Jan 26, 2024)
- ac1f7e8  typo in blogpost (qingyun-wu, Jan 26, 2024)
- 7bc02e9  Merge branch 'autogenbench' of github.com:microsoft/autogen into auto… (qingyun-wu, Jan 26, 2024)
- 54d509c  wording (qingyun-wu, Jan 26, 2024)
188 changes: 188 additions & 0 deletions samples/tools/autogenbench/CONTRIBUTING.md
@@ -0,0 +1,188 @@
# Contributing to AutoGenBench

As part of the broader AutoGen project, AutoGenBench welcomes community contributions. Contributions are subject both to AutoGen's [contribution guidelines](https://microsoft.github.io/autogen/docs/Contribute) and to a few additional AutoGenBench-specific requirements outlined here. If you prefer to develop your own private benchmark scenarios, the guidance in this document will help with those efforts as well. Below you will find the general requirements, followed by a detailed technical description.

## General Contribution Requirements
We ask that all contributions to AutoGenBench adhere to the following:

- Follow AutoGen's broader [contribution guidelines](https://microsoft.github.io/autogen/docs/Contribute)
- All AutoGenBench benchmarks should live in a subfolder of `/samples/tools/autogenbench/scenarios` alongside `HumanEval`, `GAIA`, etc.
- Benchmark scenarios should include a detailed README.md, in the root of their folder, describing the benchmark and providing citations where warranted.
- Benchmark data (tasks, ground truth, etc.) should be downloaded from their original sources rather than hosted in the AutoGen repository (unless the benchmark is original, and the repository *is* the original source).
  - You can use the `Scripts/init_tasks.py` file to automate this download.
- Basic scoring should be compatible with the `autogenbench tabulate` command (e.g., by outputting logs compatible with the default tabulation mechanism, or by providing a `Scripts/custom_tabulate.py` file)
- If you wish your benchmark to be compatible with the `autogenbench clone` command, include a `MANIFEST.json` file in the root of your folder.

These requirements are further detailed below, but if you simply copy the `HumanEval` folder, you will already be off to a great start.

## Implementing and Running Benchmark Tasks
At the core of any benchmark is a set of tasks. To implement tasks that are runnable by AutoGenBench, you must adhere to AutoGenBench's templating and scenario expansion algorithms, as outlined below.

### Task Definitions

All tasks are stored in JSONL files (in subdirectories under `./Tasks`). Each line of a tasks file is a JSON object with the following schema:

```
{
    "id": string,
    "template": dirname,
    "substitutions": {
        "filename1": {
            "find_string1_1": replace_string1_1,
            "find_string1_2": replace_string1_2,
            ...
            "find_string1_M": replace_string1_M
        },
        "filename2": {
            "find_string2_1": replace_string2_1,
            "find_string2_2": replace_string2_2,
            ...
            "find_string2_N": replace_string2_N
        }
    }
}
```

For example:

```
{
    "id": "two_agent_stocks_gpt4",
    "template": "default_two_agents",
    "substitutions": {
        "scenario.py": {
            "__MODEL__": "gpt-4"
        },
        "prompt.txt": {
            "__PROMPT__": "Plot and save to disk a chart of NVDA and TESLA stock price YTD."
        }
    }
}
```

In this example, the string `__MODEL__` will be replaced in the file `scenario.py`, while the string `__PROMPT__` will be replaced in the `prompt.txt` file.

The `template` field can also take on a list value, but this usage is considered advanced and is not described here. See the `autogenbench/run_cmd.py` code, or the `GAIA` benchmark task files, for additional information about this option.


## Task Instance Expansion Algorithm

Once the tasks have been defined, as per above, they must be "instantiated" before they can be run. This instantiation happens automatically when the user issues the `autogenbench run` command and involves creating a local folder to share with Docker. Each instance and repetition gets its own folder along the path: `./results/[scenario]/[task_id]/[instance_id]`. For the sake of brevity we will refer to this folder as the `DEST_FOLDER`.

The algorithm for populating the `DEST_FOLDER` is as follows:

1. Pre-populate DEST_FOLDER with all the basic starter files for running a scenario (found in `autogenbench/template`).
2. Recursively copy the template folder specified in the JSONL line to DEST_FOLDER (if the JSON `template` attribute points to a folder). If the JSON's `template` attribute instead points to a file, copy the file, but rename it to `scenario.py`.
3. Apply any string replacements, as outlined in the prior section.
4. Write a `run.sh` file to DEST_FOLDER that will be executed by Docker when it is loaded. The `run.sh` file is described below.
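
The following is a rough Python sketch of steps 1-3, included only for illustration; names such as `expand_task` and the simplified path handling are not the actual identifiers used in `autogenbench/run_cmd.py`:

```python
import os
import shutil


def expand_task(task, dest_folder, template_root="autogenbench/template"):
    """Illustrative sketch of task expansion; see autogenbench/run_cmd.py for the real logic."""
    # 1. Pre-populate DEST_FOLDER with the basic starter files.
    shutil.copytree(template_root, dest_folder, dirs_exist_ok=True)

    # 2. Copy the task's template folder (or single file, renamed to scenario.py).
    template = task["template"]
    if os.path.isdir(template):
        shutil.copytree(template, dest_folder, dirs_exist_ok=True)
    else:
        shutil.copyfile(template, os.path.join(dest_folder, "scenario.py"))

    # 3. Apply the string substitutions to each named file.
    for filename, replacements in task.get("substitutions", {}).items():
        path = os.path.join(dest_folder, filename)
        with open(path, "rt") as fh:
            content = fh.read()
        for find_str, replace_str in replacements.items():
            content = content.replace(find_str, replace_str)
        with open(path, "wt") as fh:
            fh.write(content)
```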

## Scenario Execution Algorithm

Once the task has been instantiated, it is run (via `run.sh`). This script will execute the following steps:

1. If a file named `global_init.sh` is present, run it.
2. If a file named `scenario_init.sh` is present, run it.
3. Install the packages listed in the `requirements.txt` file (if running in Docker).
4. Run the task via `python scenario.py`.
5. If `scenario.py` exited cleanly (exit code 0), print "SCENARIO.PY COMPLETE !#!#".
6. Clean up (delete cache, etc.).
7. If a file named `scenario_finalize.sh` is present, run it.
8. If a file named `global_finalize.sh` is present, run it.
9. Echo "RUN.SH COMPLETE !#!#", signaling that all steps completed.

Notably, this means that scenarios can add custom init and teardown logic by including `scenario_init.sh` and `scenario_finalize.sh` files.

At the time of this writing, the run.sh file is as follows:

```sh
export AUTOGEN_TESTBED_SETTING="Docker"
umask 000

# Run the global init script if it exists
if [ -f global_init.sh ] ; then
    . ./global_init.sh
fi

# Run the scenario init script if it exists
if [ -f scenario_init.sh ] ; then
    . ./scenario_init.sh
fi

# Run the scenario
pip install -r requirements.txt
python scenario.py
EXIT_CODE=$?
if [ $EXIT_CODE -ne 0 ]; then
    echo SCENARIO.PY EXITED WITH CODE: $EXIT_CODE !#!#
else
    echo SCENARIO.PY COMPLETE !#!#
fi

# Clean up
if [ -d .cache ] ; then
    rm -Rf .cache
fi

# Run the scenario finalize script if it exists
if [ -f scenario_finalize.sh ] ; then
    . ./scenario_finalize.sh
fi

# Run the global finalize script if it exists
if [ -f global_finalize.sh ] ; then
    . ./global_finalize.sh
fi

echo RUN.SH COMPLETE !#!#
```

Be warned that this listing is provided here for illustration purposes, and may vary over time. The sources of truth are the `run.sh` files found in the ``./results/[taskset]/[task_id]/[instance_id]`` folders.


## Integrating with the `tabulate` and `clone` commands

The above details are sufficient for defining and running tasks, but if you wish to support the `autogenbench tabulate` and `autogenbench clone` commands, a few additional steps are required.

### Tabulations

If you wish to leverage the default tabulation logic, it is as simple as arranging for your `scenario.py` file to output the string "ALL TESTS PASSED !#!#" to the console when a task has been solved correctly.
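
For example, the end of a `scenario.py` might contain a check along these lines (the `task_passed()` helper is hypothetical and stands in for whatever verification your scenario performs):

```python
# Hypothetical final check; replace task_passed() with your own verification logic.
if task_passed():
    print("ALL TESTS PASSED !#!#")
```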

If you wish to implement your own tabulation logic, simply create the file `Scripts/custom_tabulate.py` and include a `main(args)` method. Here, the `args` parameter will be provided by AutoGenBench, and is a drop-in replacement for `sys.argv`. In particular, `args[0]` will be the invocation command (similar to the executable or script name in `sys.argv`), and the remaining values (`args[1:]`) are the command line parameters.

Should you provide a custom tabulation script, please implement `--help` and `-h` options for documenting your interface.

The file `scenarios/GAIA/Scripts/custom_tabulate.py` is a great example of custom tabulation. It also shows how you can reuse some components of the default tabulator to speed up development.
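
As a rough illustration, a minimal `Scripts/custom_tabulate.py` might look like the sketch below. The directory-walking and scoring logic are placeholders (here, simply searching each `console_log.txt` for the default success string); the only interface requirement is the `main(args)` entry point described above, and using `argparse` provides the requested `--help`/`-h` options for free:

```python
import argparse
import os
import sys


def main(args):
    # args is a drop-in replacement for sys.argv: args[0] is the invocation
    # command, and args[1:] are the command-line parameters.
    parser = argparse.ArgumentParser(prog=args[0], description="Tabulate results for this benchmark.")
    parser.add_argument("runlogs", help="Path to the results directory produced by 'autogenbench run'.")
    parsed = parser.parse_args(args[1:])

    # Walk [runlogs]/[task_id]/[instance_id] and count runs whose console log
    # contains the default success marker (placeholder scoring logic).
    for task_id in sorted(os.listdir(parsed.runlogs)):
        task_dir = os.path.join(parsed.runlogs, task_id)
        if not os.path.isdir(task_dir):
            continue
        outcomes = []
        for instance_id in sorted(os.listdir(task_dir)):
            console_log = os.path.join(task_dir, instance_id, "console_log.txt")
            if os.path.isfile(console_log):
                with open(console_log, "rt") as fh:
                    outcomes.append("ALL TESTS PASSED !#!#" in fh.read())
        print(f"{task_id}: {sum(outcomes)} / {len(outcomes)} passed")


if __name__ == "__main__":
    main(sys.argv)
```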


### Cloning

If you wish your benchmark to be available via the `autogenbench clone` command, you will need to take three additional steps:

#### Manifest
First, provide a `MANIFEST.json` file in the root of your benchmark. An example is provided below, from which you can see the schema:

```json
{
    "files": {
        "Templates/TwoAgents/prompt.txt": "samples/tools/autogenbench/scenarios/HumanEval/Templates/TwoAgents/prompt.txt",
        "Templates/TwoAgents/coding/my_tests.py": "samples/tools/autogenbench/scenarios/HumanEval/Templates/TwoAgents/coding/my_tests.py",
        "Templates/TwoAgents/scenario.py": "samples/tools/autogenbench/scenarios/HumanEval/Templates/TwoAgents/scenario.py",
        "README.md": "samples/tools/autogenbench/scenarios/HumanEval/README.md",
        "Scripts/init_tasks.py": "samples/tools/autogenbench/scenarios/HumanEval/Scripts/init_tasks.py",
        "Scripts/custom_tabulate.py": "samples/tools/autogenbench/scenarios/HumanEval/Scripts/custom_tabulate.py"
    }
}
```

The keys of the `files` dictionary are local paths, relative to your benchmark's root directory. The values are relative paths in the AutoGen GitHub repository (relative to `https://raw.githubusercontent.com/microsoft/autogen/{BRANCH}/`), where `{BRANCH}` is defined in `autogenbench/clone_cmd.py`.

#### SCENARIOS dictionary
Second, you must add an entry to the `SCENARIOS` dictionary in `autogenbench/clone_cmd.py`.

#### Scripts/init_tasks.py
Finally, you should provide a `Scripts/init_tasks.py` file in your benchmark folder, and include a `main()` method therein. This method will be loaded and called automatically by `autogenbench clone` after all manifest files have been downloaded.

This `init_tasks.py` script is a great place to download benchmarks from their original sources and convert them to the JSONL format required by AutoGenBench (a minimal sketch follows the list below):
- See `HumanEval/Scripts/init_tasks.py` for an example of how to expand a benchmark from an original GitHub repository.
- See `GAIA/Scripts/init_tasks.py` for an example of how to expand a benchmark from the Hugging Face Hub.
- See `MATH/Scripts/init_tasks.py` for an example of how to expand a benchmark from an author-hosted website.
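
For orientation, here is a heavily simplified, hypothetical `Scripts/init_tasks.py` that downloads a dataset and writes a JSONL task file. The URL, field names, and template folder below are placeholders; consult the `HumanEval`, `GAIA`, or `MATH` scripts for real, working examples:

```python
import json
import os
import urllib.request

# Placeholder source; a real benchmark would point at its original dataset.
DATA_URL = "https://example.com/my_benchmark/tasks.json"
SCENARIO_DIR = os.path.dirname(os.path.dirname(os.path.realpath(__file__)))
TASKS_DIR = os.path.join(SCENARIO_DIR, "Tasks")


def main():
    os.makedirs(TASKS_DIR, exist_ok=True)

    # Download the benchmark data from its original source.
    with urllib.request.urlopen(DATA_URL) as response:
        problems = json.loads(response.read().decode("utf-8"))

    # Convert each problem into one JSONL line in the format AutoGenBench expects.
    with open(os.path.join(TASKS_DIR, "my_benchmark_two_agents.jsonl"), "wt") as fh:
        for problem in problems:
            task = {
                "id": problem["id"],
                "template": "Templates/TwoAgents",
                "substitutions": {
                    "prompt.txt": {"__PROMPT__": problem["question"]},
                    "scenario.py": {"__MODEL__": "gpt-4"},
                },
            }
            fh.write(json.dumps(task) + "\n")


if __name__ == "__main__":
    main()
```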
4 changes: 4 additions & 0 deletions samples/tools/autogenbench/MANIFEST.in
@@ -0,0 +1,4 @@
recursive-exclude scenarios *
recursive-exclude results *
recursive-exclude tests *
recursive-exclude utils *
172 changes: 172 additions & 0 deletions samples/tools/autogenbench/README.md
@@ -0,0 +1,172 @@
# AutoGenBench

AutoGenBench is a tool for repeatedly running a set of pre-defined AutoGen tasks in a setting with tightly-controlled initial conditions. With each run, AutoGenBench starts from a blank slate: the agents being evaluated must work out what code needs to be written, and what libraries or dependencies to install, in order to solve the tasks. The results of each run are logged and can be ingested by analysis or metrics scripts (such as `autogenbench tabulate`). By default, all runs are conducted in freshly-initialized Docker containers, providing the recommended level of consistency and safety.

AutoGenBench works with all AutoGen 0.1.* and 0.2.* versions.

## Technical Specifications

If you are already an AutoGenBench pro and want the full technical specifications, please review the [contributor's guide](CONTRIBUTING.md).


## Docker Requirement
AutoGenBench also requires Docker (Desktop or Engine). **It will not run in GitHub Codespaces**, unless you opt for native execution (which is strongly discouraged). To install Docker Desktop see [https://www.docker.com/products/docker-desktop/](https://www.docker.com/products/docker-desktop/).

## Installation and Setup

**To get the most out of AutoGenBench, the `autogenbench` package should be installed**. At present, the easiest way to do this is to install it via `pip`:

```
pip install autogenbench
```

If you would prefer working from source code (e.g., for development, or to utilize an alternate branch), simply clone the [AutoGen](https://github.com/microsoft/autogen) repository, then install `autogenbench` via:

```
pip install -e autogen/samples/tools/autogenbench
```

After installation, you must configure your API keys. As with other AutoGen applications, AutoGenBench will look for the OpenAI keys in the OAI_CONFIG_LIST file in the current working directory, or the OAI_CONFIG_LIST environment variable. This behavior can be overridden using a command-line parameter described later.
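
If you are unsure of the expected format, an OAI_CONFIG_LIST is a JSON list of model configuration entries. A minimal example, with placeholder values, looks like this:

```
[
    {
        "model": "gpt-4",
        "api_key": "sk-xxxyyyzzz"
    }
]
```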

If you will be running multiple benchmarks, it is often most convenient to leverage the environment variable option. You can load your keys into the environment variable by executing:

```
export OAI_CONFIG_LIST=$(cat ./OAI_CONFIG_LIST)
```

If an OAI_CONFIG_LIST is *not* provided (by means of file or environment variable), AutoGenBench will use the OPENAI_API_KEY environment variable instead.


For some benchmark scenarios, additional keys may be required (e.g., keys for the Bing Search API). These can be added to an `ENV.json` file in the current working folder. An example `ENV.json` file is provided below:

```
{
    "BING_API_KEY": "xxxyyyzzz"
}
```

## A Typical Session
Once AutoGenBench is installed and the necessary keys are configured, a typical session will look as follows:

```
autogenbench clone HumanEval
cd HumanEval
autogenbench run Tasks/r_human_eval_two_agents.jsonl
autogenbench tabulate results/r_human_eval_two_agents
```

Where:
- `autogenbench clone HumanEval` downloads and expands the HumanEval benchmark scenario.
- `autogenbench run Tasks/r_human_eval_two_agents.jsonl` runs the tasks defined in `Tasks/r_human_eval_two_agents.jsonl`.
- `autogenbench tabulate results/r_human_eval_two_agents` tabulates the results of the run.

Each of these commands has extensive in-line help via:

- `autogenbench --help`
- `autogenbench clone --help`
- `autogenbench run --help`
- `autogenbench tabulate --help`

**NOTE:** If you are running `autogenbench` from within the repository, you don’t need to run `autogenbench clone`. Instead, navigate to the appropriate scenario folder (e.g., `scenarios/HumanEval`) and run the `Scripts/init_tasks.py` file.

More details of each command are provided in the sections that follow.

## Cloning Benchmarks
To clone an existing benchmark, simply run:
```
autogenbench clone [BENCHMARK]
```

For example,

```
autogenbench clone HumanEval
```

To see which existing benchmarks are available to clone, run:

```
autogenbench clone --list
```

## Running AutoGenBench

To run a benchmark (which executes the tasks, but does not compute metrics), simply execute:
```
cd [BENCHMARK]
autogenbench run Tasks
```

For example,
```
cd HumanEval
autogenbench run Tasks
```

The default is to run each task once. To run each scenario 10 times, use:

```
autogenbench run --repeat 10 Tasks
```

The `autogenbench` command-line tool accepts a number of command-line arguments to control various parameters of execution. Type ``autogenbench run -h`` to explore these options:

```
'autogenbench run' will run the specified autogen scenarios for a given number of repetitions and record all logs and trace information. When running in a Docker environment (default), each run will begin from a common, tightly controlled, environment. The resultant logs can then be further processed by other scripts to produce metrics.

positional arguments:
scenario The JSONL scenario file to run. If a directory is specified,
then all JSONL scenarios in the directory are run. (default:
./scenarios)

options:
-h, --help show this help message and exit
-c CONFIG, --config CONFIG
The environment variable name or path to the OAI_CONFIG_LIST (default: OAI_CONFIG_LIST).
-r REPEAT, --repeat REPEAT
The number of repetitions to run for each scenario (default: 1).
-s SUBSAMPLE, --subsample SUBSAMPLE
Run on a subsample of the tasks in the JSONL file(s). If a decimal value is specified, then run on
the given proportion of tasks in each file. For example "0.7" would run on 70% of tasks, and "1.0"
would run on 100% of tasks. If an integer value is specified, then randomly select *that* number of
tasks from each specified JSONL file. For example "7" would run 7 tasks, while "1" would run only 1
task from each specified JSONL file. (default: 1.0; which is 100%)
-m MODEL, --model MODEL
Filters the config_list to include only models matching the provided model name (default: None, which
is all models).
--requirements REQUIREMENTS
The requirements file to pip install before running the scenario.
-d DOCKER_IMAGE, --docker-image DOCKER_IMAGE
The Docker image to use when running scenarios. Can not be used together with --native. (default:
'autogenbench:default', which will be created if not present)
--native Run the scenarios natively rather than in docker. NOTE: This is not advisable, and should be done
with great caution.
```

## Results

By default, AutoGenBench stores results in a folder hierarchy with the following template:

``./results/[scenario]/[task_id]/[instance_id]``

For example, consider the following folders:

``./results/default_two_agents/two_agent_stocks/0``
``./results/default_two_agents/two_agent_stocks/1``

...

``./results/default_two_agents/two_agent_stocks/9``

These folders hold the results for the ``two_agent_stocks`` task of the ``default_two_agents`` tasks file. The ``0`` folder contains the results of the first instance/run, the ``1`` folder contains the results of the second run, and so on. You can think of the _task_id_ as mapping to a prompt, or a unique set of parameters, while the _instance_id_ identifies a specific attempt or run.

Within each folder, you will find the following files:

- *timestamp.txt*: records the date and time of the run, along with the version of the pyautogen library installed
- *console_log.txt*: all console output produced by Docker when running AutoGen. Read this like you would a regular console.
- *[agent]_messages.json*: for each agent, a log of its message dictionaries
- *./coding*: a directory containing all code written by AutoGen, and all artifacts produced by that code.

## Contributing or Defining New Tasks or Benchmarks

If you would like to develop -- or even contribute -- your own tasks or benchmarks, please review the [contributor's guide](CONTRIBUTING.md) for complete technical details.
1 change: 1 addition & 0 deletions samples/tools/autogenbench/autogenbench/__init__.py
@@ -0,0 +1 @@
from .version import __version__
4 changes: 4 additions & 0 deletions samples/tools/autogenbench/autogenbench/__main__.py
@@ -0,0 +1,4 @@
from .cli import main

if __name__ == "__main__":
    main()