Introduces AutoGenBench (microsoft#1048)
* Initial commit of AutoGenBench

* wording

* typo

* pre-commit reformulation

* Updated README to point to contributor's guide earlier.

* Simplified the description of the JSON format.

* Added print statements to indicate when run.sh and scenario.py are starting.

* Added SocietyOfMind scenario to GAIA.

* Pointing autogenbench clone command to the latest branch.

* Temporarily disable subsample option.

* Updated the GAIA readme to specify how to define a BING API key.

* Fixed and re-enabled the subsample option.

* Added a draft of a blog post.

* Updated authors.

* Incorporating Gagan's feedback.

* Fixed code formatting.

* Updated the help string in the docs.

* Light editing of the AutoGenBench blogpost.

* Support filtering on model tags.

* Added websurfer dependencies to Dockerfile.

* Renamed testbed -> autogenbench

* Attempting to fix formatting.

* Added more graceful handling of task timeouts (the script is allowed to terminate before Docker is stopped).

* Updated the blogpost based on Saleema's and Julia's feedback.

* Fixed formatting... again.

* Added a main MANIFEST to list available scenarios.

* Limit main manifest to directories.

* Manifests now use relative paths.

* All manifests are now relative.

* Updated the contributing guide, and addressed Windows path issues.

* Updated the version. Fixed formatting.

* Fixed formatting.

* De-listing Examples, since it has no clear tabulate criteria.

* Updated email in pyproject

* typo in blogpost

* wording

---------

Co-authored-by: Qingyun Wu <qingyun.wu@psu.edu>
Co-authored-by: Qingyun Wu <qingyun0327@gmail.com>
3 people authored and corleroux committed Jan 30, 2024
1 parent 73d2845 commit ac6a4fb
Showing 119 changed files with 2,626 additions and 1,883 deletions.
188 changes: 188 additions & 0 deletions samples/tools/autogenbench/CONTRIBUTING.md
@@ -0,0 +1,188 @@
# Contributing to AutoGenBench

As part of the broader AutoGen project, AutoGenBench welcomes community contributions. Contributions are subject to AutoGen's [contribution guidelines](https://microsoft.github.io/autogen/docs/Contribute), as well as a few additional AutoGenBench-specific requirements outlined here. You may also wish to develop your own private benchmark scenarios and the guidance in this document will help with such efforts as well. Below you will find the general requirements, followed by a detailed technical description.

## General Contribution Requirements
We ask that all contributions to AutoGenBench adhere to the following:

- Follow AutoGen's broader [contribution guidelines](https://microsoft.github.io/autogen/docs/Contribute)
- All AutoGenBench benchmarks should live in a subfolder of `/samples/tools/autogenbench/scenarios` alongside `HumanEval`, `GAIA`, etc.
- Benchmark scenarios should include a detailed README.md, in the root of their folder, describing the benchmark and providing citations where warranted.
- Benchmark data (tasks, ground truth, etc.) should be downloaded from their original sources rather than hosted in the AutoGen repository (unless the benchmark is original, and the repository *is* the original source)
- You can use the `Scripts/init_tasks.py` file to automate this download.
- Basic scoring should be compatible with the `autogenbench tabulate` command (e.g., by outputting logs compatible with the default tabulation mechanism, or by providing a `Scripts/custom_tabulate.py` file)
- If you wish your benchmark to be compatible with the `autogenbench clone` command, include a `MANIFEST.json` file in the root of your folder.

These requirements are further detailed below, but if you simply copy the `HumanEval` folder, you will already be off to a great start.

## Implementing and Running Benchmark Tasks
At the core of any benchmark is a set of tasks. To implement tasks that are runnable by AutoGenBench, you must adhere to AutoGenBench's templating and scenario expansion algorithms, as outlined below.

### Task Definitions

All tasks are stored in JSONL files (in subdirectories under `./Tasks`). Each line of a tasks file is a JSON object with the following schema:

```
{
    "id": string,
    "template": dirname,
    "substitutions": {
        "filename1": {
            "find_string1_1": replace_string1_1,
            "find_string1_2": replace_string1_2,
            ...
            "find_string1_M": replace_string1_M
        },
        "filename2": {
            "find_string2_1": replace_string2_1,
            "find_string2_2": replace_string2_2,
            ...
            "find_string2_N": replace_string2_N
        }
    }
}
```

For example:

```
{
    "id": "two_agent_stocks_gpt4",
    "template": "default_two_agents",
    "substitutions": {
        "scenario.py": {
            "__MODEL__": "gpt-4"
        },
        "prompt.txt": {
            "__PROMPT__": "Plot and save to disk a chart of NVDA and TESLA stock price YTD."
        }
    }
}
```

In this example, the string `__MODEL__` will be replaced in the file `scenario.py`, while the string `__PROMPT__` will be replaced in the `prompt.txt` file.

The `template` field can also take on a list value, but this usage is considered advanced and is not described here. See the `autogenbench/run_cmd.py` code, or the `GAIA` benchmark tasks files for additional information about this option.
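
Because each line of a tasks file is an independent JSON object, tasks files are easy to generate programmatically. The following is a minimal sketch (illustrative only; the output filename and model list are placeholders) that writes a small tasks file sweeping two models over the `default_two_agents` template from the earlier example:

```python
# generate_tasks.py -- an illustrative helper, not part of AutoGenBench itself.
# Writes one JSON object per line (JSONL), following the schema described above.
import json
import os

MODELS = ["gpt-4", "gpt-3.5-turbo"]  # hypothetical model sweep
PROMPT = "Plot and save to disk a chart of NVDA and TESLA stock price YTD."

os.makedirs("Tasks", exist_ok=True)
with open("Tasks/two_agent_stocks.jsonl", "wt") as fh:
    for model in MODELS:
        task = {
            "id": f"two_agent_stocks_{model}",
            "template": "default_two_agents",
            "substitutions": {
                "scenario.py": {"__MODEL__": model},
                "prompt.txt": {"__PROMPT__": PROMPT},
            },
        }
        fh.write(json.dumps(task) + "\n")
```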


## Task Instance Expansion Algorithm

Once the tasks have been defined, as per above, they must be "instantiated" before they can be run. This instantiation happens automatically when the user issues the `autogenbench run` command and involves creating a local folder to share with Docker. Each instance and repetition gets its own folder along the path: `./results/[scenario]/[task_id]/[instance_id]`. For the sake of brevity, we will refer to this folder as the `DEST_FOLDER`.

The algorithm for populating the `DEST_FOLDER` is as follows (an illustrative sketch appears after the list):

1. Pre-populate DEST_FOLDER with all the basic starter files for running a scenario (found in `autogenbench/template`).
2. Recursively copy the template folder specified in the JSONL line to DEST_FOLDER (if the JSON `template` attribute points to a folder). If the JSON's `template` attribute instead points to a file, copy the file, but rename it to `scenario.py`.
3. Apply any string replacements, as outlined in the prior section.
4. Write a run.sh file to DEST_FOLDER that will be executed by Docker when it is loaded. The `run.sh` is described below.
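
For illustration only, the core of this expansion can be pictured as the Python sketch below. It is not the actual implementation (the authoritative logic lives in `autogenbench/run_cmd.py`), and the parameter names are assumptions made for the example:

```python
# Illustrative sketch of task instantiation -- not the actual run_cmd.py code.
import os
import shutil


def expand_task(task, templates_dir, starter_files_dir, dest_folder):
    """Populate dest_folder for one task instance, following the steps above."""
    # 1. Pre-populate with the basic starter files for running a scenario.
    shutil.copytree(starter_files_dir, dest_folder)

    # 2. Overlay the template named in the JSONL line (folder case shown here).
    template_path = os.path.join(templates_dir, task["template"])
    shutil.copytree(template_path, dest_folder, dirs_exist_ok=True)

    # 3. Apply the string substitutions from the JSONL line.
    for filename, replacements in task.get("substitutions", {}).items():
        target = os.path.join(dest_folder, filename)
        with open(target, "rt") as fh:
            content = fh.read()
        for find_str, replace_str in replacements.items():
            content = content.replace(find_str, replace_str)
        with open(target, "wt") as fh:
            fh.write(content)

    # 4. (The real tool also writes a run.sh here; see the next section.)
```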

## Scenario Execution Algorithm

Once the task has been instantiated, it is run (via `run.sh`). This script will execute the following steps:

1. If a file named `global_init.sh` is present, run it.
2. If a file named `scenario_init.sh` is present, run it.
3. Install the requirements.txt file (if running in Docker)
4. Run the task via `python scenario.py`
5. If the scenario.py exited cleanly (exit code 0), then print "SCENARIO.PY COMPLETE !#!#"
6. Clean up (delete cache, etc.)
7. If a file named `scenario_finalize.sh` is present, run it.
8. If a file named `global_finalize.sh` is present, run it.
9. echo "RUN COMPLETE !#!#", signaling that all steps completed.

Notably, this means that scenarios can add custom init and teardown logic by including `scenario_init.sh` and `scenario_finalize.sh` files.

At the time of this writing, the run.sh file is as follows:

```sh
export AUTOGEN_TESTBED_SETTING="Docker"
umask 000

# Run the global init script if it exists
if [ -f global_init.sh ] ; then
. ./global_init.sh
fi

# Run the scenario init script if it exists
if [ -f scenario_init.sh ] ; then
. ./scenario_init.sh
fi

# Run the scenario
pip install -r requirements.txt
python scenario.py
EXIT_CODE=$?
if [ $EXIT_CODE -ne 0 ]; then
echo SCENARIO.PY EXITED WITH CODE: $EXIT_CODE !#!#
else
echo SCENARIO.PY COMPLETE !#!#
fi

# Clean up
if [ -d .cache ] ; then
rm -Rf .cache
fi

# Run the scenario finalize script if it exists
if [ -f scenario_finalize.sh ] ; then
. ./scenario_finalize.sh
fi

# Run the global finalize script if it exists
if [ -f global_finalize.sh ] ; then
. ./global_finalize.sh
fi

echo RUN.SH COMPLETE !#!#
```

Be warned that this listing is provided here for illustration purposes, and may vary over time. The source of truth is the set of `run.sh` files found in the ``./results/[taskset]/[task_id]/[instance_id]`` folders.


## Integrating with the `tabulate` and `clone` commands

The above details are sufficient for defining and running tasks, but if you wish to support the `autogenbench tabulate` and `autogenbench clone` commands, a few additional steps are required.

### Tabulations

If you wish to leverage the default tabulation logic, it is as simple as arranging for your `scenario.py` file to output the string "ALL TESTS PASSED !#!#" to the console in the event that a task was solved correctly.
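
For example, the tail of a `scenario.py` might end with something like the following sketch, where `all_checks_passed()` is a stand-in (purely illustrative) for whatever verification logic your benchmark performs:

```python
# Tail of a hypothetical scenario.py. The only requirement of the default
# tabulator is that this exact string is printed when the task is solved.
def all_checks_passed() -> bool:
    # Placeholder for your benchmark's own verification logic.
    return True


if all_checks_passed():
    print("ALL TESTS PASSED !#!#")
```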

If you wish to implement your own tabulation logic, simply create the file `Scripts/custom_tabulate.py` and include a `main(args)` method. Here, the `args` parameter will be provided by AutoGenBench, and is a drop-in replacement for `sys.argv`. In particular, `args[0]` will be the invocation command (similar to the executable or script name in `sys.argv`), and the remaining values (`args[1:]`) are the command line parameters.

Should you provide a custom tabulation script, please implement `--help` and `-h` options for documenting your interface.
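
A minimal skeleton might look like the sketch below. It is illustrative only: it assumes the default results layout and success string described elsewhere in this guide, and it relies on `argparse`, which provides the `-h`/`--help` options automatically:

```python
# Scripts/custom_tabulate.py -- minimal illustrative skeleton.
import argparse
import os
import sys


def main(args):
    # args is a drop-in replacement for sys.argv: args[0] is the invocation
    # command, and args[1:] are the command-line parameters.
    parser = argparse.ArgumentParser(prog=args[0], description="Tabulate results for this benchmark.")
    parser.add_argument("runlogs", help="Path to the results folder, e.g., results/r_human_eval_two_agents")
    parsed = parser.parse_args(args[1:])

    # Walk results/[task_id]/[instance_id] and report PASS/FAIL per instance.
    for task_id in sorted(os.listdir(parsed.runlogs)):
        task_dir = os.path.join(parsed.runlogs, task_id)
        if not os.path.isdir(task_dir):
            continue
        for instance_id in sorted(os.listdir(task_dir)):
            console_log = os.path.join(task_dir, instance_id, "console_log.txt")
            if not os.path.isfile(console_log):
                continue
            with open(console_log, "rt") as fh:
                passed = "ALL TESTS PASSED !#!#" in fh.read()
            print(f"{task_id}\t{instance_id}\t{'PASS' if passed else 'FAIL'}")


if __name__ == "__main__":
    main(sys.argv)
```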

The `scenarios/GAIA/Scripts/custom_tabulate.py` file is a great example of custom tabulation. It also shows how you can reuse some components of the default tabulator to speed up development.


### Cloning

If you wish your benchmark to be available via the `autogenbench clone` command, you will need to take three additional steps:

#### Manifest
First, provide a `MANIFEST.json` file in the root of your benchmark. An example is provided below, from which you can see the schema:

```json
{
    "files": {
        "Templates/TwoAgents/prompt.txt": "Templates/TwoAgents/prompt.txt",
        "Templates/TwoAgents/coding/my_tests.py": "Templates/TwoAgents/coding/my_tests.py",
        "Templates/TwoAgents/scenario.py": "Templates/TwoAgents/scenario.py",
        "README.md": "README.md",
        "Scripts/init_tasks.py": "Scripts/init_tasks.py",
        "Scripts/custom_tabulate.py": "Scripts/custom_tabulate.py"
    }
}
```

The keys of the `files` dictionary are local paths, relative to your benchmark's root directory. The values are relative paths in the AutoGen GitHub repository (relative to the folder where the MANIFEST.json file is located). In most cases, the keys and values will be identical.

#### SCENARIOS dictionary
Second, you must add an entry to the `scenarios` dictionary in `autogen/samples/tools/autogenbench/scenarios/MANIFEST.json`.

#### Scripts/init_tasks.py
Finally, you should provide a `Scripts/init_tasks.py` file in your benchmark folder, and include a `main()` method therein. This method will be loaded and called automatically by `autogenbench clone` after all manifest files have been downloaded.

This `init_tasks.py` script is a great place to download benchmarks from their original sources and convert them to the JSONL format required by AutoGenBench (a rough skeleton follows the examples below):
- See `HumanEval/Scripts/init_tasks.py` for an example of how to expand a benchmark from an original GitHub repository.
- See `GAIA/Scripts/init_tasks.py` for an example of how to expand a benchmark from `Hugging Face Hub`.
- See `MATH/Scripts/init_tasks.py` for an example of how to expand a benchmark from an author-hosted website.
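
As a rough sketch of the expected shape (the URL, field names, and template path below are placeholders, not a real benchmark), a `Scripts/init_tasks.py` might look like the following:

```python
# Scripts/init_tasks.py -- illustrative skeleton only; see the HumanEval, GAIA,
# and MATH versions referenced above for real, working examples.
import json
import os
import urllib.request

# Placeholder URL: substitute the original source of your benchmark data.
SOURCE_URL = "https://example.com/my_benchmark/tasks.json"


def download_tasks():
    with urllib.request.urlopen(SOURCE_URL) as response:
        return json.loads(response.read().decode("utf-8"))


def main():
    # Assumed source format: a list of {"name": ..., "question": ...} records.
    records = download_tasks()
    os.makedirs("Tasks", exist_ok=True)
    with open(os.path.join("Tasks", "my_benchmark_two_agents.jsonl"), "wt") as fh:
        for record in records:
            task = {
                "id": record["name"],
                "template": "Templates/TwoAgents",
                "substitutions": {
                    "scenario.py": {"__MODEL__": "gpt-4"},
                    "prompt.txt": {"__PROMPT__": record["question"]},
                },
            }
            fh.write(json.dumps(task) + "\n")


if __name__ == "__main__":
    main()
```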
4 changes: 4 additions & 0 deletions samples/tools/autogenbench/MANIFEST.in
@@ -0,0 +1,4 @@
recursive-exclude scenarios *
recursive-exclude results *
recursive-exclude tests *
recursive-exclude utils *
172 changes: 172 additions & 0 deletions samples/tools/autogenbench/README.md
@@ -0,0 +1,172 @@
# AutoGenBench

AutoGenBench is a tool for repeatedly running a set of pre-defined AutoGen tasks in a setting with tightly-controlled initial conditions. With each run, AutoGenBench will start from a blank slate. The agents being evaluated will need to work out what code needs to be written, and what libraries or dependencies to install, to solve tasks. The results of each run are logged, and can be ingested by analysis or metrics scripts (such as `autogenbench tabulate`). By default, all runs are conducted in freshly-initialized docker containers, providing the recommended level of consistency and safety.

AutoGenBench works with all AutoGen 0.1.* and 0.2.* versions.

## Technical Specifications

If you are already an AutoGenBench pro, and want the full technical specifications, please review the [contributor's guide](CONTRIBUTING.md).


## Docker Requirement
AutoGenBench also requires Docker (Desktop or Engine). **It will not run in GitHub Codespaces**, unless you opt for native execution (which is strongly discouraged). To install Docker Desktop see [https://www.docker.com/products/docker-desktop/](https://www.docker.com/products/docker-desktop/).

## Installation and Setup

**To get the most out of AutoGenBench, the `autogenbench` package should be installed**. At present, the easiest way to do this is to install it via `pip`:

```
pip install autogenbench
```

If you would prefer working from source code (e.g., for development, or to utilize an alternate branch), simply clone the [AutoGen](https://github.com/microsoft/autogen) repository, then install `autogenbench` via:

```
pip install -e autogen/samples/tools/autogenbench
```

After installation, you must configure your API keys. As with other AutoGen applications, AutoGenBench will look for the OpenAI keys in the OAI_CONFIG_LIST file in the current working directory, or the OAI_CONFIG_LIST environment variable. This behavior can be overridden using a command-line parameter described later.

If you will be running multiple benchmarks, it is often most convenient to leverage the environment variable option. You can load your keys into the environment variable by executing:

```
export OAI_CONFIG_LIST=$(cat ./OAI_CONFIG_LIST)
```

If an OAI_CONFIG_LIST is *not* provided (by means of file or environment variable), AutoGenBench will use the OPENAI_API_KEY environment variable instead.


For some benchmark scenarios, additional keys may be required (e.g., keys for the Bing Search API). These can be added to an `ENV.json` file in the current working folder. An example `ENV.json` file is provided below:

```
{
"BING_API_KEY": "xxxyyyzzz"
}
```

## A Typical Session
Once AutoGenBench and necessary keys are installed, a typical session will look as follows:

```
autogenbench clone HumanEval
cd HumanEval
autogenbench run Tasks/r_human_eval_two_agents.jsonl
autogenbench tabulate results/r_human_eval_two_agents
```

Where:
- `autogenbench clone HumanEval` downloads and expands the HumanEval benchmark scenario.
- `autogenbench run Tasks/r_human_eval_two_agents.jsonl` runs the tasks defined in `Tasks/r_human_eval_two_agents.jsonl`.
- `autogenbench tabulate results/r_human_eval_two_agents` tabulates the results of the run.

Each of these commands has extensive in-line help via:

- `autogenbench --help`
- `autogenbench clone --help`
- `autogenbench run --help`
- `autogenbench tabulate --help`

**NOTE:** If you are running `autogenbench` from within the repository, you don’t need to run `autogenbench clone`. Instead, navigate to the appropriate scenario folder (e.g., `scenarios/HumanEval`) and run the `Scripts/init_tasks.py` file.

More details of each command are provided in the sections that follow.

## Cloning Benchmarks
To clone an existing benchmark, simply run:
```
autogenbench clone [BENCHMARK]
```

For example,

```
autogenbench clone HumanEval
```

To see which existing benchmarks are available to clone, run:

```
autogenbench clone --list
```

## Running AutoGenBench

To run a benchmark (which executes the tasks, but does not compute metrics), simply execute:
```
cd [BENCHMARK]
autogenbench run Tasks
```

For example,
```
cd HumanEval
autogenbench run Tasks
```

The default is to run each task once. To run each scenario 10 times, use:

```
autogenbench run --repeat 10 Tasks
```

The `autogenbench` command-line tool accepts a number of command-line arguments to control various parameters of execution. Type ``autogenbench run -h`` to explore these options:

```
'autogenbench run' will run the specified autogen scenarios for a given number of repetitions and record all logs and trace information. When running in a Docker environment (default), each run will begin from a common, tightly controlled, environment. The resultant logs can then be further processed by other scripts to produce metrics.

positional arguments:
  scenario              The JSONL scenario file to run. If a directory is specified,
                        then all JSONL scenarios in the directory are run. (default:
                        ./scenarios)

options:
  -h, --help            show this help message and exit
  -c CONFIG, --config CONFIG
                        The environment variable name or path to the OAI_CONFIG_LIST (default: OAI_CONFIG_LIST).
  -r REPEAT, --repeat REPEAT
                        The number of repetitions to run for each scenario (default: 1).
  -s SUBSAMPLE, --subsample SUBSAMPLE
                        Run on a subsample of the tasks in the JSONL file(s). If a decimal value is specified, then run on
                        the given proportion of tasks in each file. For example "0.7" would run on 70% of tasks, and "1.0"
                        would run on 100% of tasks. If an integer value is specified, then randomly select *that* number of
                        tasks from each specified JSONL file. For example "7" would run 7 tasks, while "1" would run only 1
                        task from each specified JSONL file. (default: 1.0; which is 100%)
  -m MODEL, --model MODEL
                        Filters the config_list to include only models matching the provided model name (default: None, which
                        is all models).
  --requirements REQUIREMENTS
                        The requirements file to pip install before running the scenario.
  -d DOCKER_IMAGE, --docker-image DOCKER_IMAGE
                        The Docker image to use when running scenarios. Can not be used together with --native. (default:
                        'autogenbench:default', which will be created if not present)
  --native              Run the scenarios natively rather than in docker. NOTE: This is not advisable, and should be done
                        with great caution.
```

## Results

By default, AutoGenBench stores results in a folder hierarchy with the following template:

``./results/[scenario]/[task_id]/[instance_id]``

For example, consider the following folders:

``./results/default_two_agents/two_agent_stocks/0``
``./results/default_two_agents/two_agent_stocks/1``

...

``./results/default_two_agents/two_agent_stocks/9``

These folders hold the results for the ``two_agent_stocks`` task of the ``default_two_agents`` tasks file. The ``0`` folder contains the results of the first instance / run. The ``1`` folder contains the results of the second run, and so on. You can think of the _task_id_ as mapping to a prompt, or a unique set of parameters, while the _instance_id_ defines a specific attempt or run.

Within each folder, you will find the following files:

- *timestamp.txt*: records the date and time of the run, along with the version of the pyautogen library installed
- *console_log.txt*: all console output produced by Docker when running AutoGen. Read this like you would a regular console.
- *[agent]_messages.json*: for each agent, a log of their message dictionaries
- *./coding*: A directory containing all code written by AutoGen, and all artifacts produced by that code.
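
Beyond `autogenbench tabulate`, this layout is easy to post-process with your own scripts. The following sketch (illustrative only, and assuming the default success string described in the [contributor's guide](CONTRIBUTING.md)) counts how many runs of each task printed that string to `console_log.txt`:

```python
# Illustrative only: summarize pass counts from a results folder.
import os
import sys

results_dir = sys.argv[1] if len(sys.argv) > 1 else "./results/default_two_agents"

for task_id in sorted(os.listdir(results_dir)):
    task_dir = os.path.join(results_dir, task_id)
    if not os.path.isdir(task_dir):
        continue
    passes, total = 0, 0
    for instance_id in os.listdir(task_dir):
        console_log = os.path.join(task_dir, instance_id, "console_log.txt")
        if not os.path.isfile(console_log):
            continue
        total += 1
        with open(console_log, "rt", errors="replace") as fh:
            if "ALL TESTS PASSED !#!#" in fh.read():
                passes += 1
    print(f"{task_id}: {passes}/{total} runs passed")
```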

## Contributing or Defining New Tasks or Benchmarks

If you would like to develop -- or even contribute -- your own tasks or benchmarks, please review the [contributor's guide](CONTRIBUTING.md) for complete technical details.
1 change: 1 addition & 0 deletions samples/tools/autogenbench/autogenbench/__init__.py
@@ -0,0 +1 @@
from .version import __version__
4 changes: 4 additions & 0 deletions samples/tools/autogenbench/autogenbench/__main__.py
@@ -0,0 +1,4 @@
from .cli import main

if __name__ == "__main__":
main()