-```
-
-Example output:
-```
-Trial 0 | Total Correct: 10 | Total Problems: 17
-```
diff --git a/samples/tools/testbed/scenarios/MATH/count_correct_math.py b/samples/tools/testbed/scenarios/MATH/count_correct_math.py
deleted file mode 100644
index 69766dfb0c5..00000000000
--- a/samples/tools/testbed/scenarios/MATH/count_correct_math.py
+++ /dev/null
@@ -1,56 +0,0 @@
-import argparse
-import json
-import os
-
-
-def main(args):
- stars = "*" * 100
-
- # initialize the correct count for each trial
- correct_count = [0 for i in range(args.num_trials)]
-
- for i in range(args.num_trials):
- for problem_name in os.listdir(args.path):
- problem_path = os.path.join(args.path, problem_name, str(i))
- if os.path.isdir(problem_path):
- checker_file_path = os.path.join(problem_path, "checker_messages.json")
-
- with open(checker_file_path, "r") as file:
- checker_messages = json.load(file)
-
- check_result = checker_messages["checker_proxy"][-1]["content"].lower()
-
- if (
- "the answer is correct" in check_result
- or "the answer is approximated but should be correct" in check_result
- ):
- correct_count[i] += 1
- # print(f"{problem_name} | Correct")
- # else:
- # print(f"{problem_name} | Wrong")
-
- print(f"{stars}\nTrial {i} | Total Correct: {correct_count[i]} | Total Problems: {len(os.listdir(args.path))}")
-
-
-if __name__ == "__main__":
- parser = argparse.ArgumentParser(
- description="""Print Math Problems results.""".strip(),
- )
- parser.add_argument(
- "--path",
- "-p",
- type=str,
- default="./results/problems/",
- help="Path to the problems directory",
- )
- # num trials
- parser.add_argument(
- "--num_trials",
- "-n",
- type=int,
- default=1,
- help="Number of trials to check",
- )
-
- args = parser.parse_args()
- main(args)
diff --git a/samples/tools/testbed/scenarios/MATH/problems_to_json.py b/samples/tools/testbed/scenarios/MATH/problems_to_json.py
deleted file mode 100644
index 4dd9dba0d12..00000000000
--- a/samples/tools/testbed/scenarios/MATH/problems_to_json.py
+++ /dev/null
@@ -1,77 +0,0 @@
-import json
-
-problems = [
- "Find all $x$ that satisfy the inequality $(2x+10)(x+3)<(3x+9)(x+8)$. Express your answer in interval notation.",
- "Find the value of $a_2+a_4+a_6+a_8+\\dots+a_{98}$ if $a_1, a_2, a_3, \\ldots$ is an arithmetic progression with common difference $1$ and \\[a_1+a_2+a_3+\\dots+a_{98}=137.\\]",
- "Tina the tourist goes on a trip. She starts at the origin and drives north (in the positive $y$ direction) for $10$ units. Then she turns east (the positive $x$ direction) and as she's turning her camera flies out the window and lands exactly at $(0,10)$. She then drives $9$ units east, turns and drives $8$ units north. She continues this pattern of turning and driving one unit less than after the previous turn, until stopping after driving $1$ unit east. She reaches for her camera only to find it missing! She activates the GPS homing device on her camera and drives back to it in a straight line. What is the equation of this line? Express your answer as $ax+by=c$, where $a$, $b$, and $c$ are integers, $a>0$, and $a$ is as small as possible.",
- "For what negative value of $k$ is there exactly one solution to the system of equations \\begin{align*}\ny &= 2x^2 + kx + 6 \\\\\ny &= -x + 4?\n\\end{align*}",
- "If $\\frac{3x^2-4x+1}{x-1}=m$, and $x$ can be any real number except $1$, what real values can $m$ NOT have?",
- "Find all numbers $a$ for which the graph of $y=x^2+a$ and the graph of $y=ax$ intersect. Express your answer in interval notation.",
- "If $\\displaystyle{f(x)=x^{(x+1)}(x+2)^{(x+3)}}$, then find the value of $f(0)+f(-1)+f(-2)+f(-3)$.",
- "An envelope contains eight bills: 2 ones, 2 fives, 2 tens, and 2 twenties. Two bills are drawn at random without replacement. What is the probability that their sum is $\\$20$ or more?",
- "Find the coefficient of $x^2$ in the expansion of the product $$(1-x)(1+2x)(1-3x)\\dotsm(1+14x)(1-15x).$$",
- "All 50 states as well as the District of Columbia and Puerto Rico, have distinct two-letter postal abbreviations. If a two-letter sequence of letters (such as CO or EE) is chosen at random, what is the probability that it is a postal abbreviation for one of the 50 states, the District of Columbia, or Puerto Rico? Express your answer as a common fraction.",
- "Let $x$ and $y$ be real numbers. Find the set of possible values of\n\\[\\frac{(x + y)(1 - xy)}{(1 + x^2)(1 + y^2)}.\\]",
- "On a number line, the coordinates of $P$ and $Q$ are 8 and 48, respectively. The midpoint of $\\overline{PQ}$ is $B$, the midpoint of $\\overline{BQ}$ is $C$, and the midpoint of $\\overline{PC}$ is $D$. What is the coordinate of $D$?",
- "Find $24^{-1} \\pmod{11^2}$. That is, find the residue $b$ for which $24b \\equiv 1\\pmod{11^2}$.\n\nExpress your answer as an integer from $0$ to $11^2-1$, inclusive.",
- "There are two cameras that take pictures of a traffic intersection. Camera A starts taking pictures at $6$ AM and takes a picture every $11$ minutes. Camera B starts taking pictures at $7$ AM and takes pictures every $7$ minutes. Camera A and Camera B take a picture at the same time at four different times before noon. When Camera A and Camera B take their last picture together, how many minutes before noon is it?",
- "Let $z$ be a complex number such that $z^{13} = 1.$ Let $w_1,$ $w_2,$ $\\dots,$ $w_k$ be all the possible values of\n\\[z + z^3 + z^4 + z^9 + z^{10} + z^{12}.\\]Find $w_1^2 + w_2^2 + \\dots + w_k^2.$",
- "There are 190 people on the beach. 110 are wearing sunglasses, 70 are wearing bathing suits, and 95 are wearing a hat. Everyone is wearing at least one of these items. 30 are wearing both bathing suits and sunglasses. 25 are wearing both bathing suits and a hat. 40 are wearing both sunglasses and a hat. How many people are wearing all three items?",
- "Completely simplify and rationalize the denominator: $$\\frac{\\sqrt{160}}{\\sqrt{252}}\\times\\frac{\\sqrt{245}}{\\sqrt{108}}$$",
-]
-answers = [
- # 6 algebra
- "(-\\infty, -14)\\cup(-3,\\infty)",
- "93",
- "4x-5y=-50",
- "-5",
- "2",
- "(-\\infty,0]\\cup[4,\\infty)",
- # 11 problems, 2 from each category (1 algebra problem was deleted)
- "\\frac{10}{9}",
- "\\frac{1}{2}",
- "-588",
- " \\frac{1}{13}",
- "\\left[ -\\frac{1}{2}, \\frac{1}{2} \\right]",
- "23",
- "116",
- "41",
- "43",
- "10",
- "\\frac{5\\sqrt{42}}{27}",
-]
-
-
-def problem_to_json():
- with open("problems.jsonl", "w") as f:
- for i, problem in enumerate(problems):
- # a = {
- # 'id': problem{i}',
- # 'template': 'scenario.py',
- # 'substitutions': {
- # '__PROMPT__': problem,
- # '__ANSWER__': answers[i],
- # },
- # }
- a = {
- "id": f"problem{i}",
- "template": "./",
- "substitutions": {"prompt.txt": {"__PROMPT__": problem}, "answer.txt": {"__ANSWER__": answers[i]}},
- }
- # Convert the dictionary to a JSON string and write it to the file
- json_string = json.dumps(a)
- f.write(json_string + "\n") # Add a newline character after each JSON object
-
-
-problem_to_json()
-
-problems = []
-with open("problems.jsonl", "r") as file:
- for line in file:
- # Parse each line as a JSON object
- problem = json.loads(line)
- problems.append(problem)
- print(problem["substitutions"])
- print()
-
-# Now 'problems' is a list of dictionaries, each representing a problem
diff --git a/samples/tools/testbed/utils/collate_autogpt.py b/samples/tools/testbed/utils/collate_autogpt.py
deleted file mode 100644
index 3dc8bcdba59..00000000000
--- a/samples/tools/testbed/utils/collate_autogpt.py
+++ /dev/null
@@ -1,108 +0,0 @@
-import argparse
-import os
-import re
-import subprocess
-import sys
-
-
-def collate(results_dir="results"):
- """
- Collate the results of running the AutoGPT tests.
-
- Args:
- results_dir (str, optional): The folder where results are saved. Defaults to "results".
- """
-
- all_results = list()
- max_instances = 0
-
- for test_name in os.listdir(results_dir):
- test_path = os.path.join(results_dir, test_name)
-
- # Collect the results vector
- results = [test_name]
-
- instance = 0
- instance_dir = os.path.join(test_path, str(instance))
- while os.path.isdir(instance_dir):
- console_log = os.path.join(instance_dir, "console_log.txt")
- if os.path.isfile(console_log):
- with open(console_log, "rt") as fh:
- content = fh.read()
- if "ALL TESTS PASSED!" in content:
- # Ideally we would have a more distinctive pattern.
- results.append(str(len(re.findall(r"\n(.*?) \(to (.*?)\)\:\n", content))))
- else:
- # Sometimes the task actually succeeds, but the check.py isn't properly called
- result = subprocess.run(
- [sys.executable, "../check.py"],
- cwd=os.path.join(instance_dir, "coding"),
- capture_output=True,
- text=True,
- )
- if "error" in result.stderr or result.returncode != 0:
- results.append("-1")
- else:
- # The task actually succeeds.
- if "ALL TESTS PASSED!" in result.stdout:
- results.append(str(len(re.findall(r"\n(.*?) \(to (.*?)\)\:\n", content))))
- else:
- results.append("-1")
- else:
- # Missing results will appear as blanks
- results.append("")
-
- instance += 1
- instance_dir = os.path.join(test_path, str(instance))
-
- max_instances = max(max_instances, instance)
-
- # Buffer the results
- all_results.append(results)
-
- # Create a header
- header = "TestName"
- for i in range(0, max_instances):
- header += ",Trial" + str(i)
- print(header)
-
- # Print a fully-populated table of results
- for r in all_results:
- while len(r) < max_instances + 1:
- r.append("")
- print(",".join(r))
-
-
-if __name__ == "__main__":
- script_path = os.path.realpath(__file__)
- script_name = os.path.basename(script_path)
- script_dir = os.path.dirname(script_path)
-
- # Path to the default results directory
- # (relative to this script, up one directory, then into the results folder)
- default_results_dir = os.path.realpath(os.path.join(script_dir, os.path.pardir, "results"))
-
- parser = argparse.ArgumentParser(
- description=f"""
-{script_name} will collate the results of the AutoGPT scenarios and output them to a CSV. The CSV format is as follows:
-
-TestName, Trial0, Trial1, ..., TrialN
-Test_1, x_10, x_11, ..., X_1N
-Test_2, x_20, x_21, ..., X_2N
-...
-Test_M, x_M0, x_M1, ..., X_MN
-
-
-Where x_ij is the number of AssistantAgent conversation turns needed to pass all the tests for problem i, in Trial/repetition j. If the agent was not able to pass the tests by the end of the conversation, the value will be -1. If data for the trial is missing, the value will be an empty string "".
-""".strip(),
- formatter_class=argparse.RawTextHelpFormatter,
- )
-
- parser.add_argument(
- "scenario",
- nargs="?",
- help="Path to the scenario results. (default: " + default_results_dir + ")",
- default=default_results_dir,
- )
- args = parser.parse_args()
- collate(args.scenario)
diff --git a/samples/tools/testbed/utils/collate_gaia_csv.py b/samples/tools/testbed/utils/collate_gaia_csv.py
deleted file mode 100644
index 88f1ec819ed..00000000000
--- a/samples/tools/testbed/utils/collate_gaia_csv.py
+++ /dev/null
@@ -1,128 +0,0 @@
-import os
-import json
-import re
-import sys
-import argparse
-
-
-def normalize_answer(a):
- # Lower case
- # Trim (left and right)
- # Replace multiple spaces with one space
- # Remove trailing punctuation
- return re.sub(r"[\.\!\?]+$", "", re.sub(r"\s+", " ", a.strip().lower()))
-
-
-def collate(results_dir):
- """
- Collate the results of running GAIA
-
- Args:
- results_dir (path): The folder where results are saved.
- """
-
- all_results = list()
- max_instances = 0
-
- for test_id in os.listdir(results_dir):
- test_path = os.path.join(results_dir, test_id)
-
- # Collect the results vector
- results = [test_id]
-
- instance = 0
- instance_dir = os.path.join(test_path, str(instance))
- while os.path.isdir(instance_dir):
- expected_answer_file = os.path.join(instance_dir, "expected_answer.txt")
- if not os.path.isfile(expected_answer_file):
- # Expected answer is missing
- results.append("")
-
- instance += 1
- instance_dir = os.path.join(test_path, str(instance))
- continue
-
- expected_answer = "!!!NULL ANSWER!!!"
- with open(expected_answer_file, "rt") as fh:
- expected_answer = fh.read().strip()
-
- console_log_file = os.path.join(instance_dir, "console_log.txt")
- if not os.path.isfile(console_log_file):
- # Console log file missing
- results.append("")
-
- instance += 1
- instance_dir = os.path.join(test_path, str(instance))
- continue
-
- with open(console_log_file, "rt") as fh:
- console_log = fh.read()
-
- final_answer = ""
- m = re.search(r"FINAL ANSWER:(.*?)\n", console_log, re.DOTALL)
- if m:
- final_answer = m.group(1).strip()
-
- # print(f"Expected Answer: {expected_answer}\nAutogen Answer: {final_answer}\n")
-
- if normalize_answer(expected_answer) == normalize_answer(final_answer):
- results.append("1")
- else:
- results.append("-1")
-
- instance += 1
- instance_dir = os.path.join(test_path, str(instance))
-
- max_instances = max(max_instances, instance)
-
- # Buffer the results
- all_results.append(results)
-
- # Create a header
- header = "TestId"
- for i in range(0, max_instances):
- header += ",Trial" + str(i)
- print(header)
-
- # Print a fully-populated table of results
- for r in all_results:
- while len(r) < max_instances + 1:
- r.append("")
- print(",".join(r))
-
-
-###############################################################################
-if __name__ == "__main__":
- script_path = os.path.realpath(__file__)
- script_name = os.path.basename(script_path)
- script_dir = os.path.dirname(script_path)
-
- # Path to the default results directory
- # (relative to this script, up one directory, then into the results folder)
- default_results_dir = os.path.realpath(
- os.path.join(script_dir, os.path.pardir, "results", "gaia_validation_level_1__two_agents_gpt4")
- )
-
- parser = argparse.ArgumentParser(
- description=f"""
-{script_name} will collate the results of the GAIA scenarios and output them to a CSV. The CSV format is as follows:
-
-TestId, Trial0, Trial1, ..., TrialN
-uuid_1, x_10, x_11, ..., X_1N
-uuid_2, x_20, x_21, ..., X_2N
-...
-uuid_M, x_M0, x_M1, ..., X_MN
-
-Where uuid_i is the identifier of the ith test question, and x_ij is 1 or -1 depending on whether the test passed or failed, respectively. If data for the trial is missing (e.g., due to a runtime error), the value will be an empty string "".
-""".strip(),
- formatter_class=argparse.RawTextHelpFormatter,
- )
-
- parser.add_argument(
- "scenario",
- nargs="?",
- help="Path to the scenario results. (default: " + default_results_dir + ")",
- default=default_results_dir,
- )
- args = parser.parse_args()
- collate(args.scenario)
diff --git a/samples/tools/testbed/utils/collate_gaia_jsonl.py b/samples/tools/testbed/utils/collate_gaia_jsonl.py
deleted file mode 100644
index 6a4ac07cad3..00000000000
--- a/samples/tools/testbed/utils/collate_gaia_jsonl.py
+++ /dev/null
@@ -1,76 +0,0 @@
-import os
-import json
-import re
-import sys
-import argparse
-
-
-def normalize_answer(a):
- # Trim (left and right)
- # Replace multiple spaces with one space
- # Remove trailing punctuation
- # Trim again
- return re.sub(r"[\.\!\?]+$", "", re.sub(r"\s+", " ", a.strip())).strip()
-
-
-def collate(results_dir, instance=0):
- """
- Collate the results of running GAIA. Print the results in the format accepted by the leaderboard.
-
- Args:
- results_dir (path): The folder where results are saved.
- """
-
- for test_id in os.listdir(results_dir):
- test_path = os.path.join(results_dir, test_id)
-
- instance_dir = os.path.join(test_path, str(instance))
- console_log_file = os.path.join(instance_dir, "console_log.txt")
-
- final_answer = ""
- if os.path.isfile(console_log_file):
- with open(console_log_file, "rt") as fh:
- console_log = fh.read()
-
- final_answer = ""
- m = re.search(r"FINAL ANSWER:(.*?)\n", console_log, re.DOTALL)
- if m:
- final_answer = normalize_answer(m.group(1))
-
- # Clean up the GAIA logs so they don't have the Docker setup preamble
- m = re.search(r"^.*?\r?\n(user_proxy \(to assistant\).*$)", console_log, re.DOTALL)
- if m:
- console_log = m.group(1)
-
- print(json.dumps({"task_id": test_id, "model_answer": final_answer, "reasoning_trace": console_log}))
-
-
-###############################################################################
-if __name__ == "__main__":
- script_path = os.path.realpath(__file__)
- script_name = os.path.basename(script_path)
- script_dir = os.path.dirname(script_path)
-
- # Path to the default results directory
- # (relative to this script, up on directory, then into the results folder)
- default_results_dir = os.path.realpath(
- os.path.join(script_dir, os.path.pardir, "results", "gaia_validation_level_1__two_agents_gpt4")
- )
-
- parser = argparse.ArgumentParser(
- description=f"""
-{script_name} will collate the results of the GAIA scenarios into the jsonl format that can be submitted to the GAIA leaderboard.
-
-NOTE: You will likely need to concatenate results for level 1, level 2 and level 3 to form a complete submission.
-""".strip(),
- formatter_class=argparse.RawTextHelpFormatter,
- )
-
- parser.add_argument(
- "scenario",
- nargs="?",
- help="Path to the scenario results. (default: " + default_results_dir + ")",
- default=default_results_dir,
- )
- args = parser.parse_args()
- collate(args.scenario)
diff --git a/samples/tools/testbed/utils/collate_human_eval.py b/samples/tools/testbed/utils/collate_human_eval.py
deleted file mode 100644
index e46c83f84fc..00000000000
--- a/samples/tools/testbed/utils/collate_human_eval.py
+++ /dev/null
@@ -1,98 +0,0 @@
-import os
-import json
-import re
-import sys
-import argparse
-
-
-def collate(results_dir):
- """
- Collate the results of running HumanEval.
-
- Args:
- results_dir (path): The folder where results are saved.
- """
-
- all_results = list()
- max_instances = 0
-
- for test_id in os.listdir(results_dir):
- test_path = os.path.join(results_dir, test_id)
-
- # Collect the results vector
- results = [test_id]
-
- instance = 0
- instance_dir = os.path.join(test_path, str(instance))
- while os.path.isdir(instance_dir):
- console_log = os.path.join(instance_dir, "console_log.txt")
- if os.path.isfile(console_log):
- with open(console_log, "rt") as fh:
- content = fh.read()
- if "ALL TESTS PASSED !#!#" in content:
- # Ideally we would have a more distinctive pattern.
- results.append(str(len(re.findall(r"\n(.*?) \(to (.*?)\)\:\n", content))))
- else:
- results.append("-1")
-
- else:
- # Missing results will appear as blanks
- results.append("")
-
- instance += 1
- instance_dir = os.path.join(test_path, str(instance))
-
- max_instances = max(max_instances, instance)
-
- # Buffer the results
- all_results.append(results)
-
- # Create a header
- header = "TestId"
- for i in range(0, max_instances):
- header += ",Trial" + str(i)
- print(header)
-
- # Print a fully-populated table of results
- for r in all_results:
- while len(r) < max_instances + 1:
- r.append("")
- print(",".join(r))
-
-
-###############################################################################
-if __name__ == "__main__":
- script_path = os.path.realpath(__file__)
- script_name = os.path.basename(script_path)
- script_dir = os.path.dirname(script_path)
-
- # Path to the default results directory
- # (relative to this script, up one directory, then into the results folder)
- default_results_dir = os.path.realpath(
- os.path.join(script_dir, os.path.pardir, "results", "human_eval_two_agents_gpt4")
- )
-
- parser = argparse.ArgumentParser(
- description=f"""
-{script_name} will collate the results of the HumanEval scenarios and output them to a CSV. The CSV format is as follows:
-
-TestId, Trial0, Trial1, ..., TrialN
-HumanEval_1, x_10, x_11, ..., X_1N
-HumanEval_2, x_20, x_21, ..., X_2N
-...
-HumanEval_M, x_M0, x_M1, ..., X_MN
-
-
-Where x_ij is the number of AssistantAgent conversation turns needed to pass all the tests for problem i, in Trial/repetition j. If the agent was not able to pass the tests by the end of the conversation, the value will be -1. If data for the trial is missing, the value will be an empty string "".
-""".strip(),
- formatter_class=argparse.RawTextHelpFormatter,
- )
-
- parser.add_argument(
- "scenario",
- nargs="?",
- help="Path to the scenario results. (default: " + default_results_dir + ")",
- default=default_results_dir,
- )
- args = parser.parse_args()
- collate(args.scenario)
diff --git a/samples/tools/testbed/utils/expand_gaia.py b/samples/tools/testbed/utils/expand_gaia.py
deleted file mode 100644
index ed751b08132..00000000000
--- a/samples/tools/testbed/utils/expand_gaia.py
+++ /dev/null
@@ -1,110 +0,0 @@
-#
-# Run this file to convert a local clone of the GAIA dataset into the corresponding testbed scenarios:
-# (default output: ../scenarios/GAIA/gaia_[validation|test]_level_[1|2|3]__two_agents_gpt4.jsonl)
-#
-
-import json
-import os
-import sys
-import shutil
-
-SCRIPT_PATH = os.path.realpath(__file__)
-SCRIPT_NAME = os.path.basename(SCRIPT_PATH)
-SCRIPT_DIR = os.path.dirname(SCRIPT_PATH)
-SCENARIOS_DIR = os.path.realpath(os.path.join(SCRIPT_DIR, os.path.pardir, "scenarios", "GAIA"))
-
-
-def create_jsonl(name, tasks, template, model):
- """Creates a JSONL scenario file with a given name, list of HumanEval tasks, template path, and model."""
-
- with open(os.path.join(SCENARIOS_DIR, name + ".jsonl"), "wt") as fh:
- for task in tasks:
- print(f"Converting: [{name}] {task['task_id']}")
-
- # Figure out what files we need to copy
- template_cp_list = [template]
- if len(task["file_name"].strip()) > 0:
- template_cp_list.append(
- [
- os.path.join("GAIA_Files", task["file_name"].strip()),
- os.path.join("coding", task["file_name"].strip()),
- ]
- )
-
- record = {
- "id": task["task_id"],
- "template": template_cp_list,
- "substitutions": {
- "scenario.py": {
- "__MODEL__": model,
- "__FILE_NAME__": task["file_name"],
- "__PROMPT__": task["Question"],
- },
- "expected_answer.txt": {"__EXPECTED_ANSWER__": task["Final answer"]},
- },
- }
-
- fh.write(json.dumps(record).strip() + "\n")
-
-
-###############################################################################
-if __name__ == "__main__":
- if len(sys.argv) != 2:
- sys.exit(
- f"SYNTAX: python {SCRIPT_NAME} [path to GIA repository]\n\nNote: to clone the GAIA repository, do 'git clone https://huggingface.co/datasets/gaia-benchmark/GAIA'"
- )
-
- # Copy the relevant GAIA files
- gaia_path = os.path.realpath(sys.argv[1])
-
- gaia_validation_files = os.path.join(gaia_path, "2023", "validation")
- gaia_test_files = os.path.join(gaia_path, "2023", "test")
-
- if not os.path.isdir(gaia_validation_files) or not os.path.isdir(gaia_test_files):
- sys.exit(f"Error: '{gaia_path}' does not appear to be a copy of the GAIA repository.")
-
- gaia_merged_files = os.path.realpath(os.path.join(SCENARIOS_DIR, "GAIA_Files"))
-
- shutil.copytree(
- gaia_validation_files, gaia_merged_files, ignore=shutil.ignore_patterns("metadata.jsonl"), dirs_exist_ok=True
- )
- shutil.copytree(
- gaia_test_files, gaia_merged_files, ignore=shutil.ignore_patterns("metadata.jsonl"), dirs_exist_ok=True
- )
-
- # Load the GAIA data
- gaia_validation_tasks = [[], [], []]
- with open(os.path.join(gaia_validation_files, "metadata.jsonl")) as fh:
- for line in fh:
- data = json.loads(line)
- gaia_validation_tasks[data["Level"] - 1].append(data)
-
- gaia_test_tasks = [[], [], []]
- with open(os.path.join(gaia_test_files, "metadata.jsonl")) as fh:
- for line in fh:
- data = json.loads(line)
- gaia_test_tasks[data["Level"] - 1].append(data)
-
- models = {
- "gpt4": "gpt-4",
- }
-
- templates = {
- "two_agents": "Templates/BasicTwoAgents",
- }
-
- # Add coding directories if needed (these are usually empty and left out of the repo)
- for template in templates.values():
- code_dir_path = os.path.join(SCENARIOS_DIR, template, "coding")
- if not os.path.isdir(code_dir_path):
- os.mkdir(code_dir_path)
-
- # Create the various combinations of [models] x [templates]
- for m in models.items():
- for t in templates.items():
- create_jsonl(f"gaia_validation_level_1__{t[0]}_{m[0]}", gaia_validation_tasks[0], t[1], m[1])
- create_jsonl(f"gaia_validation_level_2__{t[0]}_{m[0]}", gaia_validation_tasks[1], t[1], m[1])
- create_jsonl(f"gaia_validation_level_3__{t[0]}_{m[0]}", gaia_validation_tasks[2], t[1], m[1])
- create_jsonl(f"gaia_test_level_1__{t[0]}_{m[0]}", gaia_test_tasks[0], t[1], m[1])
- create_jsonl(f"gaia_test_level_2__{t[0]}_{m[0]}", gaia_test_tasks[1], t[1], m[1])
- create_jsonl(f"gaia_test_level_3__{t[0]}_{m[0]}", gaia_test_tasks[2], t[1], m[1])
diff --git a/samples/tools/testbed/utils/metrics_gaia.py b/samples/tools/testbed/utils/metrics_gaia.py
deleted file mode 100644
index 6119f4f38f4..00000000000
--- a/samples/tools/testbed/utils/metrics_gaia.py
+++ /dev/null
@@ -1,97 +0,0 @@
-import os
-import sys
-import argparse
-import csv
-
-
-def metrics(results_fh):
- """
- Compute metrics from collated GAIA results.
-
- Args:
- results_fh (File Stream): A file stream containing the collated results in CSV.
- """
-
- reader = csv.reader(results_fh)
- first_row = next(reader) # Read the first line
-
- num_trials = len(first_row) - 1 # Don't count the first column (TestId)
-
- # Set up the counters
- counters = []
- for i in range(0, num_trials):
- counters.append({"successes": 0, "failures": 0, "missing": 0})
-
- # Load the results. We'll need to iterate over them a few times.
- results = list()
- for row in reader:
- name = row[0]
- trials = [(None if v.strip() == "" else int(v)) for v in row[1:]]
- for i in range(0, len(trials)):
- v = trials[i]
- if v is None:
- counters[i]["missing"] += 1
- elif v > 0:
- counters[i]["successes"] += 1
- else:
- counters[i]["failures"] += 1
-
- results.append([name, trials])
-
- def _safe_div(num, denom):
- if denom == 0:
- return ""
- else:
- return num / denom
-
- # Print the per-trial counts and score (one row per trial)
- for i in range(0, len(counters)):
- counter = counters[i]
- n = counter["successes"] + counter["failures"] + counter["missing"]
- score = _safe_div(counter["successes"], n)
- print(f"{i},{n},{counter['successes']},{counter['failures']},{counter['missing']},{score}")
-
-
-###############################################################################
-if __name__ == "__main__":
- script_path = os.path.realpath(__file__)
- script_name = os.path.basename(script_path)
- script_dir = os.path.dirname(script_path)
-
- parser = argparse.ArgumentParser(
- description=f"""
-{script_name} will compute metrics on the collated results of the GAIA scenarios. Use collate_gaia.py to prepare input to this script.
-
-The output will be formatted as a CSV with the following schema:
-
-Trial, n, successes, failures, missing, score
-0, N_0, s_0, f_0, m_0, p_0
-1, N_1, s_1, f_1, m_1, p_1
-...
-M, N_M, s_M, f_M, m_M, p_M
-
-Where:
-
- N_i is the number of questions in trial i
- s_i is the number of successes in trial i
- f_i is the number of failures in trial i
- m_i is the number of missing values in trial i
- p_i is the proportion of successes in trial i (i.e., s_i / N_i)
-
-""".strip(),
- formatter_class=argparse.RawTextHelpFormatter,
- )
-
- parser.add_argument(
- "scenario",
- nargs="?",
- help="Path to collated results. If '-' or omitted, read from stdin. (default: '-')",
- default="-",
- )
- args = parser.parse_args()
-
- if args.scenario == "" or args.scenario == "-":
- metrics(sys.stdin)
- else:
- with open(args.scenario, "rt") as fh:
- metrics(fh)
diff --git a/samples/tools/testbed/utils/metrics_human_eval.py b/samples/tools/testbed/utils/metrics_human_eval.py
deleted file mode 100644
index 25d9aa90fda..00000000000
--- a/samples/tools/testbed/utils/metrics_human_eval.py
+++ /dev/null
@@ -1,116 +0,0 @@
-import os
-import sys
-import argparse
-import csv
-
-
-def metrics(results_fh):
- """
- Compute metrics from collated HumanEval results.
-
- Args:
- results_fh (File Stream): A file stream containing the collated results in CSV.
- """
-
- reader = csv.reader(results_fh)
- first_row = next(reader) # Read the first line
-
- num_trials = len(first_row) - 1 # Don't count the first column (TestId)
- max_turns = 0
- num_rows = 0
-
- # Load the results. We'll need to iterate over them a few times.
- results = list()
- for row in reader:
- num_rows += 1
-
- name = row[0]
- trials = [(None if v.strip() == "" else int(v)) for v in row[1:]]
- for v in trials:
- if v is not None:
- max_turns = max(max_turns, v)
- results.append([name, trials])
-
- # Print the header
- header = ["Trial"]
- for i in range(1, max_turns + 1):
- header.append("cumulative_passes_by_turn_" + str(i))
- header.append("fails")
- header.append("missing")
- print(",".join(header))
-
- # Compute the metrics
- def _metrics_for_trial(t):
- counts = [None]
- fails = 0
- missing = 0
-
- # Compute cumulative passes for each conversation turn
- for i in range(1, max_turns + 1):
- counts.append(0)
- assert len(counts) == i + 1
-
- for r in results:
- v = r[1][t]
- if v is not None:
- v = int(v)
- if 0 <= v and v <= i:
- counts[i] += 1
-
- # Count missing and failed
- for r in results:
- v = r[1][t]
- if v is None:
- missing += 1
- elif int(v) < 0:
- fails += 1
-
- # Prepare the row in the format specified by the header
- return str(t) + "," + ",".join([str(v) for v in counts[1:]]) + "," + str(fails) + "," + str(missing)
-
- # Print each row
- for t in range(0, num_trials):
- print(_metrics_for_trial(t))
-
-
-###############################################################################
-if __name__ == "__main__":
- script_path = os.path.realpath(__file__)
- script_name = os.path.basename(script_path)
- script_dir = os.path.dirname(script_path)
-
- parser = argparse.ArgumentParser(
- description=f"""
-{script_name} will compute metrics on the collated results of the HumanEval scenarios. Use collate_human_eval.py to prepare input to this script.
-
-The output will be formatted as a CSV with the following schema:
-
-Trial, cumulative_passes_by_turn_1, ..., cumulative_passes_by_turn_N, fails, missing
-0, x_01, ..., x_0N, y_0, z_0
-1, x_11, ..., x_1N, y_1, z_1
-...
-M, x_M1, ..., x_MN, y_M, z_M
-
-Where:
-
- x_ij is the number of HumanEval problems in Trial i that achieved a passing result by conversation turn j.
- y_i is the number of HumanEval problems in Trial i that never achieved a passing result (they failed).
- z_i is the number of HumanEval problems in Trial i that have missing data.
-
-""".strip(),
- formatter_class=argparse.RawTextHelpFormatter,
- )
-
- parser.add_argument(
- "scenario",
- nargs="?",
- help="Path to collated results. If '-' or omitted, read from stdin. (default: '-')",
- default="-",
- )
- args = parser.parse_args()
-
- if args.scenario == "" or args.scenario == "-":
- metrics(sys.stdin)
- else:
- with open(args.scenario, "rt") as fh:
- metrics(fh)
diff --git a/samples/tools/testbed/utils/prepare_autogpt.py b/samples/tools/testbed/utils/prepare_autogpt.py
deleted file mode 100644
index 9da27973545..00000000000
--- a/samples/tools/testbed/utils/prepare_autogpt.py
+++ /dev/null
@@ -1,102 +0,0 @@
-import base64
-import glob
-import json
-import os
-import shutil
-
-current_file_dir = os.path.dirname(os.path.abspath(__file__))
-challenge_path = os.path.join(os.path.dirname(current_file_dir), "scenarios/AutoGPT/challenges")
-data_paths = glob.glob(str(challenge_path) + "/*/data.json")
-
-for data_path in data_paths:
- print("Converting data path: ", data_path)
- workspace = os.path.dirname(data_path)
-
- with open(data_path, "r") as f:
- data = json.load(f)
-
- should_contain = data["ground"].get("should_contain", [])
- should_not_contain = data["ground"].get("should_not_contain", [])
- case_sensitive = data["ground"].get("case_sensitive", False)
-
- # The 'should_contain' field may contain escape characters, which causes problems when round-tripping through str() and eval(), so the entries are base64-encoded to avoid these issues.
- should_contain_base64 = []
- for word in should_contain:
- encoded_word = base64.b64encode(word.encode("utf-8")).decode("utf-8")
- should_contain_base64.append(encoded_word)
-
- should_not_contain_base64 = []
- for word in should_not_contain:
- encoded_word = base64.b64encode(word.encode("utf-8")).decode("utf-8")
- should_not_contain_base64.append(encoded_word)
-
- # copy all the files needed to 'coding' directory
- # 1. 'artifacts_in' directory: all the files needed for QA
- save_path = os.path.join(os.path.dirname(current_file_dir), "scenarios/AutoGPT")
- artifacts_in = False
- if os.path.exists(os.path.join(workspace, "artifacts_in")):
- artifacts_in = True
- target_folder = os.path.join(save_path, "Templates/TwoAgents/coding/file", data["name"])
- if os.path.exists(target_folder):
- shutil.rmtree(target_folder)
- shutil.copytree(os.path.join(workspace, "artifacts_in"), target_folder)
- # print(f"All the artifacts are copied from {os.path.join(workspace, 'artifacts_in')} to {target_folder}")
-
- # 2. 'custom_python' directory: all the files needed for testing python code
- if os.path.exists(os.path.join(workspace, "custom_python")):
- target_folder = os.path.join(save_path, "Templates/TwoAgents/custom_python")
- if not os.path.exists(target_folder):
- os.makedirs(target_folder)
- for filename in os.listdir(os.path.join(workspace, "custom_python")):
- shutil.copy(os.path.join(workspace, "custom_python", filename), os.path.join(target_folder, filename))
- # print(f"File copied from {os.path.join(workspace, 'custom_python', filename)} to {target_folder}")
-
- record = {
- "id": data["name"],
- "template": "Templates/TwoAgents",
- "substitutions": {
- "scenario.py": {
- "__MODEL__": "gpt-35-turbo-16k",
- "__TASK__": data["task"],
- "__TARGET_FOLDER__": f"file/{data['name']}" if artifacts_in else "",
- },
- "check.py": {
- "__FILE_PATTERN__": data["ground"]["files"][0],
- "__EVAL_TYPE__": data["ground"]["eval"]["type"],
- "__CASE_SENSITIVE__": str(case_sensitive),
- },
- "should_contain.txt": {
- "__CONTAIN__": str(should_contain_base64),
- },
- "should_not_contain.txt": {
- "__NO_CONTAIN__": str(should_not_contain_base64),
- },
- },
- }
- with open(os.path.join(save_path, "autogpt_twoagent_gpt35.jsonl"), "a") as f:
- f.write(json.dumps(record).strip() + "\n")
-
- record = {
- "id": data["name"],
- "template": "Templates/TwoAgents",
- "substitutions": {
- "scenario.py": {
- "__MODEL__": "gpt-4-1106-preview",
- "__TASK__": data["task"],
- "__TARGET_FOLDER__": f"file/{data['name']}" if artifacts_in else "",
- },
- "check.py": {
- "__FILE_PATTERN__": data["ground"]["files"][0],
- "__EVAL_TYPE__": data["ground"]["eval"]["type"],
- "__CASE_SENSITIVE__": str(case_sensitive),
- },
- "should_contain.txt": {
- "__CONTAIN__": str(should_contain_base64),
- },
- "should_not_contain.txt": {
- "__NO_CONTAIN__": str(should_not_contain_base64),
- },
- },
- }
- with open(os.path.join(save_path, "autogpt_twoagent_gpt4.jsonl"), "a") as f:
- f.write(json.dumps(record).strip() + "\n")
diff --git a/website/blog/2024-01-25-AutoGenBench/img/teaser.jpg b/website/blog/2024-01-25-AutoGenBench/img/teaser.jpg
new file mode 100755
index 00000000000..00571529bad
Binary files /dev/null and b/website/blog/2024-01-25-AutoGenBench/img/teaser.jpg differ
diff --git a/website/blog/2024-01-25-AutoGenBench/index.mdx b/website/blog/2024-01-25-AutoGenBench/index.mdx
new file mode 100644
index 00000000000..a1f34efeb28
--- /dev/null
+++ b/website/blog/2024-01-25-AutoGenBench/index.mdx
@@ -0,0 +1,131 @@
+---
+title: "AutoGenBench -- A Tool for Measuring and Evaluating AutoGen Agents"
+authors:
+ - afourney
+ - qingyunwu
+tags: [AutoGen]
+---
+
+![AutoGenBench](img/teaser.jpg)
+AutoGenBench is a standalone tool for evaluating AutoGen agents and workflows on common benchmarks.
+
+
+## TLDR
+Today we are releasing AutoGenBench – a tool for evaluating AutoGen agents and workflows on established LLM and agentic benchmarks.
+
+AutoGenBench is a standalone command line tool, installable from PyPI, which handles downloading, configuring, running, and reporting results for the supported benchmarks. AutoGenBench works best when run alongside Docker, since it uses Docker to isolate tests from one another.
+
+* See the [AutoGenBench README](https://github.com/microsoft/autogen/blob/main/samples/tools/testbed/README.md) for information on installation and running benchmarks.
+* See the [AutoGenBench CONTRIBUTING guide](https://github.com/microsoft/autogen/blob/main/samples/tools/testbed/CONTRIBUTING.md) for information on developing or contributing benchmark datasets.
+
+
+### Quick Start
+Get started quickly by running the following commands in a bash terminal.
+
+*Note:* You may need to adjust the path to the `OAI_CONFIG_LIST`, as appropriate.
+```sh
+export OAI_CONFIG_LIST=$(cat ./OAI_CONFIG_LIST)
+pip install autogenbench
+autogenbench clone HumanEval
+cd HumanEval
+cat README.md
+autogenbench run --subsample 0.1 --repeat 3 Tasks/human_eval_two_agents.jsonl
+autogenbench tabulate Results/human_eval_two_agents
+```
+
+## Introduction
+Measurement and evaluation are core components of every major AI or ML research project. The same is true for AutoGen. To this end, today we are releasing AutoGenBench, a standalone command line tool that we have been using to guide the development of AutoGen. Conveniently, AutoGenBench handles downloading, configuring, running, and reporting the results of agents on various public benchmark datasets. In addition to reporting top-line numbers, each AutoGenBench run produces a comprehensive set of logs and telemetry that can be used for debugging, profiling, computing custom metrics, and as input to [AgentEval](https://microsoft.github.io/autogen/blog/2023/11/20/AgentEval). In the remainder of this blog post, we outline the core design principles of AutoGenBench (key to understanding its operation); present a guide to installing and running AutoGenBench; outline a roadmap for evaluation; and conclude with an open call for contributions.
+
+## Design Principles
+AutoGenBench is built around three core design principles. Knowing these principles will help you understand the tool, its operation, and its output. These principles are:
+- **Repetition:** LLMs are stochastic, and in many cases, so too is the code they write to solve problems. For example, a Python script might call an external search engine, and the results may vary run-to-run. This can lead to variance in agent performance. Repetition is key to measuring and understanding this variance. To this end, AutoGenBench is built from the ground up with an understanding that tasks may be run multiple times, and that variance is a metric we often want to measure.
+
+- **Isolation:** Agents interact with their worlds in both subtle and overt ways. For example, an agent may install a Python library or write a file to disk. This can lead to ordering effects that can impact future measurements. Consider, for example, comparing two agents on a common benchmark. One agent may appear more efficient than the other simply because it ran second and benefited from the hard work the first agent did in installing and debugging the necessary Python libraries. To address this, AutoGenBench isolates each task in its own Docker container. This ensures that all runs start with the same initial conditions. (Docker is also a *much safer way to run agent-produced code*, in general.)
+
+- **Instrumentation:** While top-line metrics are great for comparing agents or models, we often want much more information about how the agents are performing, where they are getting stuck, and how they can be improved. We may also later think of new research questions that require computing a different set of metrics. To this end, AutoGenBench is designed to log everything, and to compute metrics from those logs. This ensures that one can always go back to the logs to answer questions about what happened, run profiling software, or feed the logs into tools like [AgentEval](https://microsoft.github.io/autogen/blog/2023/11/20/AgentEval). A minimal example of mining these logs directly is sketched below.
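+
+As a toy illustration of this log-driven approach, a simple custom metric can be recovered straight from the saved console logs. The sketch below is illustrative only; it assumes the per-task `console_log.txt` layout and the `ALL TESTS PASSED` marker emitted by the HumanEval scenarios, and your layout and markers may differ.
+
+```sh
+# Count how many task logs under a results folder contain the HumanEval success marker.
+grep -rl "ALL TESTS PASSED" Results/human_eval_two_agents --include="console_log.txt" | wc -l
+```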
+
+## Installing and Running AutoGenBench
+As noted above, isolation is a key design principle, so AutoGenBench must be run in an environment where Docker is available (Docker Desktop or Docker Engine). **It will not run in GitHub Codespaces**, unless you opt for native execution (which is strongly discouraged). To install Docker Desktop, see [https://www.docker.com/products/docker-desktop/](https://www.docker.com/products/docker-desktop/).
+Once Docker is installed, AutoGenBench can be installed as a standalone tool from PyPI using `pip`:
+
+```sh
+pip install autogenbench
+```
+After installation, you must configure your API keys. As with other AutoGen applications, AutoGenBench will look for the OpenAI keys in the `OAI_CONFIG_LIST` file in the current working directory, or in the `OAI_CONFIG_LIST` environment variable. This behavior can be overridden using a command-line parameter.
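+
+An `OAI_CONFIG_LIST` is a JSON list of model configurations; a minimal, illustrative example (with a placeholder key that you would replace with your own) might look like the following:
+
+```sh
+cat > OAI_CONFIG_LIST <<'EOF'
+[
+    {
+        "model": "gpt-4",
+        "api_key": "sk-..."
+    }
+]
+EOF
+```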
+
+If you will be running multiple benchmarks, it is often most convenient to leverage the environment variable option. You can load your keys into the environment variable by executing:
+
+```sh
+export OAI_CONFIG_LIST=$(cat ./OAI_CONFIG_LIST)
+```
+## A Typical Session
+Once AutoGenBench is installed and the necessary keys are configured, a typical session will look as follows:
+
+```sh
+autogenbench clone HumanEval
+cd HumanEval
+cat README.md
+autogenbench run --subsample 0.1 --repeat 3 Tasks/human_eval_two_agents.jsonl
+autogenbench tabulate Results/human_eval_two_agents
+```
+
+Where:
+- `autogenbench clone HumanEval` downloads and expands the HumanEval benchmark scenario.
+- `cd HumanEval; cat README.md` navigates to the benchmark directory and prints the README (which you should always read!).
+- `autogenbench run --subsample 0.1 --repeat 3 Tasks/human_eval_two_agents.jsonl`
+ runs a 10% subsample of the tasks defined in `Tasks/human_eval_two_agents.jsonl`. Each task is run 3 times.
+- `autogenbench tabulate Results/human_eval_two_agents` tabulates the results of the run.
+
+After running the above `tabulate` command, you should see output similar to the following:
+
+```
+ Trial 0 Trial 1 Trial 2
+Task Id Success Success Success
+------------- --------- --------- ---------
+HumanEval_107 False True True
+HumanEval_22 True True True
+HumanEval_43 True True True
+HumanEval_88 True True True
+HumanEval_14 True True True
+HumanEval_157 True True True
+HumanEval_141 True True True
+HumanEval_57 True True True
+HumanEval_154 True True True
+HumanEval_153 True True True
+HumanEval_93 False True False
+HumanEval_137 True True True
+HumanEval_143 True True True
+HumanEval_13 True True True
+HumanEval_49 True True True
+HumanEval_95 True True True
+------------- --------- --------- ---------
+Successes 14 16 15
+Failures 2 0 1
+Missing 0 0 0
+Total 16 16 16
+
+CAUTION: 'autogenbench tabulate' is in early preview.
+Please do not cite these values in academic work without first inspecting and verifying the results in the logs yourself.
+```
+
+From this output, we can see the results of the three separate repetitions of each task, along with the final summary statistics for each trial. In this case, the results were generated via GPT-4 (as defined in the `OAI_CONFIG_LIST` that was provided), and used the `TwoAgents` template. **It is important to remember that AutoGenBench evaluates *specific* end-to-end configurations of agents (as opposed to evaluating a model or cognitive framework more generally).**
+
+Finally, complete execution traces and logs can be found in the `Results` folder. See the [AutoGenBench README](https://github.com/microsoft/autogen/blob/main/samples/tools/testbed/README.md) for more details about command-line options and output formats. Each of these commands also offers extensive in-line help via:
+
+- `autogenbench --help`
+- `autogenbench clone --help`
+- `autogenbench run --help`
+- `autogenbench tabulate --help`
+
+
+## Roadmap
+Though we are announcing AutoGenBench today, it is very much an evolving project in its own right. Over the next few weeks and months, we hope to:
+- Onboard many additional benchmarks beyond those shipping today
+- Greatly improve logging and telemetry
+- Introduce new core metrics including total costs, task completion time, conversation turns, etc.
+- Provide tighter integration with AgentEval and AutoGen Studio
+
+For up-to-date tracking of our work items on this project, please see [AutoGenBench Work Items](https://github.com/microsoft/autogen/issues/973).
+
+## Call for Participation
+Finally, we want to end this blog post with an open call for contributions. AutoGenBench is still nascent and has much room for improvement. New benchmarks are constantly being published and will need to be added. Everyone may have their own distinct set of metrics that they care most about optimizing, and these metrics should be onboarded as well. To this end, we welcome any and all contributions to this corner of the AutoGen project. If contributing is something that interests you, please see the [contributor’s guide](https://github.com/microsoft/autogen/blob/main/samples/tools/testbed/CONTRIBUTING.md) and join our [Discord](https://discord.gg/pAbnFJrkgZ) discussion in the [#autogenbench](https://discord.com/channels/1153072414184452236/1199851779328847902) channel!