* Re-added completion logging when using older versions of autogen.
* Extended scenario definitions and templating to include folders.
* Prepared collate_human_eval.py for working with group chat scenarios.
* Converted HumanEval to the folder-based approach, and added GroupChat scenarios.
* Fixed the default termination message.
* Fixed another termination condition.
* Updated compatible autogen versions.
* Added initial support for the GAIA benchmark.
* Fixed a bug in executing the finalize scripts.
* Generalized the template further to support multiple folder copy operations.
* Refined GAIA support, and broke scenarios down by difficulty.
* Added some experimental scripts for computing metrics over GAIA. This is a first version, and will likely need refinement.
* Added instructions for cloning GAIA.
* Updated README to fix some typos.
* Added a script to format GAIA results for the leaderboard.
* Update samples/tools/testbed/scenarios/GAIA/Templates/BasicTwoAgents/scenario.py

Co-authored-by: LeoLjl <3110503618@qq.com>

---------

Co-authored-by: Qingyun Wu <qingyun.wu@psu.edu>
Co-authored-by: LeoLjl <3110503618@qq.com>
1 parent 73d7e92 · commit f8b4b42
Showing 7 changed files with 502 additions and 1 deletion.
samples/tools/testbed/scenarios/GAIA/Templates/BasicTwoAgents/expected_answer.txt
1 change: 1 addition & 0 deletions
@@ -0,0 +1 @@
__EXPECTED_ANSWER__
samples/tools/testbed/scenarios/GAIA/Templates/BasicTwoAgents/scenario.py
66 changes: 66 additions & 0 deletions
@@ -0,0 +1,66 @@
import os
import json
import autogen
from datetime import datetime
import testbed_utils

testbed_utils.init()
##############################


GAIA_SYSTEM_MESSAGE = (
    "You are a helpful AI assistant, and today's date is "
    + datetime.now().date().isoformat()
    + """.
I will ask you a question. Answer this question using your coding and language skills.
In the following cases, suggest python code (presented in a coding block beginning ```python) or shell script (presented in a coding block beginning ```sh) for the user to execute:
    1. When you need to collect info, use the code to output the info you need, for example, browse or search the web, download/read a file, print the content of a webpage or a file, check the operating system. After sufficient info is printed and the task is ready to be solved based on your language skill, you can solve the task by yourself.
    2. When you need to perform some task with code, use the code to perform the task and output the result. Finish the task smartly.
Answer the question step by step if you need to. If a plan is not provided, explain your plan first. Be clear which step uses code, and which step uses your language skill.
The user cannot provide any other feedback or perform any other action beyond executing the code appearing in the code block. The user can't modify your code, so do not suggest incomplete code which requires users to modify. Don't use a code block if it's not intended to be executed by the user. Don't include multiple code blocks in one response. Do not ask users to copy and paste code or results. Instead, use the 'print' function for the output when relevant. Check the execution result reported by the user.
If the result indicates there is an error, fix the error and output the code again. Suggest the full code instead of partial code or code changes. If the error can't be fixed or if the task is not solved even after the code is executed successfully, analyze the problem, revisit your assumption, collect additional info you need, and think of a different approach to try.
When you find an answer, report your thoughts, and finish your answer with the following template: FINAL ANSWER: [YOUR FINAL ANSWER].
YOUR FINAL ANSWER should be a number OR as few words as possible OR a comma separated list of numbers and/or strings.
If you are asked for a number, don't use comma to write your number neither use units such as $ or percent sign unless specified otherwise.
If you are asked for a string, don't use articles, neither abbreviations (e.g. for cities), and write the digits in plain text unless specified otherwise.
If you are asked for a comma separated list, apply the above rules depending of whether the element to be put in the list is a number or a string.
""".strip()
)


config_list = autogen.config_list_from_json(
    "OAI_CONFIG_LIST",
    filter_dict={"model": ["__MODEL__"]},
)

assistant = autogen.AssistantAgent(
    "assistant",
    system_message=GAIA_SYSTEM_MESSAGE,
    is_termination_msg=lambda x: x.get("content", "").rstrip().find("FINAL ANSWER") >= 0,
    llm_config=testbed_utils.default_llm_config(config_list, timeout=180),
)
user_proxy = autogen.UserProxyAgent(
    "user_proxy",
    human_input_mode="NEVER",
    is_termination_msg=lambda x: x.get("content", "").rstrip().find("FINAL ANSWER") >= 0,
    code_execution_config={
        "work_dir": "coding",
        "use_docker": False,
    },
    max_consecutive_auto_reply=10,
    default_auto_reply="",
)

filename = "__FILE_NAME__".strip()
question = """
__PROMPT__
""".strip()

if len(filename) > 0:
    question = f"Consider the file '{filename}', which can be read from the current working directory. {question}"

user_proxy.initiate_chat(assistant, message=question)


##############################
testbed_utils.finalize(agents=[assistant, user_proxy])
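The scenario above is a template: `__MODEL__`, `__FILE_NAME__`, and `__PROMPT__` here, and `__EXPECTED_ANSWER__` in expected_answer.txt, are placeholders that the testbed fills in when it expands each GAIA task into a runnable folder. The actual substitution logic lives in the testbed runner, which is not part of this diff, so the snippet below is only a minimal sketch of the idea; the example values are illustrative assumptions.

```python
# Minimal sketch of placeholder expansion; the real logic lives in the testbed
# runner (not shown in this diff). The values below are illustrative assumptions.
substitutions = {
    "__MODEL__": "gpt-4",            # model name used to filter OAI_CONFIG_LIST
    "__FILE_NAME__": "",             # optional attachment; empty when the task has none
    "__PROMPT__": "What is 2 + 2?",  # the GAIA question text (hypothetical example)
    "__EXPECTED_ANSWER__": "4",      # written into expected_answer.txt for scoring
}

for src, dst in [
    ("scenario.py", "scenario_expanded.py"),
    ("expected_answer.txt", "expected_answer_expanded.txt"),
]:
    with open(src, "rt") as fh:
        content = fh.read()
    for placeholder, value in substitutions.items():
        content = content.replace(placeholder, value)
    with open(dst, "wt") as fh:
        fh.write(content)
```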
@@ -0,0 +1,128 @@
import os
import json
import re
import sys
import argparse


def normalize_answer(a):
    # Lower case
    # Trim (left and right)
    # Replace multiple spaces with one space
    # Remove trailing punctuation
    return re.sub(r"[\.\!\?]+$", "", re.sub(r"\s+", " ", a.strip().lower()))


def collate(results_dir):
    """
    Collate the results of running GAIA.

    Args:
        results_dir (path): The folder where the results were saved.
    """

    all_results = list()
    max_instances = 0

    for test_id in os.listdir(results_dir):
        test_path = os.path.join(results_dir, test_id)

        # Collect the results vector
        results = [test_id]

        instance = 0
        instance_dir = os.path.join(test_path, str(instance))
        while os.path.isdir(instance_dir):
            expected_answer_file = os.path.join(instance_dir, "expected_answer.txt")
            if not os.path.isfile(expected_answer_file):
                # Expected answer is missing
                results.append("")

                instance += 1
                instance_dir = os.path.join(test_path, str(instance))
                continue

            expected_answer = "!!!NULL ANSWER!!!"
            with open(expected_answer_file, "rt") as fh:
                expected_answer = fh.read().strip()

            console_log_file = os.path.join(instance_dir, "console_log.txt")
            if not os.path.isfile(console_log_file):
                # Console log file missing
                results.append("")

                instance += 1
                instance_dir = os.path.join(test_path, str(instance))
                continue

            with open(console_log_file, "rt") as fh:
                console_log = fh.read()

            final_answer = ""
            m = re.search(r"FINAL ANSWER:(.*?)\n", console_log, re.DOTALL)
            if m:
                final_answer = m.group(1).strip()

            # print(f"Expected Answer: {expected_answer}\nAutogen Answer: {final_answer}\n")

            if normalize_answer(expected_answer) == normalize_answer(final_answer):
                results.append("1")
            else:
                results.append("-1")

            instance += 1
            instance_dir = os.path.join(test_path, str(instance))

        max_instances = max(max_instances, instance)

        # Buffer the results
        all_results.append(results)

    # Create a header
    header = "TestId"
    for i in range(0, max_instances):
        header += ",Trial" + str(i)
    print(header)

    # Print a fully-populated table of results
    for r in all_results:
        while len(r) < max_instances + 1:
            r.append("")
        print(",".join(r))


###############################################################################
if __name__ == "__main__":
    script_path = os.path.realpath(__file__)
    script_name = os.path.basename(script_path)
    script_dir = os.path.dirname(script_path)

    # Path to the default results directory
    # (relative to this script, up one directory, then into the results folder)
    default_results_dir = os.path.realpath(
        os.path.join(script_dir, os.path.pardir, "results", "gaia_validation_level_1__two_agents_gpt4")
    )

    parser = argparse.ArgumentParser(
        description=f"""
{script_name} will collate the results of the GAIA scenarios and output them to a CSV. The CSV format is as follows:

TestId, Trial0, Trial1, ..., TrialN
uuid_1, x_10,   x_11,   ..., x_1N
uuid_2, x_20,   x_21,   ..., x_2N
...
uuid_M, x_M0,   x_M1,   ..., x_MN

Where uuid_i is the identifier of the i-th test question, and x_ij is 1 or -1 depending on whether the test passed or failed, respectively. If data for the trial is missing (e.g., due to a runtime error), the value will be an empty string "".
""".strip(),
        formatter_class=argparse.RawTextHelpFormatter,
    )

    parser.add_argument(
        "scenario",
        nargs="?",
        help="Path to the scenario results. (default: " + default_results_dir + ")",
        default=default_results_dir,
    )
    args = parser.parse_args()
    collate(args.scenario)
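As the description string indicates, the script walks one folder per test id, with one numbered subfolder per trial, and prints a CSV of 1, -1, or empty cells. This excerpt does not show the script's filename, so the import below is hypothetical; the snippet is only a hedged usage sketch that builds a tiny fake results tree and collates it.

```python
# A hedged usage sketch. The module name is hypothetical (the script's filename
# is not shown in this excerpt); collate() is the function defined above.
import os

from collate_gaia_csv import collate  # hypothetical module name

root = "demo_results"
trial_dir = os.path.join(root, "example-task-uuid", "0")
os.makedirs(trial_dir, exist_ok=True)

# One trial with a matching expected answer and a FINAL ANSWER line in the log.
with open(os.path.join(trial_dir, "expected_answer.txt"), "wt") as fh:
    fh.write("4")
with open(os.path.join(trial_dir, "console_log.txt"), "wt") as fh:
    fh.write("assistant (to user_proxy):\nFINAL ANSWER: 4\n")

collate(root)
# Expected output:
#   TestId,Trial0
#   example-task-uuid,1
```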
@@ -0,0 +1,76 @@
import os
import json
import re
import sys
import argparse


def normalize_answer(a):
    # Trim (left and right)
    # Replace multiple spaces with one space
    # Remove trailing punctuation
    # Trim again
    return re.sub(r"[\.\!\?]+$", "", re.sub(r"\s+", " ", a.strip())).strip()


def collate(results_dir, instance=0):
    """
    Collate the results of running GAIA. Print the results in the format accepted by the leaderboard.

    Args:
        results_dir (path): The folder where the results were saved.
    """

    for test_id in os.listdir(results_dir):
        test_path = os.path.join(results_dir, test_id)

        instance_dir = os.path.join(test_path, str(instance))
        console_log_file = os.path.join(instance_dir, "console_log.txt")

        final_answer = ""
        console_log = ""  # default when the console log is missing
        if os.path.isfile(console_log_file):
            with open(console_log_file, "rt") as fh:
                console_log = fh.read()

            m = re.search(r"FINAL ANSWER:(.*?)\n", console_log, re.DOTALL)
            if m:
                final_answer = normalize_answer(m.group(1))

            # Clean up the GAIA logs so they don't have the Docker setup preamble
            m = re.search(r"^.*?\r?\n(user_proxy \(to assistant\).*$)", console_log, re.DOTALL)
            if m:
                console_log = m.group(1)

        print(json.dumps({"task_id": test_id, "model_answer": final_answer, "reasoning_trace": console_log}))


###############################################################################
if __name__ == "__main__":
    script_path = os.path.realpath(__file__)
    script_name = os.path.basename(script_path)
    script_dir = os.path.dirname(script_path)

    # Path to the default results directory
    # (relative to this script, up one directory, then into the results folder)
    default_results_dir = os.path.realpath(
        os.path.join(script_dir, os.path.pardir, "results", "gaia_validation_level_1__two_agents_gpt4")
    )

    parser = argparse.ArgumentParser(
        description=f"""
{script_name} will collate the results of the GAIA scenarios into the jsonl format that can be submitted to the GAIA leaderboard.

NOTE: You will likely need to concatenate results for level 1, level 2 and level 3 to form a complete submission.
""".strip(),
        formatter_class=argparse.RawTextHelpFormatter,
    )

    parser.add_argument(
        "scenario",
        nargs="?",
        help="Path to the scenario results. (default: " + default_results_dir + ")",
        default=default_results_dir,
    )
    args = parser.parse_args()
    collate(args.scenario)
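The NOTE in the description says a complete submission needs results from levels 1 through 3. Since the script writes JSONL to stdout, one way to build a single submission file is to redirect the output of collate() for each level into one file. The module name and the level 2/3 folder names below are assumptions by analogy with the level 1 default shown above; treat this as a hedged sketch, not part of the commit.

```python
# Hedged sketch: concatenate per-level JSONL output into one submission file.
# The import name and the level 2/3 folder names are assumptions (only the
# level 1 default appears in this diff).
import contextlib

from format_gaia_leaderboard import collate  # hypothetical module name

result_dirs = [
    "results/gaia_validation_level_1__two_agents_gpt4",
    "results/gaia_validation_level_2__two_agents_gpt4",  # assumed naming
    "results/gaia_validation_level_3__two_agents_gpt4",  # assumed naming
]

with open("submission.jsonl", "wt") as out, contextlib.redirect_stdout(out):
    for results_dir in result_dirs:
        collate(results_dir)
```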