* Re-added completion logging when using older versions of autogen.
* Extended scenario definitions and templating to include folders.
* Prepared collate_human_eval.py for working with group chat scenarios.
* Converted HumanEval to the folder-based approach, and added GroupChat scenarios.
* Fixed the default termination message.
* Fixed another termination condition.
* Updated compatible autogen versions.
* Added initial support for the GAIA benchmark.
* Fixed a bug in executing the finalize scripts.
* Generalized the template further to support multiple folder copy operations.
* Refined GAIA support, and broke scenarios down by difficulty.
* Added some experimental scripts for computing metrics over GAIA. This is a first version, and will likely need refinement.
* Added instructions for cloning GAIA.
* Updated README to fix some typos.
* Added a script to format GAIA results for the leaderboard.
* Update samples/tools/testbed/scenarios/GAIA/Templates/BasicTwoAgents/scenario.py

Co-authored-by: LeoLjl <3110503618@qq.com>

---------

Co-authored-by: Qingyun Wu <qingyun.wu@psu.edu>
Co-authored-by: LeoLjl <3110503618@qq.com>
1 parent 73d7e92 · commit f8b4b42
Showing 7 changed files with 502 additions and 1 deletion.
samples/tools/testbed/scenarios/GAIA/Templates/BasicTwoAgents/expected_answer.txt
1 change: 1 addition & 0 deletions
@@ -0,0 +1 @@
__EXPECTED_ANSWER__
samples/tools/testbed/scenarios/GAIA/Templates/BasicTwoAgents/scenario.py
66 changes: 66 additions & 0 deletions
@@ -0,0 +1,66 @@
import os
import json
import autogen
from datetime import datetime
import testbed_utils

testbed_utils.init()
##############################


GAIA_SYSTEM_MESSAGE = (
    "You are a helpful AI assistant, and today's date is "
    + datetime.now().date().isoformat()
    + """.
I will ask you a question. Answer this question using your coding and language skills.
In the following cases, suggest python code (presented in a coding block beginning ```python) or shell script (presented in a coding block beginning ```sh) for the user to execute:
    1. When you need to collect info, use the code to output the info you need, for example, browse or search the web, download/read a file, print the content of a webpage or a file, check the operating system. After sufficient info is printed and the task is ready to be solved based on your language skill, you can solve the task by yourself.
    2. When you need to perform some task with code, use the code to perform the task and output the result. Finish the task smartly.
Answer the question step by step if you need to. If a plan is not provided, explain your plan first. Be clear which step uses code, and which step uses your language skill.
The user cannot provide any other feedback or perform any other action beyond executing the code appearing in the code block. The user can't modify your code, so do not suggest incomplete code which requires users to modify. Don't use a code block if it's not intended to be executed by the user. Don't include multiple code blocks in one response. Do not ask users to copy and paste code or results. Instead, use the 'print' function for the output when relevant. Check the execution result reported by the user.
If the result indicates there is an error, fix the error and output the code again. Suggest the full code instead of partial code or code changes. If the error can't be fixed or if the task is not solved even after the code is executed successfully, analyze the problem, revisit your assumption, collect additional info you need, and think of a different approach to try.
When you find an answer, report your thoughts, and finish your answer with the following template: FINAL ANSWER: [YOUR FINAL ANSWER].
YOUR FINAL ANSWER should be a number OR as few words as possible OR a comma separated list of numbers and/or strings.
If you are asked for a number, don't use comma to write your number neither use units such as $ or percent sign unless specified otherwise.
If you are asked for a string, don't use articles, neither abbreviations (e.g. for cities), and write the digits in plain text unless specified otherwise.
If you are asked for a comma separated list, apply the above rules depending of whether the element to be put in the list is a number or a string.
""".strip()
)


config_list = autogen.config_list_from_json(
    "OAI_CONFIG_LIST",
    filter_dict={"model": ["__MODEL__"]},
)

assistant = autogen.AssistantAgent(
    "assistant",
    system_message=GAIA_SYSTEM_MESSAGE,
    is_termination_msg=lambda x: x.get("content", "").rstrip().find("FINAL ANSWER") >= 0,
    llm_config=testbed_utils.default_llm_config(config_list, timeout=180),
)
user_proxy = autogen.UserProxyAgent(
    "user_proxy",
    human_input_mode="NEVER",
    is_termination_msg=lambda x: x.get("content", "").rstrip().find("FINAL ANSWER") >= 0,
    code_execution_config={
        "work_dir": "coding",
        "use_docker": False,
    },
    max_consecutive_auto_reply=10,
    default_auto_reply="",
)

filename = "__FILE_NAME__".strip()
question = """
__PROMPT__
""".strip()

if len(filename) > 0:
    question = f"Consider the file '{filename}', which can be read from the current working directory. {question}"

user_proxy.initiate_chat(assistant, message=question)


##############################
testbed_utils.finalize(agents=[assistant, user_proxy])
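The scenario above is a template: `__MODEL__`, `__FILE_NAME__`, and `__PROMPT__` here, and `__EXPECTED_ANSWER__` in expected_answer.txt, are placeholders that the testbed fills in when it expands each GAIA task into a runnable folder. The actual substitution logic lives in the testbed runner, which is not part of this diff, so the snippet below is only a minimal sketch of the idea; the example values are illustrative assumptions.

```python
# Minimal sketch of placeholder expansion; the real logic lives in the testbed
# runner (not shown in this diff). The values below are illustrative assumptions.
substitutions = {
    "__MODEL__": "gpt-4",            # model name used to filter OAI_CONFIG_LIST
    "__FILE_NAME__": "",             # optional attachment; empty when the task has none
    "__PROMPT__": "What is 2 + 2?",  # the GAIA question text (hypothetical example)
    "__EXPECTED_ANSWER__": "4",      # written into expected_answer.txt for scoring
}

for src, dst in [
    ("scenario.py", "scenario_expanded.py"),
    ("expected_answer.txt", "expected_answer_expanded.txt"),
]:
    with open(src, "rt") as fh:
        content = fh.read()
    for placeholder, value in substitutions.items():
        content = content.replace(placeholder, value)
    with open(dst, "wt") as fh:
        fh.write(content)
```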
@@ -0,0 +1,128 @@
import os
import json
import re
import sys
import argparse


def normalize_answer(a):
    # Lower case
    # Trim (left and right)
    # Replace multiple spaces with one space
    # Remove trailing punctuation
    return re.sub(r"[\.\!\?]+$", "", re.sub(r"\s+", " ", a.strip().lower()))


def collate(results_dir):
    """
    Collate the results of running GAIA.

    Args:
        results_dir (path): The folder where the results were saved.
    """

    all_results = list()
    max_instances = 0

    for test_id in os.listdir(results_dir):
        test_path = os.path.join(results_dir, test_id)

        # Collect the results vector
        results = [test_id]

        instance = 0
        instance_dir = os.path.join(test_path, str(instance))
        while os.path.isdir(instance_dir):
            expected_answer_file = os.path.join(instance_dir, "expected_answer.txt")
            if not os.path.isfile(expected_answer_file):
                # Expected answer is missing
                results.append("")

                instance += 1
                instance_dir = os.path.join(test_path, str(instance))
                continue

            expected_answer = "!!!NULL ANSWER!!!"
            with open(expected_answer_file, "rt") as fh:
                expected_answer = fh.read().strip()

            console_log_file = os.path.join(instance_dir, "console_log.txt")
            if not os.path.isfile(console_log_file):
                # Console log file missing
                results.append("")

                instance += 1
                instance_dir = os.path.join(test_path, str(instance))
                continue

            with open(console_log_file, "rt") as fh:
                console_log = fh.read()

            final_answer = ""
            m = re.search(r"FINAL ANSWER:(.*?)\n", console_log, re.DOTALL)
            if m:
                final_answer = m.group(1).strip()

            # print(f"Expected Answer: {expected_answer}\nAutogen Answer: {final_answer}\n")

            if normalize_answer(expected_answer) == normalize_answer(final_answer):
                results.append("1")
            else:
                results.append("-1")

            instance += 1
            instance_dir = os.path.join(test_path, str(instance))

        max_instances = max(max_instances, instance)

        # Buffer the results
        all_results.append(results)

    # Create a header
    header = "TestId"
    for i in range(0, max_instances):
        header += ",Trial" + str(i)
    print(header)

    # Print a fully-populated table of results
    for r in all_results:
        while len(r) < max_instances + 1:
            r.append("")
        print(",".join(r))


###############################################################################
if __name__ == "__main__":
    script_path = os.path.realpath(__file__)
    script_name = os.path.basename(script_path)
    script_dir = os.path.dirname(script_path)

    # Path to the default results directory
    # (relative to this script, up one directory, then into the results folder)
    default_results_dir = os.path.realpath(
        os.path.join(script_dir, os.path.pardir, "results", "gaia_validation_level_1__two_agents_gpt4")
    )

    parser = argparse.ArgumentParser(
        description=f"""
{script_name} will collate the results of the GAIA scenarios and output them to a CSV. The CSV format is as follows:

TestId, Trial0, Trial1, ..., TrialN
uuid_1, x_10,   x_11,   ..., x_1N
uuid_2, x_20,   x_21,   ..., x_2N
...
uuid_M, x_M0,   x_M1,   ..., x_MN

Where uuid_i is the identifier of the i-th test question, and x_ij is 1 or -1 depending on whether the test passed or failed, respectively. If data for the trial is missing (e.g., due to a runtime error), the value will be an empty string "".
""".strip(),
        formatter_class=argparse.RawTextHelpFormatter,
    )

    parser.add_argument(
        "scenario",
        nargs="?",
        help="Path to the scenario results. (default: " + default_results_dir + ")",
        default=default_results_dir,
    )
    args = parser.parse_args()
    collate(args.scenario)
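As the description string indicates, the script walks one folder per test id, with one numbered subfolder per trial, and prints a CSV of 1, -1, or empty cells. This excerpt does not show the script's filename, so the import below is hypothetical; the snippet is only a hedged usage sketch that builds a tiny fake results tree and collates it.

```python
# A hedged usage sketch. The module name is hypothetical (the script's filename
# is not shown in this excerpt); collate() is the function defined above.
import os

from collate_gaia_csv import collate  # hypothetical module name

root = "demo_results"
trial_dir = os.path.join(root, "example-task-uuid", "0")
os.makedirs(trial_dir, exist_ok=True)

# One trial with a matching expected answer and a FINAL ANSWER line in the log.
with open(os.path.join(trial_dir, "expected_answer.txt"), "wt") as fh:
    fh.write("4")
with open(os.path.join(trial_dir, "console_log.txt"), "wt") as fh:
    fh.write("assistant (to user_proxy):\nFINAL ANSWER: 4\n")

collate(root)
# Expected output:
#   TestId,Trial0
#   example-task-uuid,1
```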
@@ -0,0 +1,76 @@
import os
import json
import re
import sys
import argparse


def normalize_answer(a):
    # Trim (left and right)
    # Replace multiple spaces with one space
    # Remove trailing punctuation
    # Trim again
    return re.sub(r"[\.\!\?]+$", "", re.sub(r"\s+", " ", a.strip())).strip()


def collate(results_dir, instance=0):
    """
    Collate the results of running GAIA. Print the results in the format accepted by the leaderboard.

    Args:
        results_dir (path): The folder where the results were saved.
    """

    for test_id in os.listdir(results_dir):
        test_path = os.path.join(results_dir, test_id)

        instance_dir = os.path.join(test_path, str(instance))
        console_log_file = os.path.join(instance_dir, "console_log.txt")

        final_answer = ""
        console_log = ""  # default when the console log is missing
        if os.path.isfile(console_log_file):
            with open(console_log_file, "rt") as fh:
                console_log = fh.read()

            m = re.search(r"FINAL ANSWER:(.*?)\n", console_log, re.DOTALL)
            if m:
                final_answer = normalize_answer(m.group(1))

            # Clean up the GAIA logs so they don't have the Docker setup preamble
            m = re.search(r"^.*?\r?\n(user_proxy \(to assistant\).*$)", console_log, re.DOTALL)
            if m:
                console_log = m.group(1)

        print(json.dumps({"task_id": test_id, "model_answer": final_answer, "reasoning_trace": console_log}))


###############################################################################
if __name__ == "__main__":
    script_path = os.path.realpath(__file__)
    script_name = os.path.basename(script_path)
    script_dir = os.path.dirname(script_path)

    # Path to the default results directory
    # (relative to this script, up one directory, then into the results folder)
    default_results_dir = os.path.realpath(
        os.path.join(script_dir, os.path.pardir, "results", "gaia_validation_level_1__two_agents_gpt4")
    )

    parser = argparse.ArgumentParser(
        description=f"""
{script_name} will collate the results of the GAIA scenarios into the jsonl format that can be submitted to the GAIA leaderboard.

NOTE: You will likely need to concatenate results for level 1, level 2 and level 3 to form a complete submission.
""".strip(),
        formatter_class=argparse.RawTextHelpFormatter,
    )

    parser.add_argument(
        "scenario",
        nargs="?",
        help="Path to the scenario results. (default: " + default_results_dir + ")",
        default=default_results_dir,
    )
    args = parser.parse_args()
    collate(args.scenario)
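The NOTE in the description says a complete submission needs results from levels 1 through 3. Since the script writes JSONL to stdout, one way to build a single submission file is to redirect the output of collate() for each level into one file. The module name and the level 2/3 folder names below are assumptions by analogy with the level 1 default shown above; treat this as a hedged sketch, not part of the commit.

```python
# Hedged sketch: concatenate per-level JSONL output into one submission file.
# The import name and the level 2/3 folder names are assumptions (only the
# level 1 default appears in this diff).
import contextlib

from format_gaia_leaderboard import collate  # hypothetical module name

result_dirs = [
    "results/gaia_validation_level_1__two_agents_gpt4",
    "results/gaia_validation_level_2__two_agents_gpt4",  # assumed naming
    "results/gaia_validation_level_3__two_agents_gpt4",  # assumed naming
]

with open("submission.jsonl", "wt") as out, contextlib.redirect_stdout(out):
    for results_dir in result_dirs:
        collate(results_dir)
```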