Introduces AutoGenBench #1048
Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1048   +/-   ##
=======================================
  Coverage   32.48%   32.48%
=======================================
  Files          41       41
  Lines        4907     4907
  Branches     1120     1120
=======================================
  Hits         1594     1594
  Misses       3187     3187
  Partials      126      126
@afourney code formatting issues were fixed by running |
Thanks. I'll re-install pre-commit hooks on my end, and hope the issues don't come back. |
I'll try this out; it seems really interesting and useful.
I added a --branch switch to the clone command. Now you can do:
to clone from this branch. Otherwise it defaults to main. |
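For readers unfamiliar with this kind of switch, here is a minimal sketch of how an optional `--branch` argument that falls back to `main` could be declared with Python's `argparse`. This is an illustration only: the function name `build_clone_parser` and the help strings are assumptions, and the actual AutoGenBench code may be organized quite differently.

```python
import argparse


def build_clone_parser() -> argparse.ArgumentParser:
    # Hypothetical parser for illustration; not taken from the AutoGenBench source.
    parser = argparse.ArgumentParser(prog="autogenbench clone")
    parser.add_argument("scenario", help="Scenario to download, e.g. HumanEval")
    parser.add_argument(
        "--branch",
        default="main",  # used when the switch is omitted
        help="Branch to clone the scenario from",
    )
    return parser


# With the switch: the named branch is used.
args = build_clone_parser().parse_args(["--branch", "autogenbench", "HumanEval"])
print(args.scenario, args.branch)

# Without the switch: the default applies.
default_args = build_clone_parser().parse_args(["HumanEval"])
print(default_args.scenario, default_args.branch)
```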
@ekzhu @qingyun-wu @victordibia @sonichi @rickyloynd-microsoft @julianakiseleva Folks, I'm trying to get more eyeballs on this so that we can get this merged this week. Thanks @gagb for the earlier review. If testing, be sure to use the "--branch autogenbench" parameter with the clone command since the files don't yet exist in main. |
Thanks, @afourney, for the tremendous effort and fantastic work! This might be minor (or not): folder names should all use lowercase and underscores following Python convention. For special terms e.g., benchmark names such as AutoGPT, GAIA, it may make sense to keep them as is, but other general names such as folders |
Yeah, that's unfortunately both minor (in terms of complexity), and also a lot of work to fix (in terms of files to change, and testing to do). Can this be handled in a PR after? Minimally, I will have to change all the manifest files. All the init_tasks scripts. All the tabulation scripts. The documentation. And the templates. |
I am fine merging this PR.
* Initial commit of AutoGenBench
* wording
* typo
* pre-commit reformulation
* Updated README to point to contributor's guide earlier.
* Simplified the description of the JSON format.
* Added print statements to indicate when run.sh and scenario.py are starting.
* Added SocietyOfMind scenario to GAIA.
* Pointing autogenbench clone command to the latest branch.
* Temporarily disable subsample option.
* Updated the GAIA readme to specify how to define a BING API key.
* Fixed and re-enabled the subsample option.
* Added a draft of a blog post.
* Updated authors.
* Incorporating Gagan's feedback.
* Fixed code formatting.
* Updated the help string in the docs.
* Light editing of the AutoGenBench blogpost.
* Support filtering on model tags.
* Added websurfer dependencies to Dockerfile.
* Renamed testbed -> autogenbench
* Attempting to fix formatting.
* Added more graceful handling of task timeouts (the script is allowed to terminate before Docker is stopped).
* Updated the blogpost based on Saleema's and Julia's feedback.
* Fixed formatting... again.
* Added a main MANIFEST to list available scenarios.
* Limit main manifest to directories.
* Manifests now use relative paths.
* All manifests are now relative.
* Updated the contributing guide, and addressed Windows path issues.
* Updated the version. Fixed formatting.
* Fixed formatting.
* De-listing Examples, since it has no clear tabulate criteria.
* Updated email in pyproject
* typo in blogpost
* wording
---------
Co-authored-by: Qingyun Wu <qingyun.wu@psu.edu>
Co-authored-by: Qingyun Wu <qingyun0327@gmail.com>
* Blogpost for adaptation in HumanEval * doc * fix link * fix link * explain * model * interface * link * typo * doc
Why are these changes needed?
This PR introduces AutoGenBench, a tool for running common benchmarks and other templated tests with the AutoGen framework. It replaces the former "testbed" tool.
For full details, see the AutoGenBench README.md
TL;DR:
It is a (pip) installable module that handles benchmarking and evaluation tasks. An example session might resemble the following:
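As a rough sketch, the three commands explained under "Where:" can be assembled and run in order from Python. The helper names `session_commands` and `run_session` and the dry-run default are assumptions for illustration and are not part of AutoGenBench; only the command strings come from this description. The sketch does not handle working directories, so the relative `Tasks/` and `results/` paths are passed through unchanged.

```python
import shlex
import subprocess


def session_commands(
    scenario="HumanEval",
    tasks="Tasks/r_human_eval_two_agents.jsonl",
    results="results/r_human_eval_two_agents",
    branch=None,
):
    """Build the clone, run, and tabulate commands as argument lists."""
    clone = ["autogenbench", "clone"]
    if branch:
        # Only needed while the files are not yet on the main branch.
        clone += ["--branch", branch]
    clone.append(scenario)
    return [
        clone,
        ["autogenbench", "run", tasks],
        ["autogenbench", "tabulate", results],
    ]


def run_session(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print("$ " + shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)  # stop at the first failing step


commands = session_commands(branch="autogenbench")
run_session(commands)  # dry run: prints the commands without executing them
```

Passing `dry_run=False` would execute the commands, which requires the `autogenbench` executable to be installed and on the PATH.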
Where:
autogenbench clone --branch autogenbench HumanEval
downloads and expands the HumanEval benchmark scenario (from the autogenbench branch).
autogenbench run Tasks/r_human_eval_two_agents.jsonl
runs the tasks defined in Tasks/r_human_eval_two_agents.jsonl
autogenbench tabulate results/r_human_eval_two_agents
tabulates the results of the run
Note: You must include --branch autogenbench in the clone command when testing; otherwise it will look to the main branch and not find anything.
Related issue number
Closes #995, #987, #996
Supersedes #997
Checks