
Introduces AutoGenBench #1048

Merged · 48 commits merged into main from autogenbench · Jan 26, 2024

Conversation

afourney
Member

@afourney afourney commented Dec 22, 2023

Why are these changes needed?

This PR introduces AutoGenBench -- a tool for running common benchmarks, and other templated tests, with the AutoGen framework. It replaces the former "testbed" tool.

For full details, see the AutoGenBench README.md

TL;DR:
It is a pip-installable module that handles benchmarking and evaluation tasks. An example session might resemble the following:

autogenbench clone --branch autogenbench HumanEval
cd HumanEval
autogenbench run Tasks/r_human_eval_two_agents.jsonl
autogenbench tabulate results/r_human_eval_two_agents

Where:

  • autogenbench clone --branch autogenbench HumanEval downloads and expands the HumanEval benchmark scenario (from the autogenbench branch here).
  • autogenbench run Tasks/r_human_eval_two_agents.jsonl runs the tasks defined in Tasks/r_human_eval_two_agents.jsonl.
  • autogenbench tabulate results/r_human_eval_two_agents tabulates the results of the run.
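To give a feel for the task-file format, here is a hypothetical illustration: each line of the .jsonl file is one self-contained JSON task record. The key names below are illustrative placeholders, not the actual schema -- the real field names are documented in the AutoGenBench README.

```python
import json

# One line of a hypothetical task file. Treat these keys as placeholders;
# the actual schema is defined by the AutoGenBench README and templates.
task_line = '{"id": "r_human_eval_0", "template": "Templates/TwoAgents"}'

task = json.loads(task_line)
print(task["id"])  # each line of the .jsonl file is one self-contained task
```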

Note: You must include --branch autogenbench in the clone command when testing; otherwise it will look to the main branch and not find anything.

Related issue number

Closes #995 , #987 , #996

Supersedes #997

Checks

@afourney afourney added evaluation Issues related to evaluating the framework autogenbench Issues related to AutoGenBench. labels Dec 22, 2023
@afourney afourney mentioned this pull request Dec 22, 2023
3 tasks
@codecov-commenter

codecov-commenter commented Dec 22, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison: base (bcfd770) 32.48% vs. head (6089571) 32.48%.
Report is 1 commit behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1048   +/-   ##
=======================================
  Coverage   32.48%   32.48%           
=======================================
  Files          41       41           
  Lines        4907     4907           
  Branches     1120     1120           
=======================================
  Hits         1594     1594           
  Misses       3187     3187           
  Partials      126      126           
Flag Coverage Δ
unittests 32.44% <ø> (ø)


@qingyun-wu
Contributor

@afourney code formatting issues were fixed by running pre-commit run --all-files on my side. I also fixed some minor wording issues.

@afourney
Member Author

> @afourney code formatting issues were fixed by running pre-commit run --all-files on my side. I also fixed some minor wording issues.

Thanks. I'll re-install pre-commit hooks on my end, and hope the issues don't come back.

@Josephrp

I'll try this out; it seems really interesting and useful.

@afourney afourney marked this pull request as ready for review January 2, 2024 23:44
@afourney
Member Author

NOTE: Once this is merged, and before deleting the "autogenbench" branch, we will need to follow up with another super-quick PR to point "autogenbench clone" to the main branch. I can't do that now because the files don't exist in that branch yet, so manual testing would break. I suspect this will be a one-time issue.

> Can you fix this in the code? First try via main, and if the file doesn't exist, fall back to the other branch?

I added a --branch switch to the clone command.

Now you can do:

autogenbench clone --branch autogenbench HumanEval   

to clone from this branch. Otherwise it defaults to main.
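For what it's worth, the main-then-fallback behavior suggested above could be sketched roughly as follows. This is not the actual AutoGenBench implementation: manifest_exists is a stand-in for whatever availability check the real clone command would perform (e.g. an HTTP request against the raw GitHub URL for the scenario manifest).

```python
def pick_branch(manifest_exists, preferred="main", fallback="autogenbench"):
    """Return the first candidate branch where the scenario manifest exists.

    manifest_exists is a callable taking a branch name and returning a bool;
    it is a placeholder for the actual existence check.
    """
    for branch in (preferred, fallback):
        if manifest_exists(branch):
            return branch
    raise FileNotFoundError("scenario manifest not found on any candidate branch")
```

For example, while the files exist only on the development branch, pick_branch(lambda b: b == "autogenbench") would resolve to "autogenbench"; once they land in main, the same call pattern would resolve to "main" without any user-facing change.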

@afourney
Member Author

afourney commented Jan 25, 2024

@ekzhu @qingyun-wu @victordibia @sonichi @rickyloynd-microsoft @julianakiseleva

Folks, I'm trying to get more eyeballs on this so that we can get this merged this week. Thanks @gagb for the earlier review.

If testing, be sure to use the "--branch autogenbench" parameter with the clone command since the files don't yet exist in main.

@qingyun-wu
Contributor

qingyun-wu commented Jan 25, 2024

Thanks, @afourney, for the tremendous effort and fantastic work! This might be minor (or not): folder names should all use lowercase and underscores, following Python convention. For special terms, e.g. benchmark names such as AutoGPT and GAIA, it may make sense to keep them as-is, but for general names such as the Scripts, Templates, and Tasks folders, I think it's better to use all lowercase.

@afourney
Member Author

afourney commented Jan 25, 2024

> Thanks, @afourney, for the tremendous effort and fantastic work! This might be minor (or not): folder names should all use lowercase and underscores, following Python convention. For special terms, e.g. benchmark names such as AutoGPT and GAIA, it may make sense to keep them as-is, but for general names such as the Scripts, Templates, and Tasks folders, I think it's better to use all lowercase.

Yeah, that's unfortunately both minor (in terms of complexity) and also a lot of work to fix (in terms of files to change and testing to do). Can this be handled in a follow-up PR?

Minimally, I will have to change all the manifest files. All the init_tasks scripts. All the tabulation scripts. The documentation. And the templates.
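At least the mechanical part of such a rename could be scripted. A rough sketch follows; the keep-list and directory layout are assumptions, and this only computes the renames -- it does not rewrite manifests, scripts, or docs:

```python
from pathlib import Path

# Benchmark names to preserve as-is (an assumed, illustrative list).
KEEP_AS_IS = {"AutoGPT", "GAIA", "HumanEval"}

def lowercase_renames(root):
    """Yield (old_path, new_path) pairs for directories needing lowercasing."""
    for path in sorted(Path(root).rglob("*"), reverse=True):  # deepest paths first
        if path.is_dir() and path.name not in KEEP_AS_IS and path.name != path.name.lower():
            yield path, path.with_name(path.name.lower())
```

Iterating deepest-first means a child directory is renamed before its parent, so earlier renames don't invalidate later paths.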

Contributor

@qingyun-wu qingyun-wu left a comment


I am fine with merging this PR.

@qingyun-wu qingyun-wu added this pull request to the merge queue Jan 26, 2024
Merged via the queue into main with commit cd199c7 Jan 26, 2024
22 checks passed
@sonichi sonichi deleted the autogenbench branch January 26, 2024 01:36
mtwalther pushed a commit to mtwalther/autogen that referenced this pull request Jan 26, 2024
* Initial commit of AutoGenBench

* wording

* typo

* pre-commit reformulation

* Updated README to point to contributor's guide earlier.

* Simplified the description of the JSON format.

* Added print statements to indicate when run.sh and scenario.py are starting.

* Added SocietyOfMind scenario to GAIA.

* Pointing autogenbench clone command to the latest branch.

* Temporarily disable subsample option.

* Updated the GAIA readme to specify how to define a BING API key.

* Fixed and re-enabled the subsample option.

* Added a draft of a blog post.

* Updated authors.

* Incorporating Gagan's feedback.

* Fixed code formatting.

* Updated the help string in the docs.

* Light editing of the AutoGenBench blogpost.

* Support filtering on model tags.

* Added websurfer dependencies to Dockerfile.

* Renamed testbed -> autogenbench

* Attempting to fix formatting.

* Added more graceful handling of task timeouts (the script is allowed to terminate before Docker is stopped).

* Updated the blogpost based on Saleema's and Julia's feedback.

* Fixed formatting... again.

* Added a main MANIFEST to list available scenarios.

* Limit main manifest to directories.

* Manifests now use relative paths.

* All manifests are now relative.

* Updated the contributing guide, and address windows path issues.

* Updated the version. Fixed formatting.

* Fixed formatting.

* De-listing Examples, since it has no clear tabulate criteria.

* Updated email in pyproject

* typo in blogpost

* wording

---------

Co-authored-by: Qingyun Wu <qingyun.wu@psu.edu>
Co-authored-by: Qingyun Wu <qingyun0327@gmail.com>
corleroux pushed a commit to corleroux/autogen that referenced this pull request Jan 30, 2024
whiskyboy pushed a commit to whiskyboy/autogen that referenced this pull request Apr 17, 2024
* Blogpost for adaptation in HumanEval

* doc

* fix link

* fix link

* explain

* model

* interface

* link

* typo

* doc
whiskyboy pushed a commit to whiskyboy/autogen that referenced this pull request Apr 17, 2024
Labels
autogenbench Issues related to AutoGenBench. evaluation Issues related to evaluating the framework
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Rename run_scenarios to autogenbench, and create python cli package
5 participants