Introduces AutoGenBench #1048

afourney · 2023-12-22T20:51:19Z

Why are these changes needed?

This PR introduces AutoGenBench, a tool for running common benchmarks and other templated tests with the AutoGen framework. It replaces the former "testbed" tool.

For full details, see the AutoGenBench README.md

TL;DR:
It is a pip-installable module that handles benchmarking and evaluation tasks. An example session might resemble the following:

autogenbench clone --branch autogenbench HumanEval
cd HumanEval
autogenbench run Tasks/r_human_eval_two_agents.jsonl
autogenbench tabulate results/r_human_eval_two_agents

Where:

autogenbench clone --branch autogenbench HumanEval downloads and expands the HumanEval benchmark scenario (from the autogenbench branch).
autogenbench run Tasks/r_human_eval_two_agents.jsonl runs the tasks defined in Tasks/r_human_eval_two_agents.jsonl.
autogenbench tabulate results/r_human_eval_two_agents tabulates the results of the run.

Note: You must include --branch autogenbench in the clone command when testing, otherwise it will look to the main branch and not find anything.
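
For scripted runs, the session above could be assembled programmatically. The following is only a minimal sketch: `session_commands` is a hypothetical helper, not part of AutoGenBench, and it assumes (from the example alone) that results land in results/<tasks file name without .jsonl>. It builds the argument lists without executing anything; the run and tabulate steps would still need to be executed from inside the cloned scenario directory.

```python
from pathlib import PurePosixPath


def session_commands(scenario, tasks_file, branch=None):
    """Return the three CLI invocations from the example session as argv lists."""
    clone = ["autogenbench", "clone"]
    if branch is not None:
        # --branch is the switch described in this PR; omit it to use the default.
        clone += ["--branch", branch]
    clone.append(scenario)

    # Assumption: the results directory is named after the tasks file stem,
    # as in the example session (Tasks/x.jsonl -> results/x).
    results_dir = str(PurePosixPath("results") / PurePosixPath(tasks_file).stem)

    return [
        clone,
        ["autogenbench", "run", tasks_file],
        ["autogenbench", "tabulate", results_dir],
    ]


if __name__ == "__main__":
    commands = session_commands(
        "HumanEval", "Tasks/r_human_eval_two_agents.jsonl", branch="autogenbench"
    )
    for argv in commands:
        print(" ".join(argv))
```

Each list can be passed to subprocess.run, with cwd set to the scenario directory for the last two steps.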

Related issue number

Closes #995, #987, #996

Supersedes #997

Checks

I've included any doc changes needed for https://microsoft.github.io/autogen/. See https://microsoft.github.io/autogen/docs/Contribute#documentation to build and test documentation locally.
I've added tests (if relevant) corresponding to the changes introduced in this PR.
I've made sure all auto checks have passed.

codecov-commenter · 2023-12-22T21:07:05Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (bcfd770) 32.48% compared to head (6089571) 32.48%.
Report is 1 commits behind head on main.

Additional details and impacted files

@@           Coverage Diff           @@
##             main    #1048   +/-   ##
=======================================
  Coverage   32.48%   32.48%           
=======================================
  Files          41       41           
  Lines        4907     4907           
  Branches     1120     1120           
=======================================
  Hits         1594     1594           
  Misses       3187     3187           
  Partials      126      126

Flag	Coverage Δ
unittests	`32.44% <ø> (ø)`


qingyun-wu · 2023-12-25T23:36:16Z

@afourney code formatting issues were fixed by running pre-commit run --all-files from my side. Also fixed some minor wording issues.

afourney · 2023-12-27T17:18:11Z

> @afourney code formatting issues were fixed by running pre-commit run --all-files from my side. Also fixed some minor wording issues.

Thanks. I'll re-install pre-commit hooks on my end, and hope the issues don't come back.

Josephrp · 2023-12-30T16:19:49Z

I'll try this out; it seems really interesting and useful.

afourney · 2024-01-25T04:12:55Z

NOTE: Once this is merged, and before deleting the "autogenbench" branch, we will need to follow up with another super-quick PR to point "autogenbench clone" to the main branch. I can't do that now because the files don't exist in that branch yet, so manual testing will break. I suspect this will be a one-time issue.

Can you fix this in the code? First try via main, and if the file doesn't exist, fall back to the other branch?

I added a --branch switch to the clone command.

Now you can do:

autogenbench clone --branch autogenbench HumanEval

to clone from this branch. Otherwise it defaults to main.
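
For comparison, the fallback suggested above (try main first, then another branch) might look roughly like the sketch below. This is not what the PR implements, since the PR adds the --branch switch instead. `fetch_with_fallback`, the `fetch` callable, and the MANIFEST.json file name are all hypothetical stand-ins for illustration.

```python
def fetch_with_fallback(fetch, path, branches=("main", "autogenbench")):
    """Try each branch in order and return (branch, content) for the first hit.

    `fetch(branch, path)` is a hypothetical callable that returns the file
    content, or raises FileNotFoundError when the branch lacks the file.
    """
    missing = []
    for branch in branches:
        try:
            return branch, fetch(branch, path)
        except FileNotFoundError:
            missing.append(branch)
    raise FileNotFoundError(f"{path} not found on branches: {', '.join(missing)}")


if __name__ == "__main__":
    # Stand-in for a real download: only the autogenbench branch has the file.
    fake_repo = {("autogenbench", "MANIFEST.json"): "{}"}

    def fake_fetch(branch, path):
        try:
            return fake_repo[(branch, path)]
        except KeyError:
            raise FileNotFoundError(path) from None

    print(fetch_with_fallback(fake_fetch, "MANIFEST.json"))
```

An explicit --branch switch avoids silently picking up files from an unexpected branch, which may be why it was preferred here.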

afourney · 2024-01-25T19:32:38Z

@ekzhu @qingyun-wu @victordibia @sonichi @rickyloynd-microsoft @julianakiseleva

Folks, I'm trying to get more eyeballs on this so that we can get this merged this week. Thanks @gagb for the earlier review.

If testing, be sure to use the "--branch autogenbench" parameter with the clone command since the files don't yet exist in main.

qingyun-wu · 2024-01-25T23:43:23Z

Thanks, @afourney, for the tremendous effort and fantastic work! This might be minor (or not): folder names should all use lowercase and underscores, following Python convention. For special terms, e.g., benchmark names such as AutoGPT and GAIA, it may make sense to keep them as is, but for other general folder names such as Scripts, Templates, and Tasks, I think it is better to use all lowercase.

afourney · 2024-01-25T23:49:31Z

> Thanks, @afourney, for the tremendous effort and fantastic work! This might be minor (or not): folder names should all use lowercase and underscores, following Python convention. For special terms, e.g., benchmark names such as AutoGPT and GAIA, it may make sense to keep them as is, but for other general folder names such as Scripts, Templates, and Tasks, I think it is better to use all lowercase.

Yeah, that's unfortunately both minor (in terms of complexity) and a lot of work to fix (in terms of files to change and testing to do). Can this be handled in a follow-up PR?

At a minimum, I would have to change all the manifest files, all the init_tasks scripts, all the tabulation scripts, the documentation, and the templates.

qingyun-wu

I am fine merging this PR.

* Initial commit of AutoGenBench
* wording
* typo
* pre-commit reformulation
* Updated README to point to contributor's guide earlier.
* Simplified the description of the JSON format.
* Added print statements to indicate when run.sh and scenario.py are starting.
* Added SocietyOfMind scenario to GAIA.
* Pointing autogenbench clone command to the latest branch.
* Temporarily disable subsample option.
* Updated the GAIA readme to specify how to define a BING API key.
* Fixed and re-enabled the subsample option.
* Added a draft of a blog post.
* Updated authors.
* Incorporating Gagan's feedback.
* Fixed code formatting.
* Updated the help string in the docs.
* Light editing of the AutoGenBench blogpost.
* Support filtering on model tags.
* Added websurfer dependencies to Dockerfile.
* Renamed testbed -> autogenbench
* Attempting to fix formatting.
* Added more gracefull handling of task timeouts (the script is allowed to terminate before Docker is stopped).
* Updated the blogpost based on Saleema's and Julia's feedback.
* Fixed formatting... again.
* Added a main MANIFEST to list available scenarios.
* Limit main manifest to directories.
* Manifests now use relative paths.
* All manifests are now relative.
* Updated the contributing guide, and address windows path issues.
* Updated the version. Fixed formatting.
* Fixed formatting.
* De-listing Examples, since it has no clear tabulate criteria.
* Updated email in pyproject
* typo in blogpost
* wording

---------

Co-authored-by: Qingyun Wu <qingyun.wu@psu.edu>
Co-authored-by: Qingyun Wu <qingyun0327@gmail.com>

Initial commit of AutoGenBench

d68a8f6

afourney added the evaluation and proj-autogenbench labels Dec 22, 2023

afourney mentioned this pull request Dec 22, 2023

Testbed refactor and rename. #997

Closed

3 tasks

Fixed merge conflicts

23fd4bf

qingyun-wu requested review from qingyun-wu, yiranwu0 and LeoLjl December 22, 2023 21:51

JieyuZ2 self-requested a review December 23, 2023 07:55

qingyun-wu added 4 commits December 25, 2023 18:13

Merge branch 'main' into autogenbench

e2517f2

wording

617bb5e

typo

fcf279b

pre-commit reformulation

6e9c3b7

Merge branch 'main' into autogenbench

a4a4134

afourney marked this pull request as ready for review January 2, 2024 23:44

afourney added 6 commits January 8, 2024 11:31

Merge main

f80f40b

Updated README to point to contributor's guide earlier.

051d0be

Merge branch 'main' into autogenbench

a81334b

Simplified the description of the JSON format.

5430191

Merge branch 'main' into autogenbench

485c923

Added print statements to indicate when run.sh and scenario.py are starting.

53dbe7b

afourney added 2 commits January 24, 2024 19:01

All manifests are now relative.

768f6d0

Updated the contributing guide, and address windows path issues.

41fb911

afourney added 2 commits January 24, 2024 20:16

Updated the version. Fixed formatting.

ee49d7d

Fixed formatting.

1b428cf

sonichi assigned qingyun-wu Jan 25, 2024

afourney requested review from rickyloynd-microsoft, victordibia and sonichi and removed request for LeoLjl January 25, 2024 19:30

afourney requested a review from julianakiseleva January 25, 2024 19:33

De-listing Examples, since it has no clear tabulate criteria.

6089571

qingyun-wu approved these changes Jan 26, 2024

qingyun-wu reviewed Jan 26, 2024

samples/tools/autogenbench/pyproject.toml (outdated)

afourney and others added 4 commits January 25, 2024 16:25

Updated email in pyproject

9e1bf4c

typo in blogpost

ac1f7e8

Merge branch 'autogenbench' of github.com:microsoft/autogen into autogenbench

7bc02e9

wording

54d509c

qingyun-wu added this pull request to the merge queue Jan 26, 2024

Merged via the queue into main with commit cd199c7 Jan 26, 2024
22 checks passed

sonichi deleted the autogenbench branch January 26, 2024 01:36

afourney mentioned this pull request Jan 27, 2024

Decide on proper casing of AutoGenBench folder names, and make changes as needed. #1424

Closed

whiskyboy pushed a commit to whiskyboy/autogen that referenced this pull request Apr 17, 2024

Blogpost for adaptation in HumanEval (microsoft#1048)

d3a75cc

* Blogpost for adaptation in HumanEval
* doc
* fix link
* fix link
* explain
* model
* interface
* link
* typo
* doc
