
Fix issue #5222: [Refactor]: Refactor the evaluation directory #5223

Open · openhands-agent wants to merge 8 commits into main

Conversation

@openhands-agent (Contributor) commented Nov 23, 2024

This PR fixes #5222 by reorganizing the evaluation directory structure to improve clarity and maintainability.

Changes

  • Created evaluation/benchmarks/ directory to house all ML literature benchmarks (resulting layout sketched after this list)
  • Kept utility directories (utils, integration_tests, regression, static) directly under evaluation/
  • Updated paths in documentation and GitHub workflows to reflect the new structure
  • Added missing benchmarks to evaluation/README.md:
    • Commit0 and DiscoveryBench under Software Engineering
    • Browsing Delegation under Web Browsing
    • ScienceAgentBench under Misc. Assistance
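For illustration, the reorganized layout looks roughly like this (only the two benchmarks discussed later in this thread are shown; the real directory contains many more):

evaluation/
├── benchmarks/
│   ├── swe_bench/
│   ├── webarena/
│   └── ...                (other benchmarks)
├── integration_tests/
├── regression/
├── static/
├── utils/
└── README.md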

Testing

  • All pre-commit hooks pass (ruff, mypy, etc.)
  • All unit tests pass (377 tests)

Review Notes

Key files to review:

  • .github/workflows/eval-runner.yml - Updated paths for integration tests and benchmarks
  • evaluation/README.md - Added missing benchmarks and updated paths
  • Documentation files - Updated references to benchmark paths

To run this PR locally, use the following command:

docker run -it --rm \
  -p 3000:3000 \
  -v /var/run/docker.sock:/var/run/docker.sock \
  --add-host host.docker.internal:host-gateway \
  -e SANDBOX_RUNTIME_CONTAINER_IMAGE=docker.all-hands.dev/all-hands-ai/runtime:c6627f8-nikolaik \
  --name openhands-app-c6627f8 \
  docker.all-hands.dev/all-hands-ai/openhands:c6627f8

@neubig neubig marked this pull request as ready for review November 23, 2024 13:50
@neubig (Contributor) commented Nov 23, 2024

Just noting that I have confirmed the code and it looks good to me, but I'd like a second review.

@neubig neubig requested a review from enyst November 23, 2024 18:26
@enyst (Collaborator) commented Nov 23, 2024

I'm trying to run 1 instance of swe-bench on this PR, and I get this error:

....venv/bin/python: can't open file '/Users/enyst/repos/odie/evaluation/swe_bench/run_infer.py': [Errno 2] No such file or directory

The shell scripts (every run_infer.sh in the benchmarks directories) need to be updated to run the right .py file:

e.g. ./evaluation/benchmarks/swe_bench/scripts/run_infer.sh contains the line

COMMAND="poetry run python evaluation/swe_bench/run_infer.py \
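For reference, the fix presumably just points the script at the new location, i.e. something like:

COMMAND="poetry run python evaluation/benchmarks/swe_bench/run_infer.py \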

@enyst added the fix-me (Attempt to fix this issue with OpenHands) label Nov 23, 2024

OpenHands started fixing the pr! You can monitor the progress here.

@openhands-agent (Contributor, Author) commented:

New OpenHands update


The workflow to fix this issue encountered an error. Please check the workflow logs for more information.

@enyst (Collaborator) commented Nov 23, 2024

@openhands-agent The previous comment was fixed for swe-bench.
For every other benchmark that was moved from the ./evaluation directory to ./evaluation/benchmarks/, we need to verify whether any scripts hard-code the old structure when calling another script. For example, in ./evaluation/benchmarks/webarena, we can search for .sh scripts, and we will find:

./evaluation/benchmarks/webarena/scripts/run_infer.sh

Read the script. We will find a line like:

COMMAND="poetry run python evaluation/webarena/run_infer.py \

This will fail because the new location is the evaluation/benchmarks/webarena/ directory.

You are a smart LLM, you understand patterns. The same pattern is repeated across these benchmarks, give or take that some have more files than others. Verify each benchmark's shell scripts and modify them accordingly.
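A minimal sketch of one way to do the check and the fix with the bash tool, assuming GNU sed and that old references are written as evaluation/<name>/run_infer.py (the character class for benchmark names is an assumption; review the diff before committing):

# 1) List .sh scripts under the moved benchmarks that still reference the old layout.
grep -rl --include='*.sh' 'evaluation/[a-z0-9_]*/run_infer\.py' evaluation/benchmarks/

# 2) Rewrite those references in place; paths already under evaluation/benchmarks/ are not matched.
grep -rlZ --include='*.sh' 'evaluation/[a-z0-9_]*/run_infer\.py' evaluation/benchmarks/ \
  | xargs -0 sed -i 's#evaluation/\([a-z0-9_]*\)/run_infer\.py#evaluation/benchmarks/\1/run_infer.py#g'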


OpenHands started fixing the pr! You can monitor the progress here.

@enyst removed and re-added the fix-me (Attempt to fix this issue with OpenHands) label Nov 23, 2024

OpenHands started fixing the pr! You can monitor the progress here.

@openhands-agent (Contributor, Author) commented:

New OpenHands update


The workflow to fix this issue encountered an error. Please check the workflow logs for more information.

@enyst (Collaborator) commented Nov 23, 2024

The previous comments were fixed for shell scripts.

After the refactoring of benchmarks from the ./evaluation directory to ./evaluation/benchmarks/, it is important that human users are still able to run these benchmarks easily. In every benchmark directory, there should be a README.md file. For example, in ./evaluation/benchmarks/swe_bench, there is a README.md with instructions on how to set up and run the SWE-bench benchmark.

You can read it and see that it has, for example, a line like this:

./evaluation/swe_bench/scripts/run_infer.sh [model_config] [git-version] [agent] [eval_limit] [max_iter] [num_workers] [dataset] [dataset_split]

If the human user copies and pastes that line with their own data, it will fail to run the script, because of course the swe-bench run_infer.sh script has moved to ./evaluation/benchmarks/swe_bench/scripts/run_infer.sh.
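So the documented command should presumably now read:

./evaluation/benchmarks/swe_bench/scripts/run_infer.sh [model_config] [git-version] [agent] [eval_limit] [max_iter] [num_workers] [dataset] [dataset_split]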

You're a smart LLM and you know patterns, remember. All these benchmarks are very similar and follow the same documentation patterns for human users. Verify every .md file (not only the README; check whether there are more) in each benchmark and update it for this particular move. Keep it minimal and only solve this particular issue.
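A quick way to see which docs still need the update might be (a sketch; the pattern assumes the old paths are written as evaluation/<name>/scripts/...):

grep -rl --include='*.md' 'evaluation/[a-z0-9_]*/scripts/' evaluation/benchmarks/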

@enyst removed and re-added the fix-me (Attempt to fix this issue with OpenHands) label Nov 23, 2024

OpenHands started fixing the pr! You can monitor the progress here.

@openhands-agent (Contributor, Author) commented:

New OpenHands update


The workflow to fix this issue encountered an error. Please check the workflow logs for more information.

@enyst (Collaborator) commented Nov 23, 2024

Not all of the README.md (or other .md) files discussed in the previous comment were updated for human users after the move of the benchmarks.

Some were, some were not. So we need to fix the rest.

But we have a problem. You're good, but you were not allowed to use the str_replace_editor tool when old_str was not unique in the file, so many of your attempted replacements were not performed. You had to go back and include more context. They were then performed, but you ran out of time.

You need to understand this very well, so that we do better this time. Remember, we are refactoring the ./evaluation directory to house every benchmark under ./evaluation/benchmarks. Remember the previous comment about documentation for human users: this is what we fix now.

Usually there was more than one occurrence of the pattern in each file (such as "/evaluation/swe_bench" to be updated to the new location). There may have been two or more occurrences: one in the syntax of the command to run the benchmark, and another in a concrete example of running the command.

First, think about how to do this better this time. You have two options:

  • you can take more context every time you want to use the str_replace_editor tool. Not a lot more, just more than the patterned string itself.
  • use the bash tool instead, with a bash command to perform the replacements in each .md file all at once. In this case, you need to find them first, to know what you need to do (see the sketch after this list).

Make a decision, then perform it. Do not ask me about it.
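A sketch of the second option, assuming GNU sed and using only the two benchmark directory names confirmed in this thread (extend the alternation with the real list of moved directories, and review git diff afterwards):

grep -rlZ --include='*.md' -E 'evaluation/(swe_bench|webarena)/' evaluation/benchmarks/ \
  | xargs -0 sed -i -E 's#evaluation/(swe_bench|webarena)/#evaluation/benchmarks/\1/#g'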

@enyst removed and re-added the fix-me (Attempt to fix this issue with OpenHands) label Nov 23, 2024

OpenHands started fixing the pr! You can monitor the progress here.

@openhands-agent (Contributor, Author) commented:

New OpenHands update


The workflow to fix this issue encountered an error. Please check the workflow logs for more information.

@enyst (Collaborator) commented Nov 23, 2024

You're good! Your choice to run bash scripts was brilliant!

You fixed the rest of the documentation for human users in only 4 steps this time.

Now. You did very well, and I think all we have left is to double-check that there are no leftover old paths. If there are, we need to fix them.

Leftovers could be in:

  • .py files. This is by far the most important: stale module paths can break everything. You did so well with the bash tool previously that I suggest you try it again to verify all imports in .py files, and any other occurrences of the old patterns, in the benchmark files (a sketch follows below).
  • .sh scripts. Verify them and fix if necessary.
  • You can skip .md files this time, you did a very good verification last time.
  • Other files.

Remember to first look at all the benchmarks, as we moved them from ./evaluation to ./evaluation/benchmarks/, so that you know what you are working with.

Check in order and update as needed.
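A sketch of the .py check (the module-name pattern and the exclusion list are assumptions, based on which directories stayed directly under evaluation/):

# Imports that still use the old module path ("evaluation.<benchmark>" instead of
# "evaluation.benchmarks.<benchmark>"); utils and friends stayed under evaluation/, so exclude them.
grep -rn --include='*.py' -E '(from|import) evaluation\.[a-z0-9_]+' evaluation/ \
  | grep -vE 'evaluation\.(benchmarks|utils|integration_tests|regression|static)'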

FINALLY:
When you finish with all of them, I want you to make an extra check:

  • in the root directory this time, there is a file called CREDITS.md. Read it. It probably doesn't have paths to update, and that's okay, but it does have a list of benchmarks and some other details about them. Update that file according to what you have learned about our current benchmarks.

@enyst removed and re-added the fix-me (Attempt to fix this issue with OpenHands) label Nov 23, 2024

OpenHands started fixing the pr! You can monitor the progress here.

Labels: fix-me (Attempt to fix this issue with OpenHands)
Linked issue: [Refactor]: Refactor the evaluation directory
3 participants