Fix issue #5222: [Refactor]: Refactor the evaluation directory #5223
base: main
Conversation
…arks while keeping other directories directly under evaluation/
Just noting that I have confirmed the code and it looks good to me, but I'd like a second review.
I'm trying to run 1 instance of swe-bench on this PR, and I get this error:
The shell scripts, every run_infer.sh in the benchmark directories, need to be updated to run the right .py file: e.g. ./evaluation/benchmarks/swe-bench/scripts/run_infer.sh contains the line
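(The quoted line is not reproduced above; the sketch below shows the kind of invocation meant, with the exact command and wrapper assumed rather than copied from the repo.)

```bash
# Hypothetical excerpt from evaluation/benchmarks/swe_bench/scripts/run_infer.sh.
# The module path still points at the pre-refactor location and would fail:
poetry run python evaluation/swe_bench/run_infer.py "$@"

# It needs to reference the new benchmarks/ level instead:
poetry run python evaluation/benchmarks/swe_bench/run_infer.py "$@"
```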
New OpenHands update
The workflow to fix this issue encountered an error. Please check the workflow logs for more information.
@openhands-agent The previous comment was fixed for swe-bench.
Read the script. We will find a line like:
This will fail because the new location is in the evaluation/benchmarks/webarena/ directory. You are a smart LLM; you understand patterns. The same pattern is repeated for these benchmarks, give or take that some have more files than others. Verify each benchmark's shell scripts and modify them accordingly.
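One way to verify this across the benchmark shell scripts is sketched below; only the two benchmarks named in this thread appear in the pattern, and the real alternation would need to cover every moved benchmark.

```bash
# Sketch: list every line in a shell script under the new layout that still
# references an old top-level benchmark path. Extend the alternation to cover
# all benchmarks, not just the two mentioned in this thread.
grep -rnE --include='*.sh' 'evaluation/(swe_bench|swe-bench|webarena)/' evaluation/benchmarks/
```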
New OpenHands update
The workflow to fix this issue encountered an error. Please check the workflow logs for more information.
The previous comments were fixed for shell scripts. After the refactoring of the benchmarks from the ./evaluation directory to ./evaluation/benchmarks/, it is important that human users are still able to run these benchmarks easily. In every benchmark directory there should be a README.md file. For example, in ./evaluation/benchmarks/swe-bench there is a README.md with instructions on how to set up and run the swe-bench benchmark. You can read it and see that it has, for example, a line like this:
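(The exact README line is not reproduced here; the sketch below shows the shape such a usage line takes, with the bracketed arguments as placeholders rather than the real ones.)

```bash
# Hypothetical usage line as it currently appears in the swe-bench README,
# still pointing at the old path; only the path prefix is the point:
./evaluation/swe_bench/scripts/run_infer.sh [model_config] [commit] [agent] [eval_limit]

# After the refactor the documented command has to use the new prefix:
./evaluation/benchmarks/swe_bench/scripts/run_infer.sh [model_config] [commit] [agent] [eval_limit]
```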
If the human user copies and pastes that line with their own data, it will fail to run the script, because of course the swe-bench run_infer.sh script has moved to ./evaluation/benchmarks/swe_bench/scripts/run_infer.sh. You're a smart LLM and you know patterns, remember. All these benchmarks are very similar and follow the same documentation patterns for human users. Verify every .md file in each benchmark (not only the README; check whether there are more) and update it for this particular move. Keep it minimal; only solve this particular issue.
New OpenHands update
The workflow to fix this issue encountered an error. Please check the workflow logs for more information.
Not all the README.md (or other .md) files discussed in the previous comment were updated for human users after the move of the benchmarks. Some were, some were not, so we need to fix the rest. But we have a problem. You're good, but you were not allowed to use the "str_replace_editor" tool when "old_str" is not unique in the file, so many of your attempted replacements were not performed. You had to go back and include more context; then they were performed, but you ran out of time. You need to understand this very well so that we do better this time. Remember, we are refactoring the ./evaluation directory to house every benchmark under ./evaluation/benchmarks. Remember the previous comment about documentation for human users: this is what we fix now. Usually there was more than one occurrence of the pattern in each file (such as "/evaluation/swe_bench" to be updated to the new location). It is possible there were two occurrences, or more, where one is the syntax of the command to run the benchmark and another is a particular example of running the command. First, think about how to do this better this time. You have two options:
Make a decision, then perform it. Do not ask me about it.
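For reference, one bash-based option is sketched below: batch-rewriting the old prefix so that the uniqueness of the surrounding context no longer matters. The file selection, the GNU grep/sed flags, and the two benchmark names are assumptions; the real benchmark list is longer.

```bash
# Sketch (GNU grep/sed): rewrite every old-style benchmark path in the
# Markdown docs in one pass instead of editing each occurrence individually.
grep -rlE --include='*.md' 'evaluation/(swe_bench|webarena)/' evaluation/benchmarks/ \
  | xargs -r sed -i -E 's#evaluation/(swe_bench|webarena)/#evaluation/benchmarks/\1/#g'
```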
New OpenHands update
The workflow to fix this issue encountered an error. Please check the workflow logs for more information.
You're good! Your choice to run bash scripts was brilliant! You fixed the rest of the documentation for human users in only 4 steps this time. You did very well, and I think all we have left is to double-check that there are no leftover old paths. If there are, we need to fix them. Leftovers could be in:
Remember to first look at all the benchmarks, as we moved them from ./evaluation to ./evaluation/benchmarks/, so that you know what you are working with. Check in order and update as needed. FINALLY:
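(The final instruction itself is not quoted above; as a closing sanity check along the lines discussed, a repo-wide search like the sketch below would surface leftovers in docs, scripts, and the workflow file alike. The benchmark names are only the ones mentioned in this thread.)

```bash
# Sketch: search the whole tree, including .md files and .github/workflows/,
# for any path that still skips the benchmarks/ level.
git grep -nE 'evaluation/(swe_bench|swe-bench|webarena)/'
```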
This PR fixes #5222 by reorganizing the evaluation directory structure to improve clarity and maintainability.
Changes
- New evaluation/benchmarks/ directory to house all ML literature benchmarks
- Other directories (utils, integration_tests, regression, static) kept directly under evaluation/
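The resulting layout, sketched from the directories named in this PR (the benchmark subdirectories are examples, not the full set):

```
evaluation/
├── README.md
├── benchmarks/          # swe_bench/, webarena/, and the other benchmarks
├── utils/
├── integration_tests/
├── regression/
└── static/
```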
Testing
Review Notes
Key files to review:
- .github/workflows/eval-runner.yml - Updated paths for integration tests and benchmarks
- evaluation/README.md - Added missing benchmarks and updated paths

To run this PR locally, use the following command: