
Fix issue #5222: [Refactor]: Refactor the evaluation directory #5223

Merged: 11 commits, Nov 25, 2024
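The change itself is almost entirely mechanical: each benchmark moves from `evaluation/<name>` to `evaluation/benchmarks/<name>`, and every reference in workflows, docs, shell scripts, and Python imports is updated to match. A rough sketch of how such a migration could be scripted, shown for a single benchmark (illustrative only, not the commands actually used in this PR):

```bash
# Illustrative sketch, not the actual migration commands from this PR.
# Move one benchmark under evaluation/benchmarks/, then rewrite references.
mkdir -p evaluation/benchmarks
git mv evaluation/swe_bench evaluation/benchmarks/swe_bench  # repeat per benchmark

# Update shell/doc/workflow paths and Python imports in tracked files.
git grep -lE 'evaluation[/.]swe_bench' \
  | xargs sed -i \
      -e 's#evaluation/swe_bench#evaluation/benchmarks/swe_bench#g' \
      -e 's#evaluation\.swe_bench#evaluation.benchmarks.swe_bench#g'
```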
6 changes: 3 additions & 3 deletions .github/workflows/eval-runner.yml
@@ -86,12 +86,12 @@ jobs:
EVAL_DOCKER_IMAGE_PREFIX: us-central1-docker.pkg.dev/evaluation-092424/swe-bench-images

run: |
poetry run ./evaluation/swe_bench/scripts/run_infer.sh llm.eval HEAD CodeActAgent 300 30 $N_PROCESSES "princeton-nlp/SWE-bench_Lite" test
poetry run ./evaluation/benchmarks/swe_bench/scripts/run_infer.sh llm.eval HEAD CodeActAgent 300 30 $N_PROCESSES "princeton-nlp/SWE-bench_Lite" test
OUTPUT_FOLDER=$(find evaluation/evaluation_outputs/outputs/princeton-nlp__SWE-bench_Lite-test/CodeActAgent -name "deepseek-chat_maxiter_50_N_*-no-hint-run_1" -type d | head -n 1)
echo "OUTPUT_FOLDER for SWE-bench evaluation: $OUTPUT_FOLDER"
poetry run ./evaluation/swe_bench/scripts/eval_infer_remote.sh $OUTPUT_FOLDER/output.jsonl $N_PROCESSES "princeton-nlp/SWE-bench_Lite" test
poetry run ./evaluation/benchmarks/swe_bench/scripts/eval_infer_remote.sh $OUTPUT_FOLDER/output.jsonl $N_PROCESSES "princeton-nlp/SWE-bench_Lite" test

poetry run ./evaluation/swe_bench/scripts/eval/summarize_outputs.py $OUTPUT_FOLDER/output.jsonl > summarize_outputs.log 2>&1
poetry run ./evaluation/benchmarks/swe_bench/scripts/eval/summarize_outputs.py $OUTPUT_FOLDER/output.jsonl > summarize_outputs.log 2>&1
echo "SWEBENCH_REPORT<<EOF" >> $GITHUB_ENV
cat summarize_outputs.log >> $GITHUB_ENV
echo "EOF" >> $GITHUB_ENV
@@ -76,7 +76,7 @@ The `run_controller()` function is the core of OpenHands's execution. It manages

## Easiest way to get started: Exploring Existing Benchmarks

We encourage you to review the various evaluation benchmarks available in the [`evaluation/` directory](https://github.com/All-Hands-AI/OpenHands/blob/main/evaluation) of our repository.
We encourage you to review the various evaluation benchmarks available in the [`evaluation/benchmarks/` directory](https://github.com/All-Hands-AI/OpenHands/blob/main/evaluation/benchmarks) of our repository.

To integrate your own benchmark, we suggest starting with the one that most closely resembles your needs. This approach can significantly streamline your integration process, allowing you to build upon existing structures and adapt them to your specific requirements.

@@ -73,7 +73,7 @@ The main entry point of OpenHands is in `openhands/core/main.py`. Here is how it works

## Easiest way to get started: Exploring Existing Benchmarks

We encourage you to review the various evaluation benchmarks available in the [`evaluation/` directory](https://github.com/All-Hands-AI/OpenHands/blob/main/evaluation) of our repository.
We encourage you to review the various evaluation benchmarks available in the [`evaluation/benchmarks/` directory](https://github.com/All-Hands-AI/OpenHands/blob/main/evaluation/benchmarks) of our repository.

To integrate your own benchmark, we suggest starting with the one that most closely resembles your needs. This approach can significantly streamline your integration process, allowing you to build upon existing structures and adapt them to your specific requirements.

2 changes: 1 addition & 1 deletion docs/modules/usage/how-to/evaluation-harness.md
@@ -73,7 +73,7 @@ The `run_controller()` function is the core of OpenHands's execution. It manages

## Easiest way to get started: Exploring Existing Benchmarks

We encourage you to review the various evaluation benchmarks available in the [`evaluation/` directory](https://github.com/All-Hands-AI/OpenHands/blob/main/evaluation) of our repository.
We encourage you to review the various evaluation benchmarks available in the [`evaluation/benchmarks/` directory](https://github.com/All-Hands-AI/OpenHands/blob/main/evaluation/benchmarks) of our repository.

To integrate your own benchmark, we suggest starting with the one that most closely resembles your needs. This approach can significantly streamline your integration process, allowing you to build upon existing structures and adapt them to your specific requirements.

38 changes: 21 additions & 17 deletions evaluation/README.md
@@ -46,28 +46,32 @@ The OpenHands evaluation harness supports a wide variety of benchmarks across so

### Software Engineering

- SWE-Bench: [`evaluation/swe_bench`](./swe_bench)
- HumanEvalFix: [`evaluation/humanevalfix`](./humanevalfix)
- BIRD: [`evaluation/bird`](./bird)
- BioCoder: [`evaluation/ml_bench`](./ml_bench)
- ML-Bench: [`evaluation/ml_bench`](./ml_bench)
- APIBench: [`evaluation/gorilla`](./gorilla/)
- ToolQA: [`evaluation/toolqa`](./toolqa/)
- AiderBench: [`evaluation/aider_bench`](./aider_bench/)
- SWE-Bench: [`evaluation/benchmarks/swe_bench`](./benchmarks/swe_bench)
- HumanEvalFix: [`evaluation/benchmarks/humanevalfix`](./benchmarks/humanevalfix)
- BIRD: [`evaluation/benchmarks/bird`](./benchmarks/bird)
- BioCoder: [`evaluation/benchmarks/ml_bench`](./benchmarks/ml_bench)
- ML-Bench: [`evaluation/benchmarks/ml_bench`](./benchmarks/ml_bench)
- APIBench: [`evaluation/benchmarks/gorilla`](./benchmarks/gorilla/)
- ToolQA: [`evaluation/benchmarks/toolqa`](./benchmarks/toolqa/)
- AiderBench: [`evaluation/benchmarks/aider_bench`](./benchmarks/aider_bench/)
- Commit0: [`evaluation/benchmarks/commit0_bench`](./benchmarks/commit0_bench/)
- DiscoveryBench: [`evaluation/benchmarks/discoverybench`](./benchmarks/discoverybench/)

### Web Browsing

- WebArena: [`evaluation/webarena`](./webarena/)
- MiniWob++: [`evaluation/miniwob`](./miniwob/)
- WebArena: [`evaluation/benchmarks/webarena`](./benchmarks/webarena/)
- MiniWob++: [`evaluation/benchmarks/miniwob`](./benchmarks/miniwob/)
- Browsing Delegation: [`evaluation/benchmarks/browsing_delegation`](./benchmarks/browsing_delegation/)

### Misc. Assistance

- GAIA: [`evaluation/gaia`](./gaia)
- GPQA: [`evaluation/gpqa`](./gpqa)
- AgentBench: [`evaluation/agent_bench`](./agent_bench)
- MINT: [`evaluation/mint`](./mint)
- Entity deduction Arena (EDA): [`evaluation/EDA`](./EDA)
- ProofWriter: [`evaluation/logic_reasoning`](./logic_reasoning)
- GAIA: [`evaluation/benchmarks/gaia`](./benchmarks/gaia)
- GPQA: [`evaluation/benchmarks/gpqa`](./benchmarks/gpqa)
- AgentBench: [`evaluation/benchmarks/agent_bench`](./benchmarks/agent_bench)
- MINT: [`evaluation/benchmarks/mint`](./benchmarks/mint)
- Entity deduction Arena (EDA): [`evaluation/benchmarks/EDA`](./benchmarks/EDA)
- ProofWriter: [`evaluation/benchmarks/logic_reasoning`](./benchmarks/logic_reasoning)
- ScienceAgentBench: [`evaluation/benchmarks/scienceagentbench`](./benchmarks/scienceagentbench)

## Result Visualization

@@ -79,7 +83,7 @@ You can start your own fork of [our huggingface evaluation outputs](https://hugg

To learn more about how to integrate your benchmark into OpenHands, check out the [tutorial here](https://docs.all-hands.dev/modules/usage/how-to/evaluation-harness). Briefly,

- Each subfolder contains a specific benchmark or experiment. For example, `evaluation/swe_bench` should contain
- Each subfolder contains a specific benchmark or experiment. For example, `evaluation/benchmarks/swe_bench` should contain
all the preprocessing/evaluation/analysis scripts.
- Raw data and experimental records should not be stored within this repo.
- Model outputs should be stored at [this Hugging Face space](https://huggingface.co/spaces/OpenHands/evaluation) for visualization.
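As a rough illustration of the layout this implies for a new benchmark (the file names below are hypothetical; only the `run_infer.py` plus `scripts/` convention is taken from the benchmarks in this diff):

```bash
# Hypothetical benchmark layout following the convention above.
ls -R evaluation/benchmarks/my_bench
# README.md                     - setup and run instructions
# run_infer.py                  - inference entry point invoked by the wrapper
# helper.py                     - prompts, fake responses, result comparison
# scripts/run_infer.sh          - sets PYTHONPATH and runs run_infer.py via poetry
# scripts/summarize_results.py  - post-hoc analysis of output.jsonl
```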
@@ -12,7 +12,7 @@ Please follow the instructions [here](../README.md#setup) to set up your local develop

```bash
export OPENAI_API_KEY="sk-XXX"; # This is required for evaluation (to simulate another party of conversation)
./evaluation/EDA/scripts/run_infer.sh [model_config] [git-version] [agent] [dataset] [eval_limit]
./evaluation/benchmarks/EDA/scripts/run_infer.sh [model_config] [git-version] [agent] [dataset] [eval_limit]
```

where `model_config` is mandatory, while `git-version`, `agent`, `dataset` and `eval_limit` are optional.
@@ -33,7 +33,7 @@ to `CodeActAgent`.
For example,

```bash
./evaluation/EDA/scripts/run_infer.sh eval_gpt4o_2024_05_13 0.6.2 CodeActAgent things
./evaluation/benchmarks/EDA/scripts/run_infer.sh eval_gpt4o_2024_05_13 0.6.2 CodeActAgent things
```

## Reference
File renamed without changes.
@@ -4,7 +4,7 @@
import pandas as pd
from datasets import load_dataset

from evaluation.EDA.game import Q20Game, Q20GameCelebrity
from evaluation.benchmarks.EDA.game import Q20Game, Q20GameCelebrity
from evaluation.utils.shared import (
EvalMetadata,
EvalOutput,
@@ -43,7 +43,7 @@ echo "AGENT_VERSION: $AGENT_VERSION"
echo "MODEL_CONFIG: $MODEL_CONFIG"
echo "DATASET: $DATASET"

COMMAND="poetry run python evaluation/EDA/run_infer.py \
COMMAND="poetry run python evaluation/benchmarks/EDA/run_infer.py \
--agent-cls $AGENT \
--llm-config $MODEL_CONFIG \
--dataset $DATASET \
@@ -9,7 +9,7 @@ Please follow the instructions [here](../README.md#setup) to set up your local develop
## Start the evaluation

```bash
./evaluation/agent_bench/scripts/run_infer.sh [model_config] [git-version] [agent] [eval_limit]
./evaluation/benchmarks/agent_bench/scripts/run_infer.sh [model_config] [git-version] [agent] [eval_limit]
```

- `model_config`, e.g. `eval_gpt4_1106_preview`, is the config group name for your
@@ -25,7 +25,7 @@ in order to use `eval_limit`, you must also set `agent`.

Following is the basic command to start the evaluation.

You can update the arguments in the script `evaluation/agent_bench/scripts/run_infer.sh`, such as `--max-iterations`, `--eval-num-workers` and so on.
You can update the arguments in the script `evaluation/benchmarks/agent_bench/scripts/run_infer.sh`, such as `--max-iterations`, `--eval-num-workers` and so on.

- `--agent-cls`, the agent to use. For example, `CodeActAgent`.
- `--llm-config`: the LLM configuration to use. For example, `eval_gpt4_1106_preview`.
@@ -34,5 +34,5 @@ You can update the arguments in the script `evaluation/agent_bench/scripts/run_infer.s
- `--eval-n-limit`: the number of examples to evaluate. For example, `100`.

```bash
./evaluation/agent_bench/scripts/run_infer.sh eval_gpt35_turbo HEAD CodeActAgent 1
./evaluation/benchmarks/agent_bench/scripts/run_infer.sh eval_gpt35_turbo HEAD CodeActAgent 1
```
@@ -7,7 +7,7 @@
import pandas as pd
from datasets import load_dataset

from evaluation.agent_bench.helper import (
from evaluation.benchmarks.agent_bench.helper import (
FAKE_RESPONSES,
INST_SUFFIXES,
compare_results,
@@ -26,7 +26,7 @@ echo "AGENT: $AGENT"
echo "AGENT_VERSION: $AGENT_VERSION"
echo "MODEL_CONFIG: $MODEL_CONFIG"

COMMAND="export PYTHONPATH=evaluation/agent_bench:\$PYTHONPATH && poetry run python evaluation/agent_bench/run_infer.py \
COMMAND="export PYTHONPATH=evaluation/benchmarks/agent_bench:\$PYTHONPATH && poetry run python evaluation/benchmarks/agent_bench/run_infer.py \
--agent-cls $AGENT \
--llm-config $MODEL_CONFIG \
--max-iterations 30 \
@@ -16,7 +16,7 @@ development environment and LLM.
## Start the evaluation

```bash
./evaluation/aider_bench/scripts/run_infer.sh [model_config] [git-version] [agent] [eval_limit] [eval-num-workers] [eval_ids]
./evaluation/benchmarks/aider_bench/scripts/run_infer.sh [model_config] [git-version] [agent] [eval_limit] [eval-num-workers] [eval_ids]
```

- `model_config`, e.g. `eval_gpt4_1106_preview`, is the config group name for
@@ -42,7 +42,7 @@ export SKIP_NUM=12 # skip the first 12 instances from the dataset
Following is the basic command to start the evaluation.

You can update the arguments in the script
`evaluation/aider_bench/scripts/run_infer.sh`, such as `--max-iterations`,
`evaluation/benchmarks/aider_bench/scripts/run_infer.sh`, such as `--max-iterations`,
`--eval-num-workers` and so on:

- `--agent-cls`, the agent to use. For example, `CodeActAgent`.
@@ -53,33 +53,33 @@ You can update the arguments in the script
- `--eval-ids`: the IDs of the examples to evaluate (comma separated). For example, `"1,3,10"`.

```bash
./evaluation/aider_bench/scripts/run_infer.sh eval_gpt35_turbo HEAD CodeActAgent 100 1 "1,3,10"
./evaluation/benchmarks/aider_bench/scripts/run_infer.sh eval_gpt35_turbo HEAD CodeActAgent 100 1 "1,3,10"
```

### Run Inference on `RemoteRuntime` (experimental)

This is in limited beta. Contact Xingyao over Slack if you want to try this out!

```bash
./evaluation/aider_bench/scripts/run_infer.sh [model_config] [git-version] [agent] [eval_limit] [eval-num-workers] [eval_ids]
./evaluation/benchmarks/aider_bench/scripts/run_infer.sh [model_config] [git-version] [agent] [eval_limit] [eval-num-workers] [eval_ids]

# Example - This runs evaluation on CodeActAgent for 133 instances on the aider_bench test set, with 2 workers running in parallel
export ALLHANDS_API_KEY="YOUR-API-KEY"
export RUNTIME=remote
export SANDBOX_REMOTE_RUNTIME_API_URL="https://runtime.eval.all-hands.dev"
./evaluation/aider_bench/scripts/run_infer.sh llm.eval HEAD CodeActAgent 133 2
./evaluation/benchmarks/aider_bench/scripts/run_infer.sh llm.eval HEAD CodeActAgent 133 2
```

## Summarize Results

```bash
poetry run python ./evaluation/aider_bench/scripts/summarize_results.py [path_to_output_jsonl_file]
poetry run python ./evaluation/benchmarks/aider_bench/scripts/summarize_results.py [path_to_output_jsonl_file]
```

Full example:

```bash
poetry run python ./evaluation/aider_bench/scripts/summarize_results.py evaluation/evaluation_outputs/outputs/AiderBench/CodeActAgent/claude-3-5-sonnet@20240620_maxiter_30_N_v1.9/output.jsonl
poetry run python ./evaluation/benchmarks/aider_bench/scripts/summarize_results.py evaluation/evaluation_outputs/outputs/AiderBench/CodeActAgent/claude-3-5-sonnet@20240620_maxiter_30_N_v1.9/output.jsonl
```

This will list the instances that passed and the instances that failed. For each
@@ -7,7 +7,7 @@
import pandas as pd
from datasets import load_dataset

from evaluation.aider_bench.helper import (
from evaluation.benchmarks.aider_bench.helper import (
FAKE_RESPONSES,
INST_SUFFIXES,
INSTRUCTIONS_ADDENDUM,
@@ -39,7 +39,7 @@ if [ "$USE_UNIT_TESTS" = true ]; then
EVAL_NOTE=$EVAL_NOTE-w-test
fi

COMMAND="export PYTHONPATH=evaluation/aider_bench:\$PYTHONPATH && poetry run python evaluation/aider_bench/run_infer.py \
COMMAND="export PYTHONPATH=evaluation/benchmarks/aider_bench:\$PYTHONPATH && poetry run python evaluation/benchmarks/aider_bench/run_infer.py \
--agent-cls $AGENT \
--llm-config $MODEL_CONFIG \
--max-iterations 30 \
@@ -21,7 +21,7 @@ To reproduce this image, please see the Dockerfile_Openopenhands in the `biocode


```bash
./evaluation/biocoder/scripts/run_infer.sh [model_config] [git-version] [agent] [eval_limit]
./evaluation/benchmarks/biocoder/scripts/run_infer.sh [model_config] [git-version] [agent] [eval_limit]
```

where `model_config` is mandatory, while `git-version`, `agent`, `dataset` and `eval_limit` are optional.
@@ -43,7 +43,7 @@ with the current OpenHands version, then your command would be:
## Examples

```bash
./evaluation/biocoder/scripts/run_infer.sh eval_gpt4o_2024_05_13 HEAD CodeActAgent 1
./evaluation/benchmarks/biocoder/scripts/run_infer.sh eval_gpt4o_2024_05_13 HEAD CodeActAgent 1
```

## Reference
@@ -8,7 +8,7 @@
import pandas as pd
from datasets import load_dataset

from evaluation.biocoder.utils import BiocoderData
from evaluation.benchmarks.biocoder.utils import BiocoderData
from evaluation.utils.shared import (
EvalMetadata,
EvalOutput,
@@ -28,7 +28,7 @@ echo "AGENT_VERSION: $AGENT_VERSION"
echo "MODEL_CONFIG: $MODEL_CONFIG"
echo "DATASET: $DATASET"

COMMAND="poetry run python evaluation/biocoder/run_infer.py \
COMMAND="poetry run python evaluation/benchmarks/biocoder/run_infer.py \
--agent-cls $AGENT \
--llm-config $MODEL_CONFIG \
--max-iterations 10 \
File renamed without changes.
@@ -9,7 +9,7 @@ Please follow the instructions [here](../README.md#setup) to set up your local develop
## Run Inference on Bird

```bash
./evaluation/bird/scripts/run_infer.sh [model_config] [git-version]
./evaluation/benchmarks/bird/scripts/run_infer.sh [model_config] [git-version]
```

- `model_config`, e.g. `eval_gpt4_1106_preview`, is the config group name for your
File renamed without changes.
File renamed without changes.
@@ -26,7 +26,7 @@ echo "AGENT: $AGENT"
echo "AGENT_VERSION: $AGENT_VERSION"
echo "MODEL_CONFIG: $MODEL_CONFIG"

COMMAND="poetry run python evaluation/bird/run_infer.py \
COMMAND="poetry run python evaluation/benchmarks/bird/run_infer.py \
--agent-cls $AGENT \
--llm-config $MODEL_CONFIG \
--max-iterations 5 \
@@ -12,7 +12,7 @@ Please follow the instructions [here](../README.md#setup) to set up your local develop
## Run Inference

```bash
./evaluation/browsing_delegation/scripts/run_infer.sh [model_config] [git-version] [agent] [eval_limit]
./evaluation/benchmarks/browsing_delegation/scripts/run_infer.sh [model_config] [git-version] [agent] [eval_limit]
# e.g., ./evaluation/swe_bench/scripts/run_infer.sh llm.eval_gpt4_1106_preview_llm HEAD CodeActAgent 300
```

@@ -28,7 +28,7 @@ echo "MODEL_CONFIG: $MODEL_CONFIG"

EVAL_NOTE="$AGENT_VERSION"

COMMAND="poetry run python evaluation/browsing_delegation/run_infer.py \
COMMAND="poetry run python evaluation/benchmarks/browsing_delegation/run_infer.py \
--agent-cls $AGENT \
--llm-config $MODEL_CONFIG \
--max-iterations 1 \
@@ -24,10 +24,10 @@ Make sure your Docker daemon is running, and you have ample disk space (at least
When the `run_infer.sh` script is started, it will automatically pull the `lite` split in Commit0. For example, for instance ID `commit-0/minitorch`, it will try to pull our pre-built Docker image `wentingzhao/minitorch` from DockerHub. This image will be used to create an OpenHands runtime image in which the agent will operate.
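A minimal sketch of that instance-to-image mapping (it assumes the `wentingzhao` DockerHub namespace shown above; the snippet is illustrative, not the script's actual code):

```bash
# Illustrative only: derive and pre-pull the image for a Commit0 instance,
# mirroring what run_infer.sh does automatically.
INSTANCE_ID="commit-0/minitorch"        # example instance from the lite split
REPO_NAME="${INSTANCE_ID##*/}"          # -> "minitorch"
docker pull "wentingzhao/${REPO_NAME}"  # -> wentingzhao/minitorch
```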

```bash
./evaluation/commit0_bench/scripts/run_infer.sh [repo_split] [model_config] [git-version] [agent] [eval_limit] [max_iter] [num_workers] [dataset] [dataset_split]
./evaluation/benchmarks/commit0_bench/scripts/run_infer.sh [repo_split] [model_config] [git-version] [agent] [eval_limit] [max_iter] [num_workers] [dataset] [dataset_split]

# Example
./evaluation/commit0_bench/scripts/run_infer.sh lite llm.eval_sonnet HEAD CodeActAgent 16 100 8 wentingzhao/commit0_combined test
./evaluation/benchmarks/commit0_bench/scripts/run_infer.sh lite llm.eval_sonnet HEAD CodeActAgent 16 100 8 wentingzhao/commit0_combined test
```

where `model_config` is mandatory, and the rest are optional.
@@ -56,25 +56,25 @@ Let's say you'd like to run 10 instances using `llm.eval_sonnet` and CodeActAgen
then your command would be:

```bash
./evaluation/commit0_bench/scripts/run_infer.sh lite llm.eval_sonnet HEAD CodeActAgent 10 30 1 wentingzhao/commit0_combined test
./evaluation/benchmarks/commit0_bench/scripts/run_infer.sh lite llm.eval_sonnet HEAD CodeActAgent 10 30 1 wentingzhao/commit0_combined test
```

### Run Inference on `RemoteRuntime` (experimental)

This is in limited beta. Contact Xingyao over Slack if you want to try this out!

```bash
./evaluation/commit0_bench/scripts/run_infer.sh [repo_split] [model_config] [git-version] [agent] [eval_limit] [max_iter] [num_workers] [dataset] [dataset_split]
./evaluation/benchmarks/commit0_bench/scripts/run_infer.sh [repo_split] [model_config] [git-version] [agent] [eval_limit] [max_iter] [num_workers] [dataset] [dataset_split]

# Example - This runs evaluation on CodeActAgent for 10 instances on the "wentingzhao/commit0_combined" test set, with a max of 30 iterations per instance and 1 worker running in parallel
ALLHANDS_API_KEY="YOUR-API-KEY" RUNTIME=remote SANDBOX_REMOTE_RUNTIME_API_URL="https://runtime.eval.all-hands.dev" EVAL_DOCKER_IMAGE_PREFIX="docker.io/wentingzhao" \
./evaluation/commit0_bench/scripts/run_infer.sh lite llm.eval_sonnet HEAD CodeActAgent 10 30 1 wentingzhao/commit0_combined test
./evaluation/benchmarks/commit0_bench/scripts/run_infer.sh lite llm.eval_sonnet HEAD CodeActAgent 10 30 1 wentingzhao/commit0_combined test
```

To clean up all existing runtimes you've already started, run:

```bash
ALLHANDS_API_KEY="YOUR-API-KEY" ./evaluation/commit0_bench/scripts/cleanup_remote_runtime.sh
ALLHANDS_API_KEY="YOUR-API-KEY" ./evaluation/benchmarks/commit0_bench/scripts/cleanup_remote_runtime.sh
```

### Specify a subset of tasks to run infer
@@ -91,7 +91,7 @@ fi

function run_eval() {
local eval_note=$1
COMMAND="poetry run python evaluation/commit0_bench/run_infer.py \
COMMAND="poetry run python evaluation/benchmarks/commit0_bench/run_infer.py \
--agent-cls $AGENT \
--llm-config $MODEL_CONFIG \
--max-iterations $MAX_ITER \
Expand Down