
Fix issue #5222: [Refactor]: Refactor the evaluation directory #5223

Merged: 11 commits, Nov 25, 2024
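The change itself is almost entirely mechanical: each benchmark moves from `evaluation/<name>` to `evaluation/benchmarks/<name>`, and every reference in workflows, docs, shell scripts, and Python imports is updated to match. A rough sketch of how such a migration could be scripted, shown for a single benchmark (illustrative only, not the commands actually used in this PR):

```bash
# Illustrative sketch, not the actual migration commands from this PR.
# Move one benchmark under evaluation/benchmarks/, then rewrite references.
mkdir -p evaluation/benchmarks
git mv evaluation/swe_bench evaluation/benchmarks/swe_bench  # repeat per benchmark

# Update shell/doc/workflow paths and Python imports in tracked files.
git grep -lE 'evaluation[/.]swe_bench' \
  | xargs sed -i \
      -e 's#evaluation/swe_bench#evaluation/benchmarks/swe_bench#g' \
      -e 's#evaluation\.swe_bench#evaluation.benchmarks.swe_bench#g'
```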
6 changes: 3 additions & 3 deletions .github/workflows/eval-runner.yml
@@ -86,12 +86,12 @@ jobs:
EVAL_DOCKER_IMAGE_PREFIX: us-central1-docker.pkg.dev/evaluation-092424/swe-bench-images

run: |
poetry run ./evaluation/swe_bench/scripts/run_infer.sh llm.eval HEAD CodeActAgent 300 30 $N_PROCESSES "princeton-nlp/SWE-bench_Lite" test
poetry run ./evaluation/benchmarks/swe_bench/scripts/run_infer.sh llm.eval HEAD CodeActAgent 300 30 $N_PROCESSES "princeton-nlp/SWE-bench_Lite" test
OUTPUT_FOLDER=$(find evaluation/evaluation_outputs/outputs/princeton-nlp__SWE-bench_Lite-test/CodeActAgent -name "deepseek-chat_maxiter_50_N_*-no-hint-run_1" -type d | head -n 1)
echo "OUTPUT_FOLDER for SWE-bench evaluation: $OUTPUT_FOLDER"
poetry run ./evaluation/swe_bench/scripts/eval_infer_remote.sh $OUTPUT_FOLDER/output.jsonl $N_PROCESSES "princeton-nlp/SWE-bench_Lite" test
poetry run ./evaluation/benchmarks/swe_bench/scripts/eval_infer_remote.sh $OUTPUT_FOLDER/output.jsonl $N_PROCESSES "princeton-nlp/SWE-bench_Lite" test

poetry run ./evaluation/swe_bench/scripts/eval/summarize_outputs.py $OUTPUT_FOLDER/output.jsonl > summarize_outputs.log 2>&1
poetry run ./evaluation/benchmarks/swe_bench/scripts/eval/summarize_outputs.py $OUTPUT_FOLDER/output.jsonl > summarize_outputs.log 2>&1
echo "SWEBENCH_REPORT<<EOF" >> $GITHUB_ENV
cat summarize_outputs.log >> $GITHUB_ENV
echo "EOF" >> $GITHUB_ENV
@@ -76,7 +76,7 @@ The `run_controller()` function is the core of OpenHands's execution. It manages

## Easiest way to get started: Exploring Existing Benchmarks

We encourage you to review the various evaluation benchmarks available in the [`evaluation/` directory](https://github.com/All-Hands-AI/OpenHands/blob/main/evaluation) of our repository.
We encourage you to review the various evaluation benchmarks available in the [`evaluation/benchmarks/` directory](https://github.com/All-Hands-AI/OpenHands/blob/main/evaluation/benchmarks) of our repository.

To integrate your own benchmark, we suggest starting with the one that most closely resembles your needs. This approach can significantly streamline your integration process, allowing you to build upon existing structures and adapt them to your specific requirements.

@@ -73,7 +73,7 @@ The main entry point of OpenHands is in `openhands/core/main.py`. Here is how it works

## Easiest way to get started: Exploring Existing Benchmarks

We encourage you to review the various evaluation benchmarks available in the [`evaluation/` directory](https://github.com/All-Hands-AI/OpenHands/blob/main/evaluation) of our repository.
We encourage you to review the various evaluation benchmarks available in the [`evaluation/benchmarks/` directory](https://github.com/All-Hands-AI/OpenHands/blob/main/evaluation/benchmarks) of our repository.

To integrate your own benchmark, we suggest starting with the one that most closely resembles your needs. This approach can significantly streamline your integration process, allowing you to build upon existing structures and adapt them to your specific requirements.

2 changes: 1 addition & 1 deletion docs/modules/usage/how-to/evaluation-harness.md
@@ -73,7 +73,7 @@ The `run_controller()` function is the core of OpenHands's execution. It manages

## Easiest way to get started: Exploring Existing Benchmarks

We encourage you to review the various evaluation benchmarks available in the [`evaluation/` directory](https://github.com/All-Hands-AI/OpenHands/blob/main/evaluation) of our repository.
We encourage you to review the various evaluation benchmarks available in the [`evaluation/benchmarks/` directory](https://github.com/All-Hands-AI/OpenHands/blob/main/evaluation/benchmarks) of our repository.

To integrate your own benchmark, we suggest starting with the one that most closely resembles your needs. This approach can significantly streamline your integration process, allowing you to build upon existing structures and adapt them to your specific requirements.

38 changes: 21 additions & 17 deletions evaluation/README.md
@@ -46,28 +46,32 @@ The OpenHands evaluation harness supports a wide variety of benchmarks across so

### Software Engineering

- SWE-Bench: [`evaluation/swe_bench`](./swe_bench)
- HumanEvalFix: [`evaluation/humanevalfix`](./humanevalfix)
- BIRD: [`evaluation/bird`](./bird)
- BioCoder: [`evaluation/ml_bench`](./ml_bench)
- ML-Bench: [`evaluation/ml_bench`](./ml_bench)
- APIBench: [`evaluation/gorilla`](./gorilla/)
- ToolQA: [`evaluation/toolqa`](./toolqa/)
- AiderBench: [`evaluation/aider_bench`](./aider_bench/)
- SWE-Bench: [`evaluation/benchmarks/swe_bench`](./benchmarks/swe_bench)
- HumanEvalFix: [`evaluation/benchmarks/humanevalfix`](./benchmarks/humanevalfix)
- BIRD: [`evaluation/benchmarks/bird`](./benchmarks/bird)
- BioCoder: [`evaluation/benchmarks/ml_bench`](./benchmarks/ml_bench)
- ML-Bench: [`evaluation/benchmarks/ml_bench`](./benchmarks/ml_bench)
- APIBench: [`evaluation/benchmarks/gorilla`](./benchmarks/gorilla/)
- ToolQA: [`evaluation/benchmarks/toolqa`](./benchmarks/toolqa/)
- AiderBench: [`evaluation/benchmarks/aider_bench`](./benchmarks/aider_bench/)
- Commit0: [`evaluation/benchmarks/commit0_bench`](./benchmarks/commit0_bench/)
- DiscoveryBench: [`evaluation/benchmarks/discoverybench`](./benchmarks/discoverybench/)

### Web Browsing

- WebArena: [`evaluation/webarena`](./webarena/)
- MiniWob++: [`evaluation/miniwob`](./miniwob/)
- WebArena: [`evaluation/benchmarks/webarena`](./benchmarks/webarena/)
- MiniWob++: [`evaluation/benchmarks/miniwob`](./benchmarks/miniwob/)
- Browsing Delegation: [`evaluation/benchmarks/browsing_delegation`](./benchmarks/browsing_delegation/)

### Misc. Assistance

- GAIA: [`evaluation/gaia`](./gaia)
- GPQA: [`evaluation/gpqa`](./gpqa)
- AgentBench: [`evaluation/agent_bench`](./agent_bench)
- MINT: [`evaluation/mint`](./mint)
- Entity deduction Arena (EDA): [`evaluation/EDA`](./EDA)
- ProofWriter: [`evaluation/logic_reasoning`](./logic_reasoning)
- GAIA: [`evaluation/benchmarks/gaia`](./benchmarks/gaia)
- GPQA: [`evaluation/benchmarks/gpqa`](./benchmarks/gpqa)
- AgentBench: [`evaluation/benchmarks/agent_bench`](./benchmarks/agent_bench)
- MINT: [`evaluation/benchmarks/mint`](./benchmarks/mint)
- Entity deduction Arena (EDA): [`evaluation/benchmarks/EDA`](./benchmarks/EDA)
- ProofWriter: [`evaluation/benchmarks/logic_reasoning`](./benchmarks/logic_reasoning)
- ScienceAgentBench: [`evaluation/benchmarks/scienceagentbench`](./benchmarks/scienceagentbench)

## Result Visualization

@@ -79,7 +83,7 @@ You can start your own fork of [our huggingface evaluation outputs](https://hugg

To learn more about how to integrate your benchmark into OpenHands, check out the [tutorial here](https://docs.all-hands.dev/modules/usage/how-to/evaluation-harness). Briefly,

- Each subfolder contains a specific benchmark or experiment. For example, `evaluation/swe_bench` should contain
- Each subfolder contains a specific benchmark or experiment. For example, `evaluation/benchmarks/swe_bench` should contain
all the preprocessing/evaluation/analysis scripts.
- Raw data and experimental records should not be stored within this repo.
- Model outputs should be stored at [this Hugging Face space](https://huggingface.co/spaces/OpenHands/evaluation) for visualization.
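As a rough illustration of the layout this implies for a new benchmark (the file names below are hypothetical; only the `run_infer.py` plus `scripts/` convention is taken from the benchmarks in this diff):

```bash
# Hypothetical benchmark layout following the convention above.
ls -R evaluation/benchmarks/my_bench
# README.md                     - setup and run instructions
# run_infer.py                  - inference entry point invoked by the wrapper
# helper.py                     - prompts, fake responses, result comparison
# scripts/run_infer.sh          - sets PYTHONPATH and runs run_infer.py via poetry
# scripts/summarize_results.py  - post-hoc analysis of output.jsonl
```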
@@ -12,7 +12,7 @@ Please follow the instructions [here](../README.md#setup) to set up your local develop

```bash
export OPENAI_API_KEY="sk-XXX"; # This is required for evaluation (to simulate another party of conversation)
./evaluation/EDA/scripts/run_infer.sh [model_config] [git-version] [agent] [dataset] [eval_limit]
./evaluation/benchmarks/EDA/scripts/run_infer.sh [model_config] [git-version] [agent] [dataset] [eval_limit]
```

where `model_config` is mandatory, while `git-version`, `agent`, `dataset` and `eval_limit` are optional.
@@ -33,7 +33,7 @@ to `CodeActAgent`.
For example,

```bash
./evaluation/EDA/scripts/run_infer.sh eval_gpt4o_2024_05_13 0.6.2 CodeActAgent things
./evaluation/benchmarks/EDA/scripts/run_infer.sh eval_gpt4o_2024_05_13 0.6.2 CodeActAgent things
```

## Reference
File renamed without changes.
@@ -4,7 +4,7 @@
import pandas as pd
from datasets import load_dataset

from evaluation.EDA.game import Q20Game, Q20GameCelebrity
from evaluation.benchmarks.EDA.game import Q20Game, Q20GameCelebrity
from evaluation.utils.shared import (
EvalMetadata,
EvalOutput,
@@ -43,7 +43,7 @@ echo "AGENT_VERSION: $AGENT_VERSION"
echo "MODEL_CONFIG: $MODEL_CONFIG"
echo "DATASET: $DATASET"

COMMAND="poetry run python evaluation/EDA/run_infer.py \
COMMAND="poetry run python evaluation/benchmarks/EDA/run_infer.py \
--agent-cls $AGENT \
--llm-config $MODEL_CONFIG \
--dataset $DATASET \
@@ -9,7 +9,7 @@ Please follow the instructions [here](../README.md#setup) to set up your local develop
## Start the evaluation

```bash
./evaluation/agent_bench/scripts/run_infer.sh [model_config] [git-version] [agent] [eval_limit]
./evaluation/benchmarks/agent_bench/scripts/run_infer.sh [model_config] [git-version] [agent] [eval_limit]
```

- `model_config`, e.g. `eval_gpt4_1106_preview`, is the config group name for your
@@ -25,7 +25,7 @@ in order to use `eval_limit`, you must also set `agent`.

Following is the basic command to start the evaluation.

You can update the arguments in the script `evaluation/agent_bench/scripts/run_infer.sh`, such as `--max-iterations`, `--eval-num-workers` and so on.
You can update the arguments in the script `evaluation/benchmarks/agent_bench/scripts/run_infer.sh`, such as `--max-iterations`, `--eval-num-workers` and so on.

- `--agent-cls`, the agent to use. For example, `CodeActAgent`.
- `--llm-config`: the LLM configuration to use. For example, `eval_gpt4_1106_preview`.
@@ -34,5 +34,5 @@ You can update the arguments in the script `evaluation/agent_bench/scripts/run_infer.s
- `--eval-n-limit`: the number of examples to evaluate. For example, `100`.

```bash
./evaluation/agent_bench/scripts/run_infer.sh eval_gpt35_turbo HEAD CodeActAgent 1
./evaluation/benchmarks/agent_bench/scripts/run_infer.sh eval_gpt35_turbo HEAD CodeActAgent 1
```
@@ -7,7 +7,7 @@
import pandas as pd
from datasets import load_dataset

from evaluation.agent_bench.helper import (
from evaluation.benchmarks.agent_bench.helper import (
FAKE_RESPONSES,
INST_SUFFIXES,
compare_results,
@@ -26,7 +26,7 @@ echo "AGENT: $AGENT"
echo "AGENT_VERSION: $AGENT_VERSION"
echo "MODEL_CONFIG: $MODEL_CONFIG"

COMMAND="export PYTHONPATH=evaluation/agent_bench:\$PYTHONPATH && poetry run python evaluation/agent_bench/run_infer.py \
COMMAND="export PYTHONPATH=evaluation/benchmarks/agent_bench:\$PYTHONPATH && poetry run python evaluation/benchmarks/agent_bench/run_infer.py \
--agent-cls $AGENT \
--llm-config $MODEL_CONFIG \
--max-iterations 30 \
@@ -16,7 +16,7 @@ development environment and LLM.
## Start the evaluation

```bash
./evaluation/aider_bench/scripts/run_infer.sh [model_config] [git-version] [agent] [eval_limit] [eval-num-workers] [eval_ids]
./evaluation/benchmarks/aider_bench/scripts/run_infer.sh [model_config] [git-version] [agent] [eval_limit] [eval-num-workers] [eval_ids]
```

- `model_config`, e.g. `eval_gpt4_1106_preview`, is the config group name for
@@ -42,7 +42,7 @@ export SKIP_NUM=12 # skip the first 12 instances from the dataset
Following is the basic command to start the evaluation.

You can update the arguments in the script
`evaluation/aider_bench/scripts/run_infer.sh`, such as `--max-iterations`,
`evaluation/benchmarks/aider_bench/scripts/run_infer.sh`, such as `--max-iterations`,
`--eval-num-workers` and so on:

- `--agent-cls`, the agent to use. For example, `CodeActAgent`.
@@ -53,33 +53,33 @@ You can update the arguments in the script
- `--eval-ids`: the IDs of the examples to evaluate (comma separated). For example, `"1,3,10"`.

```bash
./evaluation/aider_bench/scripts/run_infer.sh eval_gpt35_turbo HEAD CodeActAgent 100 1 "1,3,10"
./evaluation/benchmarks/aider_bench/scripts/run_infer.sh eval_gpt35_turbo HEAD CodeActAgent 100 1 "1,3,10"
```

### Run Inference on `RemoteRuntime` (experimental)

This is in limited beta. Contact Xingyao over Slack if you want to try this out!

```bash
./evaluation/aider_bench/scripts/run_infer.sh [model_config] [git-version] [agent] [eval_limit] [eval-num-workers] [eval_ids]
./evaluation/benchmarks/aider_bench/scripts/run_infer.sh [model_config] [git-version] [agent] [eval_limit] [eval-num-workers] [eval_ids]

# Example - This runs evaluation on CodeActAgent for 133 instances on the aider_bench test set, with 2 workers running in parallel
export ALLHANDS_API_KEY="YOUR-API-KEY"
export RUNTIME=remote
export SANDBOX_REMOTE_RUNTIME_API_URL="https://runtime.eval.all-hands.dev"
./evaluation/aider_bench/scripts/run_infer.sh llm.eval HEAD CodeActAgent 133 2
./evaluation/benchmarks/aider_bench/scripts/run_infer.sh llm.eval HEAD CodeActAgent 133 2
```

## Summarize Results

```bash
poetry run python ./evaluation/aider_bench/scripts/summarize_results.py [path_to_output_jsonl_file]
poetry run python ./evaluation/benchmarks/aider_bench/scripts/summarize_results.py [path_to_output_jsonl_file]
```

Full example:

```bash
poetry run python ./evaluation/aider_bench/scripts/summarize_results.py evaluation/evaluation_outputs/outputs/AiderBench/CodeActAgent/claude-3-5-sonnet@20240620_maxiter_30_N_v1.9/output.jsonl
poetry run python ./evaluation/benchmarks/aider_bench/scripts/summarize_results.py evaluation/evaluation_outputs/outputs/AiderBench/CodeActAgent/claude-3-5-sonnet@20240620_maxiter_30_N_v1.9/output.jsonl
```

This will list the instances that passed and the instances that failed. For each
@@ -7,7 +7,7 @@
import pandas as pd
from datasets import load_dataset

from evaluation.aider_bench.helper import (
from evaluation.benchmarks.aider_bench.helper import (
FAKE_RESPONSES,
INST_SUFFIXES,
INSTRUCTIONS_ADDENDUM,
@@ -39,7 +39,7 @@ if [ "$USE_UNIT_TESTS" = true ]; then
EVAL_NOTE=$EVAL_NOTE-w-test
fi

COMMAND="export PYTHONPATH=evaluation/aider_bench:\$PYTHONPATH && poetry run python evaluation/aider_bench/run_infer.py \
COMMAND="export PYTHONPATH=evaluation/benchmarks/aider_bench:\$PYTHONPATH && poetry run python evaluation/benchmarks/aider_bench/run_infer.py \
--agent-cls $AGENT \
--llm-config $MODEL_CONFIG \
--max-iterations 30 \
@@ -21,7 +21,7 @@ To reproduce this image, please see the Dockerfile_Openopenhands in the `biocode


```bash
./evaluation/biocoder/scripts/run_infer.sh [model_config] [git-version] [agent] [eval_limit]
./evaluation/benchmarks/biocoder/scripts/run_infer.sh [model_config] [git-version] [agent] [eval_limit]
```

where `model_config` is mandatory, while `git-version`, `agent`, `dataset` and `eval_limit` are optional.
@@ -43,7 +43,7 @@ with the current OpenHands version, then your command would be:
## Examples

```bash
./evaluation/biocoder/scripts/run_infer.sh eval_gpt4o_2024_05_13 HEAD CodeActAgent 1
./evaluation/benchmarks/biocoder/scripts/run_infer.sh eval_gpt4o_2024_05_13 HEAD CodeActAgent 1
```

## Reference
@@ -8,7 +8,7 @@
import pandas as pd
from datasets import load_dataset

from evaluation.biocoder.utils import BiocoderData
from evaluation.benchmarks.biocoder.utils import BiocoderData
from evaluation.utils.shared import (
EvalMetadata,
EvalOutput,
@@ -28,7 +28,7 @@ echo "AGENT_VERSION: $AGENT_VERSION"
echo "MODEL_CONFIG: $MODEL_CONFIG"
echo "DATASET: $DATASET"

COMMAND="poetry run python evaluation/biocoder/run_infer.py \
COMMAND="poetry run python evaluation/benchmarks/biocoder/run_infer.py \
--agent-cls $AGENT \
--llm-config $MODEL_CONFIG \
--max-iterations 10 \
File renamed without changes.
@@ -9,7 +9,7 @@ Please follow the instructions [here](../README.md#setup) to set up your local develop
## Run Inference on Bird

```bash
./evaluation/bird/scripts/run_infer.sh [model_config] [git-version]
./evaluation/benchmarks/bird/scripts/run_infer.sh [model_config] [git-version]
```

- `model_config`, e.g. `eval_gpt4_1106_preview`, is the config group name for your
File renamed without changes.
File renamed without changes.
@@ -26,7 +26,7 @@ echo "AGENT: $AGENT"
echo "AGENT_VERSION: $AGENT_VERSION"
echo "MODEL_CONFIG: $MODEL_CONFIG"

COMMAND="poetry run python evaluation/bird/run_infer.py \
COMMAND="poetry run python evaluation/benchmarks/bird/run_infer.py \
--agent-cls $AGENT \
--llm-config $MODEL_CONFIG \
--max-iterations 5 \
@@ -12,7 +12,7 @@ Please follow the instructions [here](../README.md#setup) to set up your local develop
## Run Inference

```bash
./evaluation/browsing_delegation/scripts/run_infer.sh [model_config] [git-version] [agent] [eval_limit]
./evaluation/benchmarks/browsing_delegation/scripts/run_infer.sh [model_config] [git-version] [agent] [eval_limit]
# e.g., ./evaluation/swe_bench/scripts/run_infer.sh llm.eval_gpt4_1106_preview_llm HEAD CodeActAgent 300
```

@@ -28,7 +28,7 @@ echo "MODEL_CONFIG: $MODEL_CONFIG"

EVAL_NOTE="$AGENT_VERSION"

COMMAND="poetry run python evaluation/browsing_delegation/run_infer.py \
COMMAND="poetry run python evaluation/benchmarks/browsing_delegation/run_infer.py \
--agent-cls $AGENT \
--llm-config $MODEL_CONFIG \
--max-iterations 1 \
@@ -24,10 +24,10 @@ Make sure your Docker daemon is running, and you have ample disk space (at least
When the `run_infer.sh` script is started, it will automatically pull the `lite` split in Commit0. For example, for instance ID `commit-0/minitorch`, it will try to pull our pre-built Docker image `wentingzhao/minitorch` from DockerHub. This image will be used to create an OpenHands runtime image in which the agent will operate.
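A minimal sketch of that instance-to-image mapping (it assumes the `wentingzhao` DockerHub namespace shown above; the snippet is illustrative, not the script's actual code):

```bash
# Illustrative only: derive and pre-pull the image for a Commit0 instance,
# mirroring what run_infer.sh does automatically.
INSTANCE_ID="commit-0/minitorch"        # example instance from the lite split
REPO_NAME="${INSTANCE_ID##*/}"          # -> "minitorch"
docker pull "wentingzhao/${REPO_NAME}"  # -> wentingzhao/minitorch
```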

```bash
./evaluation/commit0_bench/scripts/run_infer.sh [repo_split] [model_config] [git-version] [agent] [eval_limit] [max_iter] [num_workers] [dataset] [dataset_split]
./evaluation/benchmarks/commit0_bench/scripts/run_infer.sh [repo_split] [model_config] [git-version] [agent] [eval_limit] [max_iter] [num_workers] [dataset] [dataset_split]

# Example
./evaluation/commit0_bench/scripts/run_infer.sh lite llm.eval_sonnet HEAD CodeActAgent 16 100 8 wentingzhao/commit0_combined test
./evaluation/benchmarks/commit0_bench/scripts/run_infer.sh lite llm.eval_sonnet HEAD CodeActAgent 16 100 8 wentingzhao/commit0_combined test
```

where `model_config` is mandatory, and the rest are optional.
@@ -56,25 +56,25 @@ Let's say you'd like to run 10 instances using `llm.eval_sonnet` and CodeActAgen
then your command would be:

```bash
./evaluation/commit0_bench/scripts/run_infer.sh lite llm.eval_sonnet HEAD CodeActAgent 10 30 1 wentingzhao/commit0_combined test
./evaluation/benchmarks/commit0_bench/scripts/run_infer.sh lite llm.eval_sonnet HEAD CodeActAgent 10 30 1 wentingzhao/commit0_combined test
```

### Run Inference on `RemoteRuntime` (experimental)

This is in limited beta. Contact Xingyao over Slack if you want to try this out!

```bash
./evaluation/commit0_bench/scripts/run_infer.sh [repo_split] [model_config] [git-version] [agent] [eval_limit] [max_iter] [num_workers] [dataset] [dataset_split]
./evaluation/benchmarks/commit0_bench/scripts/run_infer.sh [repo_split] [model_config] [git-version] [agent] [eval_limit] [max_iter] [num_workers] [dataset] [dataset_split]

# Example - This runs evaluation on CodeActAgent for 10 instances on the "wentingzhao/commit0_combined" test set, with a max of 30 iterations per instance and 1 worker running in parallel
ALLHANDS_API_KEY="YOUR-API-KEY" RUNTIME=remote SANDBOX_REMOTE_RUNTIME_API_URL="https://runtime.eval.all-hands.dev" EVAL_DOCKER_IMAGE_PREFIX="docker.io/wentingzhao" \
./evaluation/commit0_bench/scripts/run_infer.sh lite llm.eval_sonnet HEAD CodeActAgent 10 30 1 wentingzhao/commit0_combined test
./evaluation/benchmarks/commit0_bench/scripts/run_infer.sh lite llm.eval_sonnet HEAD CodeActAgent 10 30 1 wentingzhao/commit0_combined test
```

To clean up all existing runtimes you've already started, run:

```bash
ALLHANDS_API_KEY="YOUR-API-KEY" ./evaluation/commit0_bench/scripts/cleanup_remote_runtime.sh
ALLHANDS_API_KEY="YOUR-API-KEY" ./evaluation/benchmarks/commit0_bench/scripts/cleanup_remote_runtime.sh
```

### Specify a subset of tasks to run infer
@@ -91,7 +91,7 @@ fi

function run_eval() {
local eval_note=$1
COMMAND="poetry run python evaluation/commit0_bench/run_infer.py \
COMMAND="poetry run python evaluation/benchmarks/commit0_bench/run_infer.py \
--agent-cls $AGENT \
--llm-config $MODEL_CONFIG \
--max-iterations $MAX_ITER \
Expand Down