Minor doc updates
john-b-yang committed Jul 29, 2024
1 parent 9802a2c commit c2b3cef
Showing 8 changed files with 38 additions and 100 deletions.
4 changes: 2 additions & 2 deletions README.md
@@ -89,7 +89,7 @@ python -m swebench.harness.run_evaluation \
# use --run_id to name the evaluation run
```

- This command will generate docker build logs (`build_image_logs`) and evaluation logs (`run_instance_logs`) in the current directory.
+ This command will generate docker build logs (`logs/build_images`) and evaluation logs (`logs/run_evaluation`) in the current directory.

The final evaluation results will be stored in the `evaluation_results` directory.

@@ -116,7 +116,7 @@ Additionally, the SWE-Bench repo can help you:
## 🍎 Tutorials
We've also written the following blog posts on how to use different parts of SWE-bench.
If you'd like to see a post about a particular topic, please let us know via an issue.
- * [Nov 1. 2023] Collecting Evaluation Tasks for SWE-Bench ([🔗](https://github.com/princeton-nlp/SWE-bench/tree/main/swebench/collect/collection.md))
+ * [Nov 1. 2023] Collecting Evaluation Tasks for SWE-Bench ([🔗](https://github.com/princeton-nlp/SWE-bench/tree/main/swebench/assets/collection.md))
* [Nov 6. 2023] Evaluating on SWE-bench ([🔗](https://github.com/princeton-nlp/SWE-bench/tree/main/swebench/harness/evaluation.md))

## 💫 Contributions
File renamed without changes.
32 changes: 32 additions & 0 deletions assets/evaluation.md
@@ -0,0 +1,32 @@
# Evaluating with SWE-bench
John Yang • November 6, 2023

In this tutorial, we will explain how to evaluate models and methods using SWE-bench.

## 🤖 Creating Predictions
For each task instance in the SWE-bench dataset, given an issue (`problem_statement`) and a codebase (`repo` + `base_commit`), your model should generate a diff patch as its prediction. For full details on the SWE-bench task, please refer to Section 2 of the main paper.

Each prediction must be formatted as follows:
```json
{
    "instance_id": "<Unique task instance ID>",
    "model_patch": "<.patch file content string>",
    "model_name_or_path": "<Model name here (e.g. SWE-Llama-13b)>"
}
```

Store multiple predictions in a single `.json` file formatted as `[<prediction 1>, <prediction 2>, ..., <prediction n>]`. It is not necessary to generate predictions for every task instance.
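As a minimal sketch of this format (the instance ID, patch content, and model name below are placeholders, not values you should copy), predictions can be assembled and serialized like so:

```python
import json

# Assemble predictions for the instances your model attempted
# (a subset of the dataset is fine). All field values here are placeholders.
predictions = [
    {
        "instance_id": "astropy__astropy-12907",
        "model_patch": "diff --git a/file.py b/file.py\n...",
        "model_name_or_path": "my-model",
    },
]

# Store all predictions in one .json file: [<prediction 1>, ..., <prediction n>]
with open("predictions.json", "w") as f:
    json.dump(predictions, f, indent=2)
```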

If you'd like examples, the [swe-bench/experiments](https://github.com/swe-bench/experiments) GitHub repository contains many examples of well-formed patches.

## 🔄 Running Evaluation
Evaluate model predictions on SWE-bench Lite using the evaluation harness with the following command:
```bash
python -m swebench.harness.run_evaluation \
--dataset_name princeton-nlp/SWE-bench_Lite \
--predictions_path <path_to_predictions> \
--max_workers <num_workers> \
--run_id <run_id>
# use --predictions_path 'gold' to verify the gold patches
# use --run_id to name the evaluation run
```
2 changes: 1 addition & 1 deletion docs/README_CN.md
@@ -63,7 +63,7 @@ SWE-bench is a benchmark for evaluating large language models; these models are
## 🍎 Tutorials
We have also written blog posts on how to use different parts of SWE-bench.
If you would like to see a post about a particular topic, please let us know via an issue.
- * [Nov 1. 2023] Collecting Evaluation Tasks for SWE-Bench ([🔗](https://github.com/princeton-nlp/SWE-bench/tree/main/swebench/collect/collection.md.md))
+ * [Nov 1. 2023] Collecting Evaluation Tasks for SWE-Bench ([🔗](https://github.com/princeton-nlp/SWE-bench/tree/main/swebench/assets/collection.md))
* [Nov 6. 2023] Evaluating on SWE-bench ([🔗](https://github.com/princeton-nlp/SWE-bench/tree/main/swebench/harness/evaluation.md))

## 💫 Contributions
2 changes: 1 addition & 1 deletion docs/README_JP.md
@@ -65,7 +65,7 @@ With SWE-Bench, you can:
## 🍎 Tutorials
We have also written the following blog posts on how to use different parts of SWE-bench.
If you would like to see a post about a particular topic, please let us know via an issue.
- * [Nov 1, 2023] Collecting Evaluation Tasks for SWE-Bench ([🔗](https://github.com/princeton-nlp/SWE-bench/tree/main/swebench/collect/collection.md.md))
+ * [Nov 1, 2023] Collecting Evaluation Tasks for SWE-Bench ([🔗](https://github.com/princeton-nlp/SWE-bench/tree/main/swebench/assets/collection.md))
* [Nov 6, 2023] Evaluating on SWE-bench ([🔗](https://github.com/princeton-nlp/SWE-bench/tree/main/swebench/harness/evaluation.md))

## 💫 Contributions
2 changes: 1 addition & 1 deletion docs/README_TW.md
@@ -63,7 +63,7 @@ SWE-bench is a benchmark for evaluating large language models; these models are
## 🍎 Tutorials
We have also written the following blog posts on how to use different parts of SWE-bench.
If you would like to see a post about a particular topic, please let us know via an issue.
- * [Nov 1. 2023] Collecting Evaluation Tasks for SWE-Bench ([🔗](https://github.com/princeton-nlp/SWE-bench/tree/main/swebench/collect/collection.md.md))
+ * [Nov 1. 2023] Collecting Evaluation Tasks for SWE-Bench ([🔗](https://github.com/princeton-nlp/SWE-bench/tree/main/swebench/assets/collection.md))
* [Nov 6. 2023] Evaluating on SWE-bench ([🔗](https://github.com/princeton-nlp/SWE-bench/tree/main/swebench/harness/evaluation.md))

## 💫 Contributions
2 changes: 1 addition & 1 deletion swebench/collect/README.md
@@ -1,7 +1,7 @@
# Data Collection
This folder includes the code for the first two parts of the benchmark construction procedure described in the paper: (1) repo selection and data scraping, and (2) attribute-based filtering.

- We include a comprehensive [tutorial](https://github.com/princeton-nlp/SWE-bench/tree/main/swebench/collect/collection.md) that describes the end-to-end procedure for collecting evaluation task instances from PyPI repositories.
+ We include a comprehensive [tutorial](https://github.com/princeton-nlp/SWE-bench/tree/main/swebench/assets/collection.md) that describes the end-to-end procedure for collecting evaluation task instances from PyPI repositories.

> SWE-bench's collection pipeline is currently designed to target PyPI packages. We hope to expand SWE-bench to more repositories and languages in the future.
94 changes: 0 additions & 94 deletions swebench/harness/evaluation.md

This file was deleted.
