Release 2.0.3
Release 2.0.3
john-b-yang committed Jul 2, 2024
1 parent 8198707 commit 4498af9
Showing 17 changed files with 33 additions and 32 deletions.
Empty file removed: .gitmodules
21 changes: 21 additions & 0 deletions CHANGELOG.md
@@ -4,6 +4,27 @@ All notable changes to the PyPI package for SWE-bench ([`swebench`](https://pypi

Prior to version 1.1.0, not all deployed versions are listed, as the PyPI package was still under development and testing; only the noteworthy versions and the changes each introduced are included. All versions from 1.1.0 onwards are fully listed.

## [2.0.3] - 7/2/2024
* #149 Interface fix: `run_id` is required
* #151 Fix: support JSON datasets (avoid loading JSON twice)
* #152 Add a very simple CI
* #153 Various nitpicks
* #155 Fix link to collection tutorial
* #161 Fix path to image in docs
* #162 Fix evaluation hanging issue and improve patch application
* #164 Fix crash when there are no environment images to build
* #166 Fix newline outputs for Django's log parser
* #168 Update reporting and skip empty model patch predictions
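
As a hedged illustration of #151, the sketch below loads benchmark tasks from a local JSON file with the `datasets` library instead of from the Hugging Face Hub. This is a minimal sketch, not the harness's actual loading code, and the file name is hypothetical.

```python
# Minimal sketch for #151: read SWE-bench-style tasks from a local JSON file
# in a single pass. "swe_bench_tasks.json" is a hypothetical file name.
from datasets import load_dataset

tasks = load_dataset("json", data_files="swe_bench_tasks.json", split="train")
print(f"Loaded {len(tasks)} task instances")
```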

## [2.0.0] - 6/27/2024
Major release: the SWE-bench evaluation harness has been upgraded to run evaluations in containerized, sandboxed execution environments based on Docker. This results in several changes to the API:
* Removal of the `swebench.metrics` module
* Updates to the API of `swebench.harness` functionality
* Significant modifications to the underlying evaluation logic
* Minor updates to installation specifications for different repositories and versions

Read the full report [here](https://github.com/princeton-nlp/SWE-bench/tree/main/docs/20240627_docker).
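
For context, a typical invocation of the new harness looks roughly like the sketch below, driven from Python via `subprocess`. The flag names follow the 2.x CLI as we understand it, but the dataset name, predictions path, and run id are placeholders; check `run_evaluation.py` for the authoritative interface.

```python
# Hedged sketch: invoking the containerized evaluation harness.
# Paths and run_id are placeholders; --run_id is required as of #149.
import subprocess

subprocess.run(
    [
        "python", "-m", "swebench.harness.run_evaluation",
        "--dataset_name", "princeton-nlp/SWE-bench_Lite",
        "--predictions_path", "preds.json",  # placeholder predictions file
        "--max_workers", "4",
        "--run_id", "demo_run",
    ],
    check=True,
)
```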

## [1.1.5] - 5/15/2024
* Add support for HumanEvalFix (Python, JS, Go, Java) ([source](https://huggingface.co/datasets/bigcode/humanevalpack))

4 changes: 2 additions & 2 deletions README.md
@@ -1,6 +1,6 @@
<p align="center">
<a href="https://github.com/princeton-nlp/Llamao">
<img src="assets/swellama_banner.png" width="50%" alt="Kawi the SWE-Llama" />
<img src="assets/figures/swellama_banner.png" width="50%" alt="Kawi the SWE-Llama" />
</a>
</p>

@@ -39,7 +39,7 @@ Please refer to our [website](http://swe-bench.github.io) for the public leaderboard
SWE-bench is a benchmark for evaluating large language models on real-world software issues collected from GitHub.
Given a *codebase* and an *issue*, a language model is tasked with generating a *patch* that resolves the described problem.

<img src="assets/teaser.png">
<img src="assets/figures/teaser.png">

To access SWE-bench, copy and run the following code:
```python
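# The diff view truncates this block; the published README loads the dataset
# from the Hugging Face Hub roughly as follows (a likely completion, not
# verbatim from this commit):
from datasets import load_dataset
swebench = load_dataset('princeton-nlp/SWE-bench', split='test')
```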
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
4 changes: 2 additions & 2 deletions docs/README_CN.md
@@ -1,6 +1,6 @@
<p align="center">
<a href="https://github.com/princeton-nlp/Llamao">
<img src="assets/swellama_banner.png" width="50%" alt="Kawi the SWE-Llama" />
<img src="assets/figures/swellama_banner.png" width="50%" alt="Kawi the SWE-Llama" />
</a>
</p>

@@ -33,7 +33,7 @@
SWE-bench is a benchmark for evaluating large language models on real-world software issues collected from GitHub.
Given a *codebase* and an *issue*, a language model is tasked with generating a *patch* that resolves the described problem.

<img src="assets/teaser.png">
<img src="assets/figures/teaser.png">

## 🚀 Setup
To build SWE-bench from source, follow these steps:
4 changes: 2 additions & 2 deletions docs/README_JP.md
@@ -4,7 +4,7 @@
<p align="center">
<a href="https://github.com/princeton-nlp/Llamao">
<img src="https://raw.githubusercontent.com/Sunwood-ai-labs/SWE-bench/main/assets/swellama_banner.png" width="50%" alt="Kawi the SWE-Llama" />
<img src="https://raw.githubusercontent.com/Sunwood-ai-labs/SWE-bench/main/assets/figures/swellama_banner.png" width="50%" alt="Kawi the SWE-Llama" />
</a>
</p>

@@ -34,7 +34,7 @@ The ICLR 2024 paper <a href="http://swe-bench.github.io/paper.pdf">SWE-bench: Ca
SWE-bench is a benchmark for evaluating large language models on real-world software issues collected from GitHub.
Given a *codebase* and an *issue*, a language model is tasked with generating a *patch* that resolves the described problem.

<img src="https://raw.githubusercontent.com/Sunwood-ai-labs/SWE-bench/main/assets/teaser.png">
<img src="https://raw.githubusercontent.com/Sunwood-ai-labs/SWE-bench/main/assets/figures/teaser.png">

## 🚀 Setup
To build SWE-bench from source, follow these steps:
4 changes: 2 additions & 2 deletions docs/README_TW.md
@@ -1,6 +1,6 @@
<p align="center">
<a href="https://github.com/princeton-nlp/Llamao">
<img src="assets/swellama_banner.png" width="50%" alt="Kawi the SWE-Llama" />
<img src="assets/figures/swellama_banner.png" width="50%" alt="Kawi the SWE-Llama" />
</a>
</p>

@@ -33,7 +33,7 @@
SWE-bench is a benchmark for evaluating large language models on real-world software issues collected from GitHub.
Given a *codebase* and an *issue*, a language model is tasked with generating a *patch* that resolves the described problem.

<img src="assets/teaser.png">
<img src="assets/figures/teaser.png">

## 🚀 Setup
To build SWE-bench from source, follow these steps:
20 changes: 0 additions & 20 deletions environment.yml

This file was deleted.

2 changes: 1 addition & 1 deletion swebench/__init__.py
@@ -1,4 +1,4 @@
__version__ = "2.0.2"
__version__ = "2.0.3"

from swebench.collect.build_dataset import main as build_dataset
from swebench.collect.get_tasks_pipeline import main as get_tasks_pipeline
2 changes: 1 addition & 1 deletion swebench/collect/README.md
@@ -5,7 +5,7 @@ We include a comprehensive [tutorial](https://github.com/princeton-nlp/SWE-bench

> SWE-bench's collection pipeline is currently designed to target PyPI packages. We hope to expand SWE-bench to more repositories and languages in the future.
<img src="../../assets/collection.png">
<img src="../../assets/figures/collection.png">

## Collection Procedure
To run collection on your own repositories, run the `run_get_tasks_pipeline.sh` script. Given a repository or list of repositories (formatted as `owner/name`), for each repository this command will generate...
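
A hedged sketch of that invocation, driven from Python, is below. The repository name and output directories are placeholders, and the flag names are assumptions to be checked against the script and `get_tasks_pipeline.py` themselves.

```python
# Hedged sketch: run the collection pipeline for one repository.
# The repo uses the owner/name format described above; flag names and
# output paths are assumptions, not verified against the script.
import subprocess

subprocess.run(
    [
        "python", "swebench/collect/get_tasks_pipeline.py",
        "--repos", "scikit-learn/scikit-learn",  # owner/name format
        "--path_prs", "data/prs",      # hypothetical output dir for PR data
        "--path_tasks", "data/tasks",  # hypothetical output dir for tasks
    ],
    check=True,
)
```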
2 changes: 1 addition & 1 deletion swebench/collect/collection.md
@@ -6,7 +6,7 @@ In this tutorial, we explain how to use the SWE-Bench repository to collect eval
> SWE-bench's collection pipeline is currently designed to target PyPI packages. We hope to expand SWE-bench to more repositories and languages in the future.
<div align="center">
<img style="width:70%" src="../assets/collection.png">
<img style="width:70%" src="../assets/figures/collection.png">
</div>

## 🔍 Selecting a Repository
2 changes: 1 addition & 1 deletion swebench/harness/evaluation.md
@@ -33,7 +33,7 @@ python run_evaluation.py \
Additional arguments are defined in `run_evaluation.py`. The following diagram captures, at a high level, what `run_evaluation.py` does. More details are provided in `harness/` and the Appendix of the main paper.

<div align="center">
<img style="width:70%" src="../../assets/evaluation.png">
<img style="width:70%" src="../../assets/figures/evaluation.png">
</div>
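
Since the exact flags live in the script itself, a safe way to enumerate them is to ask the script directly; a minimal sketch, assuming the script exposes a standard argparse `--help`:

```python
# Minimal sketch: print run_evaluation.py's argument list instead of guessing it.
import subprocess

subprocess.run(["python", "run_evaluation.py", "--help"], check=True)
```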

## 📈 Metrics
