diff --git a/.gitmodules b/.gitmodules
deleted file mode 100644
index e69de29b..00000000
diff --git a/CHANGELOG.md b/CHANGELOG.md
index 224cfe7b..b50293c9 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -4,6 +4,27 @@ All notable changes to the PyPI package for SWE-bench ([`swebench`](https://pypi

 Prior to version 1.1.0, not all deployed versions are listed, as the PyPI package was going through development and testing. The noteworthy versions and the respective changes that were introduced by that version are included. All versions 1.1.0 onwards are fully listed.

+## [2.0.3] - 7/2/2024
+* #149 Interface fix: run_id is required
+* #151 Fix: Support JSON datasets (avoid loading json twice)
+* #152 Add very simple CI
+* #153 Various nitpicks
+* #155 Fix link to collection tutorial
+* #161 Fix path to image in docs
+* #162 Fix evaluation hanging issue and improve patch apply
+* #164 Fix so it doesn't crash when no env imgs to build
+* #166 Fix newline outputs for django's log parser
+* #168 Update reporting and skip empty model patch predictions
+
+## [2.0.0] - 6/27/2024
+Major release - the SWE-bench evaluation harness has been upgraded to incorporate containerized, sandboxed execution environments based on Docker. There are several changes to the API resulting from this:
+* Removal of the `swebench.metrics` module
+* Updates to the API of `swebench.harness` functionality
+* Significant modifications to underlying evaluation logic
+* Minor updates to installation specifications for different repos + versions.
+
+Read the full report [here](https://github.com/princeton-nlp/SWE-bench/tree/main/docs/20240627_docker)
+
 ## [1.1.5] - 5/15/2024
 * Add support for HumanEvalFix (Python, JS, Go, Java) ([source](https://huggingface.co/datasets/bigcode/humanevalpack))

diff --git a/README.md b/README.md
index 43851265..03ca1143 100644
--- a/README.md
+++ b/README.md
@@ -1,6 +1,6 @@
@@ -39,7 +39,7 @@ Please refer our [website](http://swe-bench.github.io) for the public leaderboar
 SWE-bench is a benchmark for evaluating large language models on real world software issues collected from GitHub.
 Given a *codebase* and an *issue*, a language model is tasked with generating a *patch* that resolves the described problem.

-<img src="assets/teaser.png">
+<img src="assets/figures/teaser.png">

 To access SWE-bench, copy and run the following code:
 ```python
diff --git a/build_deploy.sh b/assets/build_deploy.sh
similarity index 100%
rename from build_deploy.sh
rename to assets/build_deploy.sh
diff --git a/assets/collection.png b/assets/figures/collection.png
similarity index 100%
rename from assets/collection.png
rename to assets/figures/collection.png
diff --git a/assets/evaluation.png b/assets/figures/evaluation.png
similarity index 100%
rename from assets/evaluation.png
rename to assets/figures/evaluation.png
diff --git a/assets/swellama_banner.png b/assets/figures/swellama_banner.png
similarity index 100%
rename from assets/swellama_banner.png
rename to assets/figures/swellama_banner.png
diff --git a/assets/teaser.png b/assets/figures/teaser.png
similarity index 100%
rename from assets/teaser.png
rename to assets/figures/teaser.png
diff --git a/assets/validation.png b/assets/figures/validation.png
similarity index 100%
rename from assets/validation.png
rename to assets/figures/validation.png
diff --git a/docs/README_CN.md b/docs/README_CN.md
index 3bf77b65..65b00ad6 100644
--- a/docs/README_CN.md
+++ b/docs/README_CN.md
@@ -1,6 +1,6 @@
@@ -33,7 +33,7 @@
 SWE-bench 是一个用于评估大型语言模型的基准，这些模型是从 GitHub 收集的真实软件问题。
 给定一个 *代码库* 和一个 *问题*，语言模型的任务是生成一个 *补丁* 来解决所描述的问题。

-<img src="../assets/teaser.png">
+<img src="../assets/figures/teaser.png">

 ## 🚀 设置
 要从源代码构建 SWE-bench，请按照以下步骤操作：
diff --git a/docs/README_JP.md b/docs/README_JP.md
index 6bbdfde7..505b0456 100644
--- a/docs/README_JP.md
+++ b/docs/README_JP.md
@@ -4,7 +4,7 @@
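
The README hunk above ends at its opening ```python fence, so the access snippet it introduces is outside the diff. For reference, a minimal sketch of that snippet, assuming it follows the standard Hugging Face `datasets` loading pattern:

```python
# Sketch of the access snippet the README introduces; the fence body is
# assumed, since the hunk above ends at the opening fence.
from datasets import load_dataset

# Load the full SWE-bench test split from the Hugging Face Hub.
swebench = load_dataset("princeton-nlp/SWE-bench", split="test")
print(len(swebench), swebench[0]["instance_id"])
```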
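
The 2.0.0 entry rewrites `swebench.harness` around Docker-based execution, and 2.0.3's #149 makes `run_id` required. A minimal sketch of an evaluation run under that interface (module path and flag names are assumptions inferred from the changelog entries, not shown in this diff):

```python
import subprocess

# Sketch of invoking the 2.x Docker-based evaluation harness.
# Assumed interface: module path and flags are inferred from the changelog
# above (#149: run_id is required), not confirmed by this diff.
subprocess.run(
    [
        "python", "-m", "swebench.harness.run_evaluation",
        "--dataset_name", "princeton-nlp/SWE-bench_Lite",
        "--predictions_path", "preds.json",
        "--max_workers", "4",
        "--run_id", "my_eval",  # required as of #149
    ],
    check=True,
)
```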