Run evaluation on full SWE-Bench #1693

Open
rezzie-rich opened this issue May 10, 2024 · 19 comments
Labels: enhancement (New feature or request), severity:medium (Affecting multiple users)

Comments

@rezzie-rich

Love the progress so far!

Will you test and publish results on the full SWE-bench and the 25% subset, in addition to SWE-bench Lite?

The auto-code-rover repo says 22% on SWE-bench Lite and 16% on the full SWE-bench. However, you have ACR at 16% on SWE-bench Lite. Is that the result you got, or a typo?

rezzie-rich added the question (Further information is requested) label May 10, 2024
@rbren (Collaborator) commented May 10, 2024

Thanks for pointing that out! Not sure if it was a typo, or if we were using an old result of theirs.

Let's remove the graph until we can generate a better one

@neubig (Contributor) commented May 10, 2024

This number is from the most recent version of the AutoCodeRover paper! I think we should clarify this in the graph.

@frankxu2004 (Collaborator) commented May 10, 2024

The ACR paper shows:

[image: results table from the ACR paper]

Note that ACR-avg is the comparable number here, as it is the average of 3 runs (i.e., a pass@1 rate). The 22.33% number in their repo is ACR-all, which is the union of the 3 runs (i.e., a pass@3 rate). So the pass@1 comparison is still valid: for pass@1, ACR is at 16%.
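For anyone skimming, here is a minimal sketch (my own illustration; the instance IDs and results are made up) of how the two metrics are computed from per-run results:

```python
from statistics import mean

# Hypothetical per-instance results: runs[i][instance] is True if run i resolved it.
runs = [
    {"django-001": True,  "flask-002": False, "sympy-003": False},
    {"django-001": False, "flask-002": True,  "sympy-003": False},
    {"django-001": True,  "flask-002": False, "sympy-003": False},
]
instances = list(runs[0])

# pass@1 ("ACR-avg" style): mean resolve rate of the individual runs.
pass_at_1 = mean(mean(run[i] for i in instances) for run in runs)

# pass@3 ("ACR-all" style): an instance counts if any of the 3 runs resolved it.
pass_at_3 = mean(any(run[i] for run in runs) for i in instances)

print(f"pass@1 = {pass_at_1:.1%}, pass@3 = {pass_at_3:.1%}")  # 33.3% vs 66.7%
```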

@rbren (Collaborator) commented May 10, 2024

Ah, thank you for the clarification! I will withdraw my PR.

neubig closed this as completed May 10, 2024
neubig reopened this May 10, 2024
neubig changed the title from "SWE-bench result" to "Re-run on full SWE-Bench" May 10, 2024
@neubig (Contributor) commented May 10, 2024

So I think the AutoCodeRover number is fine as-is, but I agree we should still run on all of SWE-bench. The main bottleneck is time and cost: it costs about $6,000 to run on all of SWE-bench with GPT-4.
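For rough scale, a back-of-the-envelope of my own (assuming the commonly cited split sizes of 2,294 instances for the full SWE-bench test set and 300 for Lite):

```python
# Back-of-the-envelope only; these are my own assumptions, not official numbers.
FULL_INSTANCES = 2294   # assumed size of the full SWE-bench test split
LITE_INSTANCES = 300    # assumed size of the SWE-bench Lite split
FULL_RUN_COST = 6000    # quoted ballpark for one full GPT-4 run, in USD

per_instance = FULL_RUN_COST / FULL_INSTANCES    # ~$2.6 per instance
lite_run_cost = per_instance * LITE_INSTANCES    # ~$785 per Lite run at that rate
print(f"~${per_instance:.2f}/instance, ~${lite_run_cost:.0f} per Lite run")
```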

@neubig neubig changed the title Re-run on full SWE-Bench Run evaluation on full SWE-Bench May 10, 2024
@neubig neubig added enhancement New feature or request severity:medium Affecting multiple users and removed question Further information is requested labels May 10, 2024
@libowen2121 (Contributor)

@rezzie-rich Thank you for the question! As @frankxu2004 clarified, we only report pass@1 results in the graph. Our evaluation containerization only supports SWE-bench Lite for now, and we will extend it to support the full test set!

@rezzie-rich (Author)

GPT-4 is expensive. I think it would be cool if you could run the full benchmark using Llama 3 70B and 8B, as it would give a unique and realistic expectation of running with an open LLM.

It's hard to compare SWE-bench with humans, but as a rule of thumb, an average junior developer should be able to complete 10-25%, while an average senior developer can complete 20-40%.

If OpenDevin can complete 25%+ using an open LLM (preferably with fewer than 34B parameters), it's a game changer!

@rezzie-rich (Author) commented May 11, 2024

https://evalplus.github.io/leaderboard.html

The leaderboard space for open code LLMs is kind of diluted. However, I found this up-to-date leaderboard, which seems pretty legit.

It has CodeQwen1.5-7B-Chat listed above Claude 3 Opus, right next to GPT-4. A small LLM like this should be able to run the benchmark much faster and a lot cheaper than GPT-4.

If the leaderboard is accurate, that makes CodeQwen a valid replacement for GPT-4.

If OpenDevin can complete 20-25% of the full SWE-bench using a 7B model, that would prove the practicality and real-world use case of AI agents in software development.

My thoughts: testing the agents on smaller models will also be good for marketing and user satisfaction, as well as for improving the agents' quality. Most people will try OpenDevin after seeing the GPT-4 results but then use it with a local model for budget reasons, which creates an unsatisfying experience. If instead they saw the local models' scores and could replicate those results, the experience would be more satisfying, and it leaves room for further performance gains once a closed LLM is used. It's better to promise less than you deliver.

@rezzie-rich (Author) commented May 18, 2024

https://chat.lmsys.org/?leaderboard

Llama-3-70B-Instruct is performing better than half of the GPT-4 versions. I think it would be great to have benchmarks done using Llama 3, in the spirit of the open-source community, while keeping usage practical.

I know quantized models degrade in performance; however, Q8 models are almost indistinguishable from FP16. A modern high-performance CPU with 128 GB of RAM can handle it easily while keeping costs relatively low.
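The rough arithmetic behind that claim, as I understand it (my own sketch; it ignores the KV cache and runtime overhead and assumes a plain dense 70B-parameter model):

```python
# Rough memory math only -- weights alone, no KV cache or runtime overhead.
PARAMS = 70e9  # dense 70B-parameter model

fp16_gb = PARAMS * 2 / 1e9   # ~140 GB of weights -> does not fit in 128 GB RAM
q8_gb   = PARAMS * 1 / 1e9   # ~70 GB of weights  -> fits with headroom
print(f"fp16 weights: ~{fp16_gb:.0f} GB, Q8 weights: ~{q8_gb:.0f} GB")
```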

@xingyaoww (Contributor) commented May 19, 2024

@rezzie-rich Good point -- however, Llama 3 only has an 8k context window, which means it is hardly useful in our agent use cases. I just tested the recent DeepSeek-V2 MoE; check the results here: https://huggingface.co/spaces/OpenDevin/evaluation

It got ~5% on SWE-bench Lite, and from what I can tell qualitatively, a lot of the error cases (~70%) are due to the limited context window (32k) of their API. I can only imagine this being way worse on Llama 3 due to its 8k window.
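To make the failure mode concrete, here is a minimal sketch (my own illustration, not OpenDevin code; tiktoken's cl100k_base tokenizer is only an approximation for non-OpenAI models) of the window check an agent loop has to make before each model call:

```python
import tiktoken  # assumes tiktoken is installed

CONTEXT_LIMIT = 8_192  # Llama 3's window; the DeepSeek-V2 API allowed 32k

enc = tiktoken.get_encoding("cl100k_base")
# Hypothetical accumulated agent history: system prompt, repo context, tool output...
history = ["<system prompt>", "<repo map>", "<file contents>", "<test output>"]

n_tokens = sum(len(enc.encode(msg)) for msg in history)
if n_tokens > CONTEXT_LIMIT:
    # The agent has to truncate or summarize earlier steps, losing information,
    # which is where many of these failure cases come from.
    print(f"history of {n_tokens} tokens exceeds the {CONTEXT_LIMIT}-token window")
```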

@rezzie-rich (Author) commented May 19, 2024

https://huggingface.co/gradientai/Llama-3-70B-Instruct-Gradient-1048k

This version has a one-million-token context window.

Btw, LOVE the new Hugging Face space!

@xingyaoww (Contributor)

@rezzie-rich Thanks a ton for sharing! Will try to get some GPUs and test it right away!

@BradKML commented Jun 3, 2024

Seconding this, and not just switching models as @rezzie-rich suggests (a great idea, BTW, if it can be included in Ollama or some other tool). Are there also any alternative benchmarks that measure how well agents solve competitive-coding problems (or data-science problems), to confirm quality beyond the bare LLM? Maybe mixing big and small LLMs (e.g., a Qwen + LLaMA combo) for added acceleration?

@yuntongzhang

Hi, I'm late to the discussion, but I'd like to give an update on the pass@1 score in the original AutoCodeRover paper.

It turns out that the SWE-bench evaluation environment used in our original experiments underestimated scores due to missing system-level dependencies. Some correct patches were deemed wrong after running the SWE-bench acceptance tests in that environment.

Thanks to the SWE-bench-docker project, our original patches were re-evaluated, and the actual pass@1 score is 19% instead of 16%. More details can be found here. The 19% pass@1 score is also reflected on the SWE-bench leaderboard.

github-actions bot commented Jul 25, 2024

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions bot added the Stale (Inactive for 30 days) label Jul 25, 2024
@0xdevalias

Shouldn't be stale IMO

xingyaoww removed the Stale (Inactive for 30 days) label Jul 25, 2024
github-actions bot commented Aug 25, 2024

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions bot added the Stale (Inactive for 30 days) label Aug 25, 2024
xingyaoww removed the Stale (Inactive for 30 days) label Aug 25, 2024
@xingyaoww (Contributor) commented Aug 25, 2024

Some updates: we are making progress on the infrastructure side; hopefully, we can resolve this in ~2 weeks!

Running OpenHands with 2,000 Docker containers efficiently is not an easy task 😓

github-actions bot commented Sep 25, 2024

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions bot added the Stale (Inactive for 30 days) label Sep 25, 2024
enyst removed the Stale (Inactive for 30 days) label Sep 25, 2024