reference to opendevin codeact v1.3 for comparison in the new draft blog post #615

Closed
rawwerks opened this issue May 25, 2024 · 6 comments

Comments

@rawwerks

rawwerks commented May 25, 2024

hey @paul-gauthier

first off: huge congrats! swe-bench/experiments#7

re: https://github.com/paul-gauthier/aider/blob/6382153597af092bfdac4ea30104d3243720502e/_posts/2024-05-22-swe-bench-lite.md, i think it's worth mentioning the latest "codeact v1.3" results from the https://github.com/OpenDevin/OpenDevin team:
https://huggingface.co/spaces/OpenDevin/evaluation

I still think aider is the winner. I just wanted to share that this team has beaten their prior best, and I would recommend updating your table and the first 2 sentences.

One potential ambiguity is that I don't see their swe-bench-lite scores, only the full swe-bench. But since lite is a subset of the full one, I don't think it should be too hard to get their official lite score.

keep up the amazing work!

@paul-gauthier
Collaborator

Thanks for the pointer. Those results look to me like they are for SWE Bench Lite, because each one lists 300 total items. I am also unclear on what maxiter means. If these are pass@50 results, they are not comparable to aider's pass@1 results.

Has Open Devin published anything about these results?

@rawwerks
Author

rawwerks commented May 25, 2024

> Has Open Devin published anything about these results?

I haven't looked outside of their slack/discord/twitter (https://discord.gg/ESHStjSjD4, https://join.slack.com/t/opendevin/shared_invite/zt-2i1iqdag6-bVmvamiPA9EZUu7oCO6KhA)

...but i'll post a link to this issue in their slack right now, to encourage them to share their details directly.

@paul-gauthier
Collaborator

Looks like it is Lite and they posted on X. I will update.

https://x.com/gneubig/status/1791498953709752405

@rawwerks
Author

i suggested they clarify the leaderboard: https://huggingface.co/spaces/OpenDevin/evaluation/discussions/1

@xingyaoww

@rawwerks Thanks for pointing out the confusion! Yes, those scores are for SWE-Bench lite and we've updated the leaderboard for clarity.

@paul-gauthier
Collaborator

I'm going to close this issue for now, but feel free to add a comment here and I will re-open or file a new issue any time.
