reference to opendevin codeact v1.3 for comparison in the new draft blog post #615
Comments
Thanks for the pointer. Those results look to me like they are for SWE Bench Lite, because each one lists 300 instances. Has Open Devin published anything about these results?
I haven't looked outside of their slack/discord/twitter (https://discord.gg/ESHStjSjD4, https://join.slack.com/t/opendevin/shared_invite/zt-2i1iqdag6-bVmvamiPA9EZUu7oCO6KhA), but I'll post a link to this issue in their slack right now, to encourage them to share their details directly.
Looks like it is Lite and they posted on X. I will update.
i suggested they clarify the leaderboard: https://huggingface.co/spaces/OpenDevin/evaluation/discussions/1
@rawwerks Thanks for pointing out the confusion! Yes, those scores are for SWE-Bench lite and we've updated the leaderboard for clarity.
I'm going to close this issue for now, but feel free to add a comment here and I will re-open or file a new issue any time.
hey @paul-gauthier
first off: huge congrats! swe-bench/experiments#7
re: https://github.com/paul-gauthier/aider/blob/6382153597af092bfdac4ea30104d3243720502e/_posts/2024-05-22-swe-bench-lite.md, i think it's worth mentioning the latest "codeact v1.3" results from the https://github.com/OpenDevin/OpenDevin team:
https://huggingface.co/spaces/OpenDevin/evaluation
I still think aider is the winner, I just wanted to share that this team has beaten their prior best, and I would recommend updating your table and the first 2 sentences.
One potential ambiguity is that I don't see their swe-bench-lite scores, only the full swe-bench. But since lite is a subset of the full one, I don't think it should be too hard to get their official lite score.
keep up the amazing work!