add spider (from big-bench) #1385
Nice! Could you check that the accuracies we get are in line with what's reported in the literature for at least one model?
Michi, would you be able to do the result replication that Percy suggested, or would you like some help here?
Hi Percy and Yifan, I did a replication study for Spider using the GPT-3 model used in BIG-bench. The results are below. Our numbers (HELM) are similar to those reported by BIG-bench for ROUGE-1, ROUGE-2, and ROUGE-L, but are off in BLEU; perhaps there is some difference in the exact type or implementation of BLEU? Overall, this looks reasonable?
Model: openai/davinci
Thanks! The difference in the BLEU scores is a bit concerning, but it's big enough that it's probably a misconfiguration that should be relatively easy to track down?
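To illustrate how a configuration difference alone can move a BLEU score, here is a minimal, self-contained sketch. It is not the BLEU implementation used by HELM or BIG-bench; the functions `bleu` and `sql_tokenize`, the add-one smoothing variant, and the example query pair are all assumptions made for illustration. It shows the same prediction scored under three settings: whitespace tokens without smoothing, whitespace tokens with smoothing, and a lowercased tokenizer that splits punctuation.

```python
import math
import re
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(hypothesis, reference, max_n=4, smooth=False, tokenize=str.split):
    """Single-reference, sentence-level BLEU on a 0-1 scale (illustrative only).

    smooth=True applies add-one smoothing to every n-gram order, which is a
    simplification and only one of several smoothing schemes in use.
    """
    hyp, ref = tokenize(hypothesis), tokenize(reference)
    if not hyp:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
        # Clipped matches: each hypothesis n-gram counts at most as often
        # as it appears in the reference.
        matches = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        total = sum(hyp_counts.values())
        if smooth:
            matches, total = matches + 1, total + 1
        if matches == 0 or total == 0:
            return 0.0  # unsmoothed BLEU collapses to 0 if any order has no match
        log_precisions.append(math.log(matches / total))
    # Brevity penalty for hypotheses that are not longer than the reference.
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / len(hyp))
    return bp * math.exp(sum(log_precisions) / max_n)


def sql_tokenize(text):
    """Hypothetical tokenizer: lowercase, then split words from punctuation."""
    return re.findall(r"\w+|[^\w\s]", text.lower())


# Hypothetical reference/prediction pair that differs only in letter case.
reference = "SELECT count(*) FROM singer"
hypothesis = "select COUNT(*) from singer"

print("whitespace, no smoothing:", bleu(hypothesis, reference))
print("whitespace, smoothing:   ", bleu(hypothesis, reference, smooth=True))
print("sql_tokenize:            ", bleu(hypothesis, reference, tokenize=sql_tokenize))
```

The three calls give different values for the same prediction, so a gap between two reported BLEU numbers can come from tokenization, casing, or smoothing rather than from the model outputs. Reporting scale is another thing to check, since some toolkits report BLEU on a 0-100 scale and others on 0-1.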
I looked into the BLEU score mismatch in more detail. The takeaways are
From these observations, I think that our Spider run using HELM is correct and that our metric calculations are correct too. For BIG-bench scenarios other than Spider, did the BLEU score in HELM match the BLEU score reported by BIG-bench?
We don't have any other BIG-bench results in the official HELM releases, unfortunately. My inclination is to merge this PR as is, but with priority lowered to 3. Then other folks could look into the replication issue later.
Filed #1699 to investigate this. @michiyasunaga could you lower the priority to 3 and merge?
Updated the priority and merged. Thanks for the review!
Based on Percy's request, I added the Spider text-to-SQL task, available through BIG-bench. I checked that it runs.