
add spider (from big-bench) #1385

Merged 3 commits into main on Jun 28, 2023

Conversation

michiyasunaga
Member

Based on Percy's request, I added the Spider text-to-SQL task, available through BIG-bench. I checked that it runs.

@percyliang percyliang requested a review from yifanmai February 23, 2023 07:26
@percyliang
Contributor

Nice! Could you check that the accuracies we get are in line with what's reported in the literature, for at least one model?

@yifanmai
Collaborator

Michi, would you be able to do the result replication that Percy suggested, or would you like some help here?

@michiyasunaga
Member Author

michiyasunaga commented Apr 26, 2023

Hi Percy and Yifan,

I did a replication study for Spider using the GPT-3 model used in BIG-bench. The results are below. Our numbers (HELM) are similar to those reported by BIG-bench for ROUGE-1, ROUGE-2, and ROUGE-L, but are off for BLEU; perhaps there is some difference in the exact type/implementation of BLEU?

Overall, this looks reasonable?

Model: openai/davinci

| | ROUGE-1 | ROUGE-2 | ROUGE-L | BLEU |
| --- | --- | --- | --- | --- |
| HELM (0-shot) | 0.29 | 0.14 | 0.27 | 0.10 |
| HELM (1-shot) | 0.31 | 0.16 | 0.28 | 0.17 |
| HELM (2-shot) | 0.33 | 0.17 | 0.30 | 0.18 |
| HELM (3-shot) | 0.33 | 0.17 | 0.30 | 0.17 |
| BIG-bench reported (0-shot) | 0.27 | 0.11 | 0.26 | 0.67 |
| BIG-bench reported (1-shot) | 0.32 | 0.13 | 0.32 | 1.93 |
| BIG-bench reported (2-shot) | 0.34 | 0.15 | 0.34 | 1.47 |
| BIG-bench reported (3-shot) | 0.34 | 0.16 | 0.34 | 1.11 |
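
As a quick illustration of why BLEU numbers can differ across implementations, here is a sketch comparing two common BLEU libraries (this is not the actual HELM or BIG-bench metric code; the SQL strings are hypothetical, and it assumes the `sacrebleu` and `nltk` packages are installed):

```python
# Sketch: two common BLEU implementations on the same (hypothetical) pair.
# sacrebleu reports on a 0-100 scale; NLTK reports on a 0-1 scale, so the
# same prediction can yield numbers that differ by orders of magnitude.
import sacrebleu
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

prediction = "SELECT name FROM singer WHERE age > 20"            # hypothetical output
reference = "SELECT name FROM singer WHERE age > 20 ORDER BY age"  # hypothetical gold

# sacrebleu takes raw strings and tokenizes internally.
sacre = sacrebleu.corpus_bleu([prediction], [[reference]]).score

# NLTK expects pre-tokenized input; the smoothing choice also shifts scores.
nltk_bleu = sentence_bleu(
    [reference.split()],
    prediction.split(),
    smoothing_function=SmoothingFunction().method1,
)

print(f"sacrebleu: {sacre:.2f}  (0-100 scale)")
print(f"nltk:      {nltk_bleu:.2f}  (0-1 scale)")
```

Beyond the scale difference, the two also disagree on tokenization and smoothing, so numbers from different BLEU implementations are generally not directly comparable.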

@percyliang
Contributor

Thanks! The difference in the BLEU scores is a bit concerning, but it's big enough that it's probably a misconfiguration, which should be relatively easy to track down?

@michiyasunaga
Member Author

I looked into the BLEU score mismatch in more detail.
Specifically, I took the predictions from our HELM Spider run and evaluated the ROUGE and BLEU scores using both HELM's metric calculation code and BIG-bench's metric calculation code (for details/reproduction, see this notebook: https://colab.research.google.com/drive/1cOeMalGd9zbw51nT6HixtBPxaGqJY3i-).
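
For reference, a minimal sketch of the kind of re-scoring done in that notebook (the data below is hypothetical; it assumes the `rouge-score` package, which may not be the exact implementation either codebase uses):

```python
# Sketch: re-score saved predictions against references with ROUGE.
# Replace the hypothetical lists with the predictions from the HELM run.
from rouge_score import rouge_scorer

predictions = ["SELECT name FROM singer"]              # hypothetical model outputs
references = ["SELECT name FROM singer ORDER BY age"]  # hypothetical gold SQL

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"])
totals = {"rouge1": 0.0, "rouge2": 0.0, "rougeL": 0.0}
for pred, ref in zip(predictions, references):
    result = scorer.score(ref, pred)  # signature is score(target, prediction)
    for key in totals:
        totals[key] += result[key].fmeasure

# Report the mean F-measure for each ROUGE variant.
for key, total in totals.items():
    print(f"{key}: {total / len(predictions):.2f}")
```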

The takeaways are:

  • For both ROUGE and BLEU, the results roughly match between HELM's metric calculation code and BIG-bench's metric calculation code.
  • Our ROUGE scores match BIG-bench's reported ROUGE numbers (see the result table in the comment above).

From these observations, I think our HELM Spider run and our metric calculations are both correct. So perhaps there is a mistake in BIG-bench's reported BLEU numbers?

For BIG-bench scenarios other than Spider, did the BLEU scores in HELM match BIG-bench's reported BLEU scores?

@yifanmai
Collaborator

yifanmai commented Jun 5, 2023

We don't have any other BIG-bench results in the official HELM releases, unfortunately.

My inclination is to merge this PR as-is, but with the priority lowered to 3. Then other folks can look into the replication issue later.

@yifanmai
Collaborator

Filed #1699 to investigate this.

@michiyasunaga could you lower the priority to 3 and merge?

@michiyasunaga michiyasunaga merged commit 86ffdf6 into main Jun 28, 2023
@michiyasunaga michiyasunaga deleted the add_spider branch June 28, 2023 20:56
@michiyasunaga
Member Author

Updated the priority and merged. Thanks for the review!
