
add spider (from big-bench) #1385

Merged 3 commits into main on Jun 28, 2023

Conversation

michiyasunaga
Member

Based on Percy's request, I added the Spider text-to-SQL task, available through BIG-bench. I checked that it runs.

@percyliang percyliang requested a review from yifanmai February 23, 2023 07:26
@percyliang
Contributor

Nice! Could you check that the accuracies we get are in line with what's reported in the literature, for at least one model?

@yifanmai
Collaborator

Michi, would you be able to do the result replication that Percy suggested, or would you like some help here?

@michiyasunaga
Member Author

michiyasunaga commented Apr 26, 2023

Hi Percy and Yifan,

I did a replication study for Spider using the GPT-3 model used in BIG-bench. The results are below. Our numbers (HELM) are similar to those reported by BIG-bench for ROUGE-1, ROUGE-2, and ROUGE-L, but are off for BLEU; perhaps there is some difference in the exact type/implementation of BLEU?

Overall, this looks reasonable?

Model: openai/davinci

| | ROUGE-1 | ROUGE-2 | ROUGE-L | BLEU |
| --- | --- | --- | --- | --- |
| HELM (0-shot) | 0.29 | 0.14 | 0.27 | 0.10 |
| HELM (1-shot) | 0.31 | 0.16 | 0.28 | 0.17 |
| HELM (2-shot) | 0.33 | 0.17 | 0.30 | 0.18 |
| HELM (3-shot) | 0.33 | 0.17 | 0.30 | 0.17 |
| BIG-bench reported (0-shot) | 0.27 | 0.11 | 0.26 | 0.67 |
| BIG-bench reported (1-shot) | 0.32 | 0.13 | 0.32 | 1.93 |
| BIG-bench reported (2-shot) | 0.34 | 0.15 | 0.34 | 1.47 |
| BIG-bench reported (3-shot) | 0.34 | 0.16 | 0.34 | 1.11 |
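
As a quick illustration of why BLEU numbers can differ across implementations, here is a sketch comparing two common BLEU libraries (this is not the actual HELM or BIG-bench metric code; the SQL strings are hypothetical, and it assumes the `sacrebleu` and `nltk` packages are installed):

```python
# Sketch: two common BLEU implementations on the same (hypothetical) pair.
# sacrebleu reports on a 0-100 scale; NLTK reports on a 0-1 scale, so the
# same prediction can yield numbers that differ by orders of magnitude.
import sacrebleu
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

prediction = "SELECT name FROM singer WHERE age > 20"            # hypothetical output
reference = "SELECT name FROM singer WHERE age > 20 ORDER BY age"  # hypothetical gold

# sacrebleu takes raw strings and tokenizes internally.
sacre = sacrebleu.corpus_bleu([prediction], [[reference]]).score

# NLTK expects pre-tokenized input; the smoothing choice also shifts scores.
nltk_bleu = sentence_bleu(
    [reference.split()],
    prediction.split(),
    smoothing_function=SmoothingFunction().method1,
)

print(f"sacrebleu: {sacre:.2f}  (0-100 scale)")
print(f"nltk:      {nltk_bleu:.2f}  (0-1 scale)")
```

Beyond the scale difference, the two also disagree on tokenization and smoothing, so numbers from different BLEU implementations are generally not directly comparable.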

@percyliang
Contributor

Thanks! The difference in the BLEU scores is a bit concerning, but it's big enough that it's probably a misconfiguration, which should be relatively easy to track down?

@michiyasunaga
Member Author

I looked into the BLEU score mismatch in more detail.
Specifically, I took the predictions from our HELM Spider run and evaluated the ROUGE and BLEU scores using both HELM's metric calculation code and BIG-bench's metric calculation code (for details/reproduction, see this notebook: https://colab.research.google.com/drive/1cOeMalGd9zbw51nT6HixtBPxaGqJY3i-).
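
For reference, a minimal sketch of the kind of re-scoring done in that notebook (the data below is hypothetical; it assumes the `rouge-score` package, which may not be the exact implementation either codebase uses):

```python
# Sketch: re-score saved predictions against references with ROUGE.
# Replace the hypothetical lists with the predictions from the HELM run.
from rouge_score import rouge_scorer

predictions = ["SELECT name FROM singer"]              # hypothetical model outputs
references = ["SELECT name FROM singer ORDER BY age"]  # hypothetical gold SQL

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"])
totals = {"rouge1": 0.0, "rouge2": 0.0, "rougeL": 0.0}
for pred, ref in zip(predictions, references):
    result = scorer.score(ref, pred)  # signature is score(target, prediction)
    for key in totals:
        totals[key] += result[key].fmeasure

# Report the mean F-measure for each ROUGE variant.
for key, total in totals.items():
    print(f"{key}: {total / len(predictions):.2f}")
```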

The takeaways are:

  • For both ROUGE and BLEU, the results roughly match between HELM's metric calculation code and BIG-bench's metric calculation code.
  • Our ROUGE scores match BIG-bench's reported ROUGE numbers (see the result table in the comment above).

From these observations, I think our HELM Spider run and our metric calculations are both correct. So perhaps there is a mistake in BIG-bench's reported BLEU numbers?

For BIG-bench scenarios other than Spider, did the BLEU scores in HELM match BIG-bench's reported BLEU scores?

@yifanmai
Collaborator

yifanmai commented Jun 5, 2023

We don't have any other BIG-bench results in the official HELM releases, unfortunately.

My inclination is to merge this PR as-is, but with the priority lowered to 3. Then other folks can look into the replication issue later.

@yifanmai
Collaborator

Filed #1699 to investigate this.

@michiyasunaga could you lower the priority to 3 and merge?

@michiyasunaga michiyasunaga merged commit 86ffdf6 into main Jun 28, 2023
@michiyasunaga michiyasunaga deleted the add_spider branch June 28, 2023 20:56
@michiyasunaga
Member Author

Updated the priority and merged. Thanks for the review!
