Release raw lambada dataset #131

Is it possible to release the Lambada dataset used to generate the accuracy numbers in Table 3 of the paper? This would make it easier to do comparisons with other models :)
@Newmu

Comments
We just use the plain text files, which can be downloaded here: https://zenodo.org/record/2630551#.XNxg89NKjUI
That's a post-processed version, i.e., "don't" is split into "do n't", etc. GPT-2-small gets around 31% on that set. My understanding from @Newmu was that the 45.99 figure in Table 3 of the paper was on the raw/non-processed version.
We apply "de-tokenizers" to remove some of the artifacts. Alec can verify, but I think in this case it's simple.
In fact, the detokenizer should be invertible, although I don't think that's important for the accuracy numbers.
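The exact snippet isn't preserved in this copy of the thread; a minimal sketch of what such a detokenizer could look like, assuming standard PTB-style artifacts (every rule here is a guess, not OpenAI's actual code):

```python
# A guessed sketch of a PTB-style detokenizer -- the rules are assumptions,
# not the snippet originally posted in this thread.
def detokenize(text: str) -> str:
    # Undo PTB contraction splitting: "do n't" -> "don't", "it 's" -> "it's", ...
    for tok in (" n't", " 's", " 're", " 've", " 'll", " 'd", " 'm"):
        text = text.replace(tok, tok[1:])
    # Map PTB quote tokens back to curly quotes: `` -> “ and '' -> ”
    text = text.replace("`` ", "“").replace(" ''", "”")
    return text
```

Note that simple string replaces like these aren't strictly invertible, so the invertibility point above would need a more careful, tokenizer-aware mapping.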
As recommended in openai/gpt-2#131. The original suggestion makes no difference because the official release doesn't have smart quotes. Adding `` → “ and '' → ” rules improves the result by 0.3%.
This detokenizer doesn't do anything on the official Lambada dataset, since there are no smart quotes in it. My understanding is that OpenAI used its own version of the Lambada dataset, generated from BookCorpus/LAMBADA. This dataset is interesting because of the accuracy gap in GPT-2-small numbers: 34% on the official Lambada vs. 46% on OpenAI's version.
My bad, you're right, whoops! Try this: gs://gpt-2/data/lambada_test.jsonl
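For reference, one way to grab and inspect that file, assuming the bucket is public (a gs://bucket/path object is served over HTTPS at storage.googleapis.com/bucket/path); the per-line schema printed below is not documented in the thread:

```python
# Sketch: fetch the JSONL over the public HTTPS endpoint for the gs:// path.
import json
import urllib.request

URL = "https://storage.googleapis.com/gpt-2/data/lambada_test.jsonl"
with urllib.request.urlopen(URL) as resp:
    lines = resp.read().decode("utf-8").splitlines()
examples = [json.loads(line) for line in lines]  # one JSON object per line

print(len(examples))  # number of test passages
print(examples[0])    # inspect the first record's fields
```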
Thanks, that dataset makes a difference. I'm now getting 41.98% using GPT-2-small on this version of the dataset, with length-5 beam-search decoding of the last word for stop-word removal. Simplifying the test procedure to compare the last BPE token for equality instead of the last word, accuracy goes up to 46.89%. I'm wondering if this should be called "lambada-openai" or something in tables to avoid confusion. I looked at the errors between the two datasets, and this version seems easier because the formatting provides extra information. (Side-by-side examples labeled "Official Lambada" and "This version" were attached here.)
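A minimal sketch of the last-BPE-token comparison described above, using the Hugging Face transformers GPT-2 rather than the original evaluation code (the local file path and the "text" field name are assumptions):

```python
# Sketch: score LAMBADA by checking whether greedy argmax at the final
# position reproduces the last BPE token (not the original eval code).
import json
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def last_bpe_token_correct(text: str) -> bool:
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids[:, :-1]).logits
    pred = logits[0, -1].argmax().item()  # greedy prediction for final position
    return pred == ids[0, -1].item()      # compare against the actual last token

with open("lambada_test.jsonl") as f:                 # path is an assumption
    texts = [json.loads(line)["text"] for line in f]  # "text" field is an assumption
acc = sum(last_bpe_token_correct(t) for t in texts) / len(texts)
print(f"last-BPE-token accuracy: {acc:.2%}")
```

Comparing only the last BPE token is more lenient than whole-word comparison (a word can span several BPE tokens), which is consistent with the accuracy jump from 41.98% to 46.89% reported above.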
Yeah, I agree keeping the extra information is potentially useful (even for non-zero-shot), and it's probably good to distinguish it from the original dataset.
Hi, I'm also looking to run the same test. Can you fix the link?
Should now be at: