pretrained GPT-2 checkpoint gets only 31% accuracy on Lambada #491
Accuracy goes up to 31% if I use a stop-word filter, which still seems lower than expected (predictions).
Hi, I doubt it's a problem with the model. Usually the culprit is to be found in the pre-processing logic. Your dataset seems to be pre-processed, but Radford, Wu et al. say they are using a version without preprocessing (end of section 3.3). GPT-2 is likely sensitive to tokenization issues and the like. If you want to check the model itself, you could try comparing with the predictions of the TensorFlow version on a few LAMBADA completions.
Applying detokenization raises accuracy to 33.11%. I spot-checked a few errors against the TF implementation and it gives the same errors, so the difference is likely due to the eval protocol rather than the checkpoint.
IMHO "without pre-processing" means taking the original dataset without modification, which is what I also did here. However, in the original dataset everything is tokenized, i.e. "haven't" was turned into "have n't".
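A minimal sketch of the kind of detokenization being discussed, i.e. undoing the whitespace tokenization in the LAMBADA release ("have n't" back to "haven't"). The exact rules used for the reported 33.11% are not given in the thread; the patterns below are illustrative:

```python
import re

def detokenize(text: str) -> str:
    # Re-attach contractions that the tokenizer split off.
    text = re.sub(r" n't\b", "n't", text)
    text = re.sub(r" '(s|re|ve|ll|d|m)\b", r"'\1", text)
    # Remove spaces inserted before common punctuation.
    text = re.sub(r" ([.,!?;:])", r"\1", text)
    return text

assert detokenize("they have n't seen it , yet") == "they haven't seen it, yet"
```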
Indeed. It's not very clear to me what exactly they mean by "stop-word filter". It seems like the kind of heuristic that can have a very large impact on performance, so maybe better filtering is key. I would probably use a sort of beam search to compute the probability of a punctuation/end-of-sentence token appearing after the predicted word, and use that to filter the results; see the sketch below.
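A hedged sketch of that filtering idea, simplified to a single-token top-k search rather than a full beam search: among the model's top candidate next tokens, skip stop words, and keep a candidate only if punctuation is likely to follow it (LAMBADA targets always end the passage). The stop-word list, threshold, and function names are illustrative assumptions, not from the thread:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "that", "it"}
END_TOKEN_IDS = [tokenizer.encode(p)[0] for p in (".", "!", "?", ",")]

@torch.no_grad()
def predict_last_word(context: str, top_k: int = 10) -> str:
    ids = tokenizer.encode(context, return_tensors="pt")
    logits = model(ids).logits[0, -1]
    for tok_id in torch.topk(logits, top_k).indices.tolist():
        word = tokenizer.decode([tok_id]).strip()
        # Stop-word filter: skip function words and non-alphabetic tokens.
        if word.lower() in STOP_WORDS or not word.isalpha():
            continue
        # Keep the candidate only if punctuation / end-of-sentence is
        # likely to follow it. The 0.1 threshold is an arbitrary
        # illustrative value, not a tuned one.
        ext = torch.cat([ids, torch.tensor([[tok_id]])], dim=1)
        probs = torch.softmax(model(ext).logits[0, -1], dim=-1)
        if probs[END_TOKEN_IDS].sum().item() > 0.1:
            return word
    return ""  # no candidate in the top k passed the filters
```

Note this only scores single-token candidates, so multi-token last words would need the beam-search extension described above.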
I spoke with Alec, and it turns out that for evaluation they used the "raw" LAMBADA corpus, which was obtained by finding the original sentences in BookCorpus that matched the tokenized versions in the LAMBADA release. So to reproduce the numbers we need the "raw" corpus: openai/gpt-2#131
I'm now able to get within 1% of their reported accuracy on GPT2-small. The two missing modifications were:
- evaluating on the "raw" (detokenized) LAMBADA corpus rather than the tokenized release
- evaluating only the last BPE token of the target word rather than the whole word
"Evaluate the last BPE token" maybe wrong, because the prediction maybe use the information of ground turth when the last word is tokenized to two token. |
For some reason I only see 26% accuracy when evaluating the GPT-2 checkpoint on LAMBADA, instead of the expected 45.99%.
Here's a file of predictions with sets of 3 lines of the form:
ground truth
predicted last_word
is_counted_as_error
Generated by this script.
Could this be caused by the way the GPT-2 checkpoint was imported into HuggingFace?
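For reference, a minimal sketch of the evaluation protocol being debugged here: greedily predict the final word of each LAMBADA passage and count exact matches. This is not the linked script; the file path, line format, and stopping heuristic are assumptions:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

@torch.no_grad()
def greedy_last_word(context: str, max_tokens: int = 6) -> str:
    ids = tokenizer.encode(context, return_tensors="pt")
    out = []
    for _ in range(max_tokens):
        next_id = model(ids).logits[0, -1].argmax().item()
        piece = tokenizer.decode([next_id])
        if out and piece.startswith(" "):  # crude word-boundary heuristic
            break
        out.append(next_id)
        ids = torch.cat([ids, torch.tensor([[next_id]])], dim=1)
    return tokenizer.decode(out).strip()

correct = total = 0
with open("lambada_test.txt") as f:  # one passage per line (assumed format)
    for line in f:
        context, _, target = line.strip().rpartition(" ")
        correct += greedy_last_word(context) == target
        total += 1
print(f"accuracy: {correct / total:.2%}")
```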