pretrained GPT-2 checkpoint gets only 31% accuracy on Lambada #491

Closed
yaroslavvb opened this issue Apr 16, 2019 · 8 comments
Labels
Discussion (Discussion on a topic; keep it focused or open a new issue though)

Comments

@yaroslavvb
Contributor

For some reason I only see 26% accuracy when evaluating the pretrained GPT-2 checkpoint on Lambada, instead of the expected 45.99%.

Here's a file of predictions with sets of 3 lines of the form:

ground truth
predicted last_word
is_counted_as_error

Generated by this script

Could this be caused by the way the GPT-2 checkpoint was imported into HuggingFace?
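
For context, a minimal sketch of this kind of greedy last-word evaluation (the linked script itself isn't reproduced here; this assumes the current `transformers` API rather than the 2019 codebase) might look like:

```python
# Minimal sketch of greedy last-word evaluation on LAMBADA, assuming the
# current `transformers` API; not the script linked above.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def predict_last_word(passage):
    """Greedily generate a completion for the final word of a LAMBADA passage."""
    context, target = passage.rsplit(" ", 1)              # hold out the last word
    input_ids = tokenizer.encode(context, return_tensors="pt")
    with torch.no_grad():
        # the target may span several BPE tokens, so generate a few of them
        out = model.generate(input_ids, max_new_tokens=5, do_sample=False)
    completion = tokenizer.decode(out[0, input_ids.shape[1]:])
    predicted = completion.strip().split(" ")[0]           # first generated word
    return predicted, target

# accuracy = fraction of passages where `predicted == target`
```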

@yaroslavvb
Contributor Author

Accuracy goes up to 31% if I use a stop-word filter, but that still seems lower than expected (predictions)
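
The exact stop-word filter OpenAI used isn't documented; one plausible reading, sketched below, is to discard predicted function words and keep the first content word. The word list and behaviour here are guesses, not the filter used for the 31% number.

```python
# Hedged sketch of a stop-word filter: ignore predicted function words and
# take the first content word instead. The word list is purely illustrative.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "that", "it",
              "he", "she", "was", "is", "his", "her", "had", "not"}

def apply_stop_word_filter(candidate_words):
    """Return the first candidate that is not a stop word, else the first candidate."""
    for word in candidate_words:
        if word.lower() not in STOP_WORDS:
            return word
    return candidate_words[0]

# e.g. apply_stop_word_filter(["the", "radio"]) -> "radio"
```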

@yaroslavvb changed the title from "pretrained GPT-2 checkpoint gets only 26% accuracy on Lambada" to "pretrained GPT-2 checkpoint gets only 31% accuracy on Lambada" on Apr 16, 2019
@thomwolf
Member

thomwolf commented Apr 16, 2019

Hi, I doubt it's a problem with the model. Usually the culprit is to be found in the pre-processing logic.

Your dataset seems to be pre-processed, but Radford, Wu et al. say they use a version without preprocessing (end of section 3.3). GPT-2 is likely sensitive to tokenization issues and the like.

If you want to check the model itself, you could try comparing with the predictions of the TensorFlow version on a few LAMBADA completions.

@yaroslavvb
Contributor Author

Applying detokenization raises accuracy to 33.11%.

I spot-checked a few errors against the TF implementation and it gives the same errors, so it seems likely the difference is due to the eval protocol rather than the checkpoint.

@yaroslavvb
Contributor Author

IMHO "without pre-processing" means taking the original dataset without modification, which is also what I did here.

However, in the original dataset everything is tokenized, i.e. "haven't" was turned into "have n't".
Either way, undoing this tokenization only gives an improvement of about 2%, so there must be some deeper underlying difference in the way OpenAI did their evaluation.
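
For reference, undoing this PTB-style tokenization can be approximated with a few regular-expression rules; the sketch below is only a heuristic, not the exact detokenizer behind the 33.11% figure.

```python
# Rough detokenizer for the PTB-style tokenization in the LAMBADA release,
# e.g. "have n't" -> "haven't". Heuristic sketch only.
import re

def detokenize(text):
    text = re.sub(r" n't\b", "n't", text)                  # "have n't" -> "haven't"
    text = re.sub(r" '(s|re|ve|ll|d|m)\b", r"'\1", text)   # "it 's"    -> "it's"
    text = re.sub(r" ([.,!?;:])", r"\1", text)             # no space before punctuation
    text = re.sub(r"`` ?| ?''", '"', text)                 # PTB quotes -> plain quotes
    return text
```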

@thomwolf
Member

thomwolf commented Apr 17, 2019

Indeed. It's not very clear to me what they mean exactly by "stop-word filter". It seems like the kind of heuristic that can have a very large impact on performance.

Maybe better filtering is the key. I would probably go with a sort of beam search to compute the probability of having a punctuation/end-of-sentence token after the predicted word, and use that to filter the results.
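
A sketch of only the scoring step of that idea (the beam search and any threshold are left out, and this is just one way to read the suggestion):

```python
# Score a candidate word by how likely a punctuation token is to follow it,
# and filter out low-scoring candidates. Only the scoring step is shown.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
PUNCT_IDS = [tokenizer.encode(p)[0] for p in [".", ",", "!", "?", ";"]]

def punctuation_score(context, candidate):
    """P(punctuation as the next token | context + candidate) under GPT-2."""
    ids = tokenizer.encode(context + " " + candidate, return_tensors="pt")
    with torch.no_grad():
        logits = model(ids).logits[0, -1]
    probs = torch.softmax(logits, dim=-1)
    return probs[PUNCT_IDS].sum().item()
```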

@thomwolf added the Discussion and GPT-2 labels on Apr 17, 2019
@yaroslavvb reopened this on Apr 19, 2019
@yaroslavvb
Contributor Author

yaroslavvb commented May 11, 2019

I spoke with Alec, and it turns out that for evaluation they used the "raw" LAMBADA corpus, which was obtained by finding the original sentences in the book corpus that matched the tokenized versions in the LAMBADA release. So to reproduce the numbers we need the "raw" corpus: openai/gpt-2#131

@yaroslavvb
Contributor Author

I'm now able to get within 1% of their reported accuracy on GPT2-small. The two missing modifications were:

  1. Evaluate on OpenAI's version of LAMBADA, which adds extra formatting.
  2. Evaluate by counting the number of times the last BPE token is predicted incorrectly, instead of the last word; details are in "Release raw lambada dataset" openai/gpt-2#131 (comment).
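
For concreteness, the last-BPE-token protocol in point 2 can be sketched as follows (against the current `transformers` API, not OpenAI's own eval code):

```python
# Sketch of the last-BPE-token protocol: condition on every token except the
# final one, and score only whether that final BPE token is predicted by argmax.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def last_token_correct(passage):
    """True if the argmax prediction for the final BPE token matches the actual token."""
    ids = tokenizer.encode(passage, return_tensors="pt")
    with torch.no_grad():
        logits = model(ids[:, :-1]).logits        # condition on all but the last token
    predicted = logits[0, -1].argmax().item()
    return predicted == ids[0, -1].item()

# accuracy = mean(last_token_correct(p) for p in lambada_passages)
```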

@KeyKy

KeyKy commented Jul 31, 2024

> I'm now able to get within 1% of their reported accuracy on GPT2-small. The two missing modifications were:
>
>   1. Evaluate on OpenAI's version of LAMBADA, which adds extra formatting.
>   2. Evaluate by counting the number of times the last BPE token is predicted incorrectly, instead of the last word; details are in "Release raw lambada dataset" openai/gpt-2#131 (comment).

"Evaluate the last BPE token" maybe wrong, because the prediction maybe use the information of ground turth when the last word is tokenized to two token.
