pretrained GPT-2 checkpoint gets only 31% accuracy on Lambada #491

Closed
yaroslavvb opened this issue Apr 16, 2019 · 8 comments
Labels
Discussion (Discussion on a topic; keep it focused or open a new issue though)

Comments

@yaroslavvb
Contributor

For some reason I only see 26% accuracy when evaluating the pretrained GPT-2 checkpoint on Lambada, instead of the expected 45.99%.

Here's a file of predictions with sets of 3 lines of the form:

ground truth
predicted last_word
is_counted_as_error

Generated by this script

Could this be caused by the way the GPT-2 checkpoint was imported into HuggingFace?
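
For context, a minimal sketch of this kind of greedy last-word evaluation (the linked script itself isn't reproduced here; this assumes the current `transformers` API rather than the 2019 codebase) might look like:

```python
# Minimal sketch of greedy last-word evaluation on LAMBADA, assuming the
# current `transformers` API; not the script linked above.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def predict_last_word(passage):
    """Greedily generate a completion for the final word of a LAMBADA passage."""
    context, target = passage.rsplit(" ", 1)              # hold out the last word
    input_ids = tokenizer.encode(context, return_tensors="pt")
    with torch.no_grad():
        # the target may span several BPE tokens, so generate a few of them
        out = model.generate(input_ids, max_new_tokens=5, do_sample=False)
    completion = tokenizer.decode(out[0, input_ids.shape[1]:])
    predicted = completion.strip().split(" ")[0]           # first generated word
    return predicted, target

# accuracy = fraction of passages where `predicted == target`
```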

@yaroslavvb
Contributor Author

Accuracy goes up to 31% if I use a stop-word filter, but that still seems lower than expected (predictions)
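
The exact stop-word filter OpenAI used isn't documented; one plausible reading, sketched below, is to discard predicted function words and keep the first content word. The word list and behaviour here are guesses, not the filter used for the 31% number.

```python
# Hedged sketch of a stop-word filter: ignore predicted function words and
# take the first content word instead. The word list is purely illustrative.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "that", "it",
              "he", "she", "was", "is", "his", "her", "had", "not"}

def apply_stop_word_filter(candidate_words):
    """Return the first candidate that is not a stop word, else the first candidate."""
    for word in candidate_words:
        if word.lower() not in STOP_WORDS:
            return word
    return candidate_words[0]

# e.g. apply_stop_word_filter(["the", "radio"]) -> "radio"
```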

@yaroslavvb changed the title from "pretrained GPT-2 checkpoint gets only 26% accuracy on Lambada" to "pretrained GPT-2 checkpoint gets only 31% accuracy on Lambada" on Apr 16, 2019
@thomwolf
Member

thomwolf commented Apr 16, 2019

Hi, I doubt it's a problem with the model. Usually the culprit is to be found in the pre-processing logic.

Your dataset seems to be pre-processed, but Radford, Wu et al. say they use a version without preprocessing (end of section 3.3). GPT-2 is likely sensitive to tokenization issues and the like.

If you want to check the model itself, you could try comparing with the predictions of the TensorFlow version on a few LAMBADA completions.

@yaroslavvb
Contributor Author

Applying detokenization raises accuracy to 33.11%.

I spot-checked a few errors against the TF implementation and it gives the same errors, so it seems likely the difference is due to the eval protocol rather than the checkpoint.

@yaroslavvb
Contributor Author

IMHO "without pre-processing" means taking the original dataset without modification, which is also what I did here.

However, in the original dataset everything is tokenized, i.e. "haven't" was turned into "have n't".
Either way, undoing this tokenization only gives an improvement of about 2%, so there must be some deeper underlying difference in the way OpenAI did their evaluation.
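
For reference, undoing this PTB-style tokenization can be approximated with a few regular-expression rules; the sketch below is only a heuristic, not the exact detokenizer behind the 33.11% figure.

```python
# Rough detokenizer for the PTB-style tokenization in the LAMBADA release,
# e.g. "have n't" -> "haven't". Heuristic sketch only.
import re

def detokenize(text):
    text = re.sub(r" n't\b", "n't", text)                  # "have n't" -> "haven't"
    text = re.sub(r" '(s|re|ve|ll|d|m)\b", r"'\1", text)   # "it 's"    -> "it's"
    text = re.sub(r" ([.,!?;:])", r"\1", text)             # no space before punctuation
    text = re.sub(r"`` ?| ?''", '"', text)                 # PTB quotes -> plain quotes
    return text
```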

@thomwolf
Member

thomwolf commented Apr 17, 2019

Indeed. It's not very clear to me what they mean exactly by "stop-word filter". It seems like the kind of heuristic that can have a very large impact on performance.

Maybe better filtering is the key. I would probably go with a sort of beam search to compute the probability of having a punctuation/end-of-sentence token after the predicted word, and use that to filter the results.
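
A sketch of only the scoring step of that idea (the beam search and any threshold are left out, and this is just one way to read the suggestion):

```python
# Score a candidate word by how likely a punctuation token is to follow it,
# and filter out low-scoring candidates. Only the scoring step is shown.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
PUNCT_IDS = [tokenizer.encode(p)[0] for p in [".", ",", "!", "?", ";"]]

def punctuation_score(context, candidate):
    """P(punctuation as the next token | context + candidate) under GPT-2."""
    ids = tokenizer.encode(context + " " + candidate, return_tensors="pt")
    with torch.no_grad():
        logits = model(ids).logits[0, -1]
    probs = torch.softmax(logits, dim=-1)
    return probs[PUNCT_IDS].sum().item()
```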

@thomwolf added the Discussion and GPT-2 labels on Apr 17, 2019
@yaroslavvb reopened this on Apr 19, 2019
@yaroslavvb
Contributor Author

yaroslavvb commented May 11, 2019

I spoke with Alec, and it turns out that for evaluation they used the "raw" LAMBADA corpus, which was obtained by finding the original sentences in the book corpus that matched the tokenized versions in the LAMBADA release. So to reproduce the numbers we need the "raw" corpus: openai/gpt-2#131

@yaroslavvb
Contributor Author

I'm now able to get within 1% of their reported accuracy on GPT2-small. The two missing modifications were:

  1. Evaluate on OpenAI's version of LAMBADA, which adds extra formatting.
  2. Evaluate by counting the number of times the last BPE token is predicted incorrectly, instead of the last word; details are in "Release raw lambada dataset" openai/gpt-2#131 (comment).
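
For concreteness, the last-BPE-token protocol in point 2 can be sketched as follows (against the current `transformers` API, not OpenAI's own eval code):

```python
# Sketch of the last-BPE-token protocol: condition on every token except the
# final one, and score only whether that final BPE token is predicted by argmax.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def last_token_correct(passage):
    """True if the argmax prediction for the final BPE token matches the actual token."""
    ids = tokenizer.encode(passage, return_tensors="pt")
    with torch.no_grad():
        logits = model(ids[:, :-1]).logits        # condition on all but the last token
    predicted = logits[0, -1].argmax().item()
    return predicted == ids[0, -1].item()

# accuracy = mean(last_token_correct(p) for p in lambada_passages)
```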

@KeyKy

KeyKy commented Jul 31, 2024

> I'm now able to get within 1% of their reported accuracy on GPT2-small. The two missing modifications were:
>
>   1. Evaluate on OpenAI's version of LAMBADA, which adds extra formatting.
>   2. Evaluate by counting the number of times the last BPE token is predicted incorrectly, instead of the last word; details are in "Release raw lambada dataset" openai/gpt-2#131 (comment).

"Evaluate the last BPE token" maybe wrong, because the prediction maybe use the information of ground turth when the last word is tokenized to two token.
