Clarify Lambada Task #356
Labels: bug, documentation, help wanted
When OpenAI created GPT-2, they also created a custom, non-standard LAMBADA evaluation dataset. OpenAI also changed the evaluation metric: they count how often the last BPE token is predicted incorrectly, rather than the last word. This produces a huge difference in performance, over 10%. They used this easier version of LAMBADA for evaluating GPT-3 as well. For more details, see here and here.
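For concreteness, here is a minimal sketch contrasting the two scoring rules (function names are hypothetical, not the harness's actual API):

```python
from typing import List

def standard_lambada_correct(greedy_tokens: List[int],
                             target_word_tokens: List[int]) -> bool:
    # Standard LAMBADA: conditioned only on the context *before* the final
    # word, the model must greedily produce every BPE token of that word.
    return greedy_tokens[:len(target_word_tokens)] == target_word_tokens

def openai_lambada_correct(greedy_last_token: int,
                           target_word_tokens: List[int]) -> bool:
    # OpenAI variant: the model also sees all but the last BPE token of the
    # final word and only has to predict that single token, which is strictly
    # easier whenever the word spans more than one token.
    return greedy_last_token == target_word_tokens[-1]
```

For example, if a target word hypothetically tokenizes as `["back", "pack"]`, the OpenAI rule credits a model that merely continues `...back` with `pack`, while the standard rule requires producing the whole word.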
According to @jon-tow, we implement the OpenAI version, not the standard version. We should implement both, and call them `lambada_standard` and `lambada_openai` respectively. In particular, we should not implement a task called `lambada`: years after the fact this discrepancy is still causing widespread confusion, and we want to force the user to pay attention to which version they are running.
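One way this could look in the task registry (a sketch only; class and registry names here are illustrative, not a final design):

```python
# Sketch only: class and registry names are hypothetical.

class LambadaStandard:
    """Original LAMBADA dataset; scored on the full final word."""

class LambadaOpenAI:
    """OpenAI's preprocessed variant; scored on the final BPE token."""

TASK_REGISTRY = {
    "lambada_standard": LambadaStandard,
    "lambada_openai": LambadaOpenAI,
    # Deliberately no bare "lambada" key: requesting it should fail with a
    # message pointing users at the two explicit variants.
}
```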