Clarify Lambada Task #356
Labels: bug, documentation, help wanted
When OpenAI created GPT-2, they also created a custom, non-standard LAMBADA evaluation dataset. OpenAI also changed the evaluation metric: they count how often the last BPE token is predicted incorrectly, rather than the last word. This produces a huge difference in performance, over 10%. They used this easier version of LAMBADA for evaluating GPT-3 as well. For more details, see here and here.
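For concreteness, here is a minimal sketch contrasting the two scoring rules (function names are hypothetical, not the harness's actual API):

```python
from typing import List

def standard_lambada_correct(greedy_tokens: List[int],
                             target_word_tokens: List[int]) -> bool:
    # Standard LAMBADA: conditioned only on the context *before* the final
    # word, the model must greedily produce every BPE token of that word.
    return greedy_tokens[:len(target_word_tokens)] == target_word_tokens

def openai_lambada_correct(greedy_last_token: int,
                           target_word_tokens: List[int]) -> bool:
    # OpenAI variant: the model also sees all but the last BPE token of the
    # final word and only has to predict that single token, which is strictly
    # easier whenever the word spans more than one token.
    return greedy_last_token == target_word_tokens[-1]
```

For example, if a target word hypothetically tokenizes as `["back", "pack"]`, the OpenAI rule credits a model that merely continues `...back` with `pack`, while the standard rule requires producing the whole word.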
According to @jon-tow, we implement the OpenAI version, not the standard version. We should implement both, and call them `lambada_standard` and `lambada_openai` respectively. In particular, we should not implement a task called `lambada`: years after the fact this discrepancy is still causing widespread confusion, and we want to force the user to pay attention to which version they are running.
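One way this could look in the task registry (a sketch only; class and registry names here are illustrative, not a final design):

```python
# Sketch only: class and registry names are hypothetical.

class LambadaStandard:
    """Original LAMBADA dataset; scored on the full final word."""

class LambadaOpenAI:
    """OpenAI's preprocessed variant; scored on the final BPE token."""

TASK_REGISTRY = {
    "lambada_standard": LambadaStandard,
    "lambada_openai": LambadaOpenAI,
    # Deliberately no bare "lambada" key: requesting it should fail with a
    # message pointing users at the two explicit variants.
}
```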