Update tatoeba to v2021-07-22 #3225
How about this? @lhoestq @abhishekkrthakur
Hi! I think it would be nice if people could still load the old version:

    load_dataset("tatoeba", lang1="en", lang2="mr", date="v2020-11-09")

If it sounds good to you, we can add this parameter to the TatoebaConfig:

    class TatoebaConfig(datasets.BuilderConfig):
        def __init__(self, *args, lang1=None, lang2=None, date="v2021-07-22", **kwargs):
            self.date = date

and then pass the date to the URL:

    _BASE_URL = "https://object.pouta.csc.fi/OPUS-Tatoeba/{}/moses/{}-{}.txt.zip"

    def _base_url(lang1, lang2, date):
        return _BASE_URL.format(date, lang1, lang2)

What do you think?
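To make the suggestion concrete, here is a minimal, self-contained sketch of how the `date` value could flow from the config into the download URL. `TatoebaConfigSketch` is a hypothetical stand-in for the real `TatoebaConfig(datasets.BuilderConfig)`, used only so the snippet runs without the `datasets` library; the URL template, `_base_url`, and the two date strings come from the comment above.

```python
# Sketch only: TatoebaConfigSketch is a hypothetical stand-in for
# TatoebaConfig(datasets.BuilderConfig), so this runs without `datasets`.
_BASE_URL = "https://object.pouta.csc.fi/OPUS-Tatoeba/{}/moses/{}-{}.txt.zip"


class TatoebaConfigSketch:
    def __init__(self, lang1=None, lang2=None, date="v2021-07-22"):
        self.lang1 = lang1
        self.lang2 = lang2
        # Release tag; it becomes the first path component of the URL.
        self.date = date


def _base_url(lang1, lang2, date):
    # Argument order follows the placeholders: date, then the language pair.
    return _BASE_URL.format(date, lang1, lang2)


# Default config points at the new release.
default_cfg = TatoebaConfigSketch(lang1="en", lang2="mr")
# Passing an explicit date selects the older release.
old_cfg = TatoebaConfigSketch(lang1="en", lang2="mr", date="v2020-11-09")

default_url = _base_url(default_cfg.lang1, default_cfg.lang2, default_cfg.date)
old_url = _base_url(old_cfg.lang1, old_cfg.lang2, old_cfg.date)

print(default_url)
print(old_url)
```

With this shape, users who pass nothing get the latest release, while anyone who needs to reproduce earlier results can still pin the previous one.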
Looks all good, thanks!
I just renamed the parent directory of the dummy data to match the new version name, and mentioned in the dataset card how to change the date.
The CI is only failing because of the missing sections in the dataset card, and because of an issue with the CER metric that is unrelated to this PR.
Tatoeba's latest version is v2021-07-22