Update tatoeba to v2021-07-22 #3225

Merged on Nov 12, 2021 (6 commits)

KoichiYasuoka
Contributor

Tatoeba's latest version is v2021-07-22

@KoichiYasuoka
Contributor Author

How about this? @lhoestq @abhishekkrthakur

@lhoestq
Member

lhoestq commented Nov 8, 2021

Hi! I think it would be nice if people could still load the old version.
Maybe this can be a parameter? For example, to load the old version they could do

load_dataset("tatoeba", lang1="en", lang2="mr", date="v2020-11-09")

If it sounds good to you, we can add this parameter to the TatoebaConfig:

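class TatoebaConfig(datasets.BuilderConfig):
    def __init__(self, *args, lang1=None, lang2=None, date="v2021-07-22", **kwargs):
        super().__init__(*args, **kwargs)
        # OPUS release tag used to build the download URL, e.g. "v2020-11-09"
        self.date = date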

and then pass the date to the URL

_BASE_URL = "https://object.pouta.csc.fi/OPUS-Tatoeba/{}/moses/{}-{}.txt.zip"

def _base_url(lang1, lang2, date):
    return _BASE_URL.format(date, lang1, lang2)

What do you think?

@KoichiYasuoka
Contributor Author

_DATE = "v" + "-".join(s.zfill(2) for s in _VERSION.split(".")) seems rather tricky but works well. How about this? @lhoestq

Member

@lhoestq lhoestq left a comment

Looks all good, thanks!

I just renamed the parent directory of the dummy data to match the new version name, and mentioned how to change the date in the dataset card.

@lhoestq
Member

lhoestq commented Nov 12, 2021

The CI is only failing because of the missing sections in the dataset card and because of an issue with the CER metric that is unrelated to this PR.

@lhoestq lhoestq merged commit ab20ef7 into huggingface:master Nov 12, 2021