Updated OPUS Open Subtitles Dataset with metadata information #1865

Merged: lhoestq merged 2 commits into huggingface:master from opus-ost-metadata on Feb 12, 2021

Conversation

Valahaar
Contributor

@Valahaar Valahaar commented Feb 11, 2021

Close #1844

Problems:

  • I ran python datasets-cli test datasets/open_subtitles --save_infos --all_configs, hence the change in dataset_infos.json, but it appears that the metadata features have not been added for all pairs. Any idea why that might be?
  • Possibly related to the above: after making the changes I ran pip uninstall datasets && pip install -e ".[dev]", then loaded the dataset via load_dataset("open_subtitles", lang1='hi', lang2='it') to check whether the update worked. The loaded dataset did not contain the metadata fields, neither in the features nor in the output of next(iter(dataset['train'])). What step(s) did I miss?

Questions:

  • Is it ok to have a classmethod in there? I have not seen any in the few other datasets I have checked. I could make it a local method of the _generate_examples method, but I'd rather not duplicate the logic...

@lhoestq
Member

lhoestq commented Feb 12, 2021

Hi!
About the problems you mentioned:

  • The infos are only saved for the configurations listed in BUILDER_CONFIGS. Otherwise you would need to run the script on ALL language pairs, which is not what we want.
  • Also, when you're on your branch, please specify the path to your local version of the dataset script, like "./datasets/open_subtitles". Otherwise the dataset is loaded from the master branch on GitHub.

Hope that clarifies things a bit.
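
To make the second point concrete, here is a minimal sketch of preferring the local script over the one on master. The helper names resolve_dataset_path and load_open_subtitles are hypothetical (they are not part of the datasets library), and the sketch assumes you run it from the root of your checkout with a version of the library that still loads dataset scripts from a local path.

```python
from pathlib import Path


def resolve_dataset_path(name: str, repo_root: str = ".") -> str:
    """Hypothetical helper: prefer the local dataset script directory if it exists."""
    local_script_dir = Path(repo_root) / "datasets" / name
    if local_script_dir.is_dir():
        # Passing a local path makes load_dataset use the script from this checkout.
        return local_script_dir.as_posix()
    # A bare name is resolved remotely, i.e. from the master branch as noted above.
    return name


def load_open_subtitles(repo_root: str = "."):
    """Hypothetical wrapper around the call from the PR description."""
    # Imported lazily so the path helper can be used without the library installed.
    from datasets import load_dataset

    path = resolve_dataset_path("open_subtitles", repo_root)
    return load_dataset(path, lang1="hi", lang2="it")
```

Calling load_open_subtitles() from the repository root should then pick up the modified script, so the metadata fields can be checked in the features of dataset["train"].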

And of course feel free to add methods or classmethods to your builder.
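
As a rough illustration of sharing logic through a classmethod, here is a sketch using a plain stand-in class. SubtitlesBuilderSketch, _parse_ids, and the row layout are all hypothetical and do not reproduce the code in this PR; a real builder would subclass the library's builder class instead.

```python
class SubtitlesBuilderSketch:
    """Stand-in for a dataset builder; not the real open_subtitles builder."""

    @classmethod
    def _parse_ids(cls, raw: str) -> list:
        # Hypothetical shared helper: turn "1 2 3" into [1, 2, 3].
        return [int(token) for token in raw.split()]

    def _generate_examples(self, rows):
        # Each row is assumed to be a (source_ids, target_ids) pair of strings.
        for idx, (source, target) in enumerate(rows):
            yield idx, {
                "source_ids": self._parse_ids(source),
                "target_ids": self._parse_ids(target),
            }
```

Because _parse_ids lives on the class, it can be reused from _generate_examples or anywhere else in the builder without duplicating the parsing logic.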

Member

@lhoestq lhoestq left a comment


Thanks for adding the metadata! This is really cool.

@lhoestq lhoestq merged commit 3e6afdd into huggingface:master Feb 12, 2021
@Valahaar Valahaar deleted the opus-ost-metadata branch February 12, 2021 17:37
@Valahaar
Contributor Author

Great! Thank you :)
I'll close the issue as well.


Successfully merging this pull request may close these issues.

Update Open Subtitles corpus with original sentence IDs
2 participants