Updated OPUS Open Subtitles Dataset with metadata information #1865

Valahaar · 2021-02-11T13:26:26Z

Close #1844

Problems:

I ran python datasets-cli test datasets/open_subtitles --save_infos --all_configs, hence the change in dataset_infos.json, but it appears that the metadata features have not been added for all pairs. Any idea why that might be?
Possibly related to the above: after making the changes, I ran pip uninstall datasets && pip install -e ".[dev]" and loaded the dataset via load_dataset("open_subtitles", lang1='hi', lang2='it') to check whether the update worked, but the loaded dataset did not contain the metadata fields (neither in the features nor in the output of next(iter(dataset['train']))). What step(s) did I miss?

Questions:

Is it ok to have a classmethod in there? I have not seen any in the few other datasets I have checked. I could make it a local function inside the _generate_examples method, but I'd rather not duplicate the logic...

…imdb id, open subtitles id, sentence ids)

lhoestq · 2021-02-12T16:42:17Z

Hi!
About the problems you mentioned:

Saving the infos is only done for the configurations listed in BUILDER_CONFIGS. Covering every pair would require running the script on ALL language pairs, which is not what we want.
Moreover, when you're on your branch, please specify the path to your local version of the dataset script, like "./datasets/open_subtitles". Otherwise the dataset is loaded from the master branch on GitHub.
Hope that clarifies things a bit
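
For reference, the second point can be sketched as below. This is a minimal sketch, assuming a datasets version that still loads script-based datasets from a local path and a working directory at the root of the repository checkout. The helper name load_local_open_subtitles is hypothetical; only load_dataset, the "./datasets/open_subtitles" path and the lang1/lang2 arguments come from this thread.

```python
def load_local_open_subtitles(script_path="./datasets/open_subtitles", lang1="hi", lang2="it"):
    """Load the dataset from the local script rather than the copy on master.

    Hypothetical helper: the name and defaults are illustrative only.
    """
    # Imported lazily so that defining this helper does not require
    # the datasets package to be importable.
    from datasets import load_dataset

    # Passing a local path makes load_dataset use the script in the checkout,
    # so changes made on the branch are picked up.
    return load_dataset(script_path, lang1=lang1, lang2=lang2)


# Possible usage from the repository root (not executed here):
#   dataset = load_local_open_subtitles()
#   print(dataset["train"].features)
#   print(next(iter(dataset["train"])))
```

Printing the features and the first example is the same check described in the PR description, only run against the local script.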

And of course feel free to add methods or classmethods to your builder.
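
As an illustration of the pattern being discussed, here is a minimal sketch of a classmethod shared by _generate_examples so the parsing logic lives in one place. The class SubtitlePairBuilder, the helper _extract_ids and the field names are all hypothetical stand-ins, not the actual open_subtitles script.

```python
class SubtitlePairBuilder:
    """Hypothetical stand-in for a dataset builder (not the real script)."""

    @classmethod
    def _extract_ids(cls, raw_ids):
        """Turn a whitespace-separated id string such as "1 2" into [1, 2]."""
        return [int(token) for token in raw_ids.split()]

    def _generate_examples(self, rows):
        """Yield (key, example) pairs, reusing _extract_ids for both sides."""
        for idx, (src_ids, tgt_ids, src_text, tgt_text) in enumerate(rows):
            yield idx, {
                "source_sentence_ids": self._extract_ids(src_ids),
                "target_sentence_ids": self._extract_ids(tgt_ids),
                "source": src_text,
                "target": tgt_text,
            }


# Example call with made-up rows:
#   list(SubtitlePairBuilder()._generate_examples([("1 2", "3", "a", "b")]))
```

Because the helper does not depend on instance state, it can also be called directly on the class, which keeps it easy to check in isolation.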

lhoestq

Thanks for the addition of the metadata! This is really cool

Valahaar · 2021-02-12T17:38:24Z

Great! Thank you :)
I'll close the issue as well.

Updated OPUS Open Subtitles Dataset with metadata information (year, …

0102656

…imdb id, open subtitles id, sentence ids)

remove old configs

bfbe35e

lhoestq approved these changes Feb 12, 2021

lhoestq merged commit 3e6afdd into huggingface:master Feb 12, 2021

Valahaar deleted the opus-ost-metadata branch February 12, 2021 17:37

Valahaar mentioned this pull request Feb 12, 2021

Update Open Subtitles corpus with original sentence IDs #1844

Closed
