
Add CMU Hinglish DoG Dataset for MT #3149

Merged · 9 commits · Nov 15, 2021

Conversation

Ishan-Kumar2 (Contributor)

Addresses part of #2841

Added the CMU Hinglish DoG Dataset as in GLUECoS. Added it as a separate dataset because, unlike the other GLUECoS tasks, it can't be evaluated with a BERT-like model.
It consists of a parallel corpus between Hinglish (code-mixed Hindi-English) and English, and can be used for machine translation between the two.

The data processing part is inspired by the GLUECoS repo here.
The dummy data part is not working properly; it shows

UnboundLocalError: local variable 'generator_splits' referenced before assignment

when I run without --auto_generate.

Please let me know how I can fix that.
Thanks

@lhoestq lhoestq (Member) left a comment

Hi ! I think the error happens because files are opened in _split_generators.

The dummy_data command should have printed something like

Your dataset seems to already open files in the method `_split_generators(...)`.
You might consider to instead only open files in the method `_generate_examples(...)` instead

To fix that, I think you can move all the logic you added in _split_generators (lines 93 to 119) into _generate_examples. In particular, try not to create new .txt files, but rather yield examples directly from the json files.

You can pass all the variables you need from _split_generators to _generate_examples. In particular, if you need to pass the paths to the directories that contain your data, you can do

        hi_dirs = {
            "train": os.path.join(data_dir_hi_en, "train"),
            "valid": os.path.join(data_dir_hi_en, "valid"),
            "test": os.path.join(data_dir_hi_en, "test"),
        }

        return [
            datasets.SplitGenerator(name=datasets.Split.TRAIN, gen_kwargs={"hi_dir": hi_dirs["train"], "en_dir": os.path.join(data_dir_en, "train")}),
            datasets.SplitGenerator(name=datasets.Split.TEST, gen_kwargs={"hi_dir": hi_dirs["test"], "en_dir": os.path.join(data_dir_en, "test")}),
            datasets.SplitGenerator(name=datasets.Split.VALIDATION, gen_kwargs={"hi_dir": hi_dirs["valid"], "en_dir": os.path.join(data_dir_en, "valid")}),
        ]

    def _generate_examples(self, hi_dir, en_dir):
        ...
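For illustration, _generate_examples could then yield translation pairs directly from the json files. This is only a hypothetical sketch: the "utterances" field name, the file pairing by filename, and the translation dict layout are assumptions for the example, not the actual CMU DoG schema.

```python
import json
import os
import tempfile


def _generate_examples(hi_dir, en_dir):
    """Pair Hinglish and English json files by name and yield one example per utterance."""
    for fname in sorted(os.listdir(hi_dir)):
        hi_path = os.path.join(hi_dir, fname)
        en_path = os.path.join(en_dir, fname)
        if not os.path.exists(en_path):
            continue  # skip dialogues missing on the English side
        with open(hi_path, encoding="utf-8") as f_hi, open(en_path, encoding="utf-8") as f_en:
            hi_doc = json.load(f_hi)
            en_doc = json.load(f_en)
        # yield directly from the loaded json, without writing intermediate .txt files
        for idx, (hi_utt, en_utt) in enumerate(zip(hi_doc["utterances"], en_doc["utterances"])):
            key = f"{fname}-{idx}"
            yield key, {"id": key, "translation": {"hi_en": hi_utt, "en": en_utt}}


# Tiny demo on synthetic data (the file contents are made up):
with tempfile.TemporaryDirectory() as hi_dir, tempfile.TemporaryDirectory() as en_dir:
    with open(os.path.join(hi_dir, "d1.json"), "w", encoding="utf-8") as f:
        json.dump({"utterances": ["kya haal hai?"]}, f)
    with open(os.path.join(en_dir, "d1.json"), "w", encoding="utf-8") as f:
        json.dump({"utterances": ["how are you?"]}, f)
    examples = list(_generate_examples(hi_dir, en_dir))
```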

I hope that helps !
Let me know if you have other questions or if I can help

@Ishan-Kumar2 (Contributor, Author)

Hi @lhoestq, thanks a lot for the help. I have moved that logic as suggested.
However, when running the dummy data script, I still face this issue:

Traceback (most recent call last):
  File "/home/ishan/anaconda3/bin/datasets-cli", line 8, in <module>
    sys.exit(main())
  File "/home/ishan/anaconda3/lib/python3.8/site-packages/datasets/commands/datasets_cli.py", line 33, in main
    service.run()
  File "/home/ishan/anaconda3/lib/python3.8/site-packages/datasets/commands/dummy_data.py", line 318, in run
    self._autogenerate_dummy_data(
  File "/home/ishan/anaconda3/lib/python3.8/site-packages/datasets/commands/dummy_data.py", line 363, in _autogenerate_dummy_data
    dataset_builder._prepare_split(split_generator)
  File "/home/ishan/anaconda3/lib/python3.8/site-packages/datasets/builder.py", line 1103, in _prepare_split
    example = self.info.features.encode_example(record)
  File "/home/ishan/anaconda3/lib/python3.8/site-packages/datasets/features/features.py", line 981, in encode_example
    return encode_nested_example(self, example)
  File "/home/ishan/anaconda3/lib/python3.8/site-packages/datasets/features/features.py", line 775, in encode_nested_example
    return {
  File "/home/ishan/anaconda3/lib/python3.8/site-packages/datasets/features/features.py", line 775, in <dictcomp>
    return {
  File "/home/ishan/anaconda3/lib/python3.8/site-packages/datasets/utils/py_utils.py", line 99, in zip_dict
    yield key, tuple(d[key] for d in dicts)
  File "/home/ishan/anaconda3/lib/python3.8/site-packages/datasets/utils/py_utils.py", line 99, in <genexpr>
    yield key, tuple(d[key] for d in dicts)
KeyError: 'status'

The KeyError is sometimes on a key other than 'status'. It happens when I run

datasets-cli dummy_data datasets/cmu_hinglish_dog --auto_generate --json_field='history'

I have tried removing unnecessary feature type definitions, but that didn't help. Please let me know if I am missing something, thanks!

@lhoestq lhoestq (Member) left a comment

Nice ! It looks much cleaner now :)

I think you don't need to pass --json_field='history'; otherwise all the other fields like "status" are ignored. Can you try again without this parameter?
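That would explain the traceback: feature encoding iterates over the declared feature keys and looks each one up in the yielded example, so any declared field stripped from the data raises a KeyError. A simplified stand-in for zip_dict (not the library's exact code) reproduces the failure mode:

```python
def zip_dict(*dicts):
    # Iterate the keys of the first dict (the declared features) and look
    # them up in every dict, like datasets.utils.py_utils.zip_dict does.
    for key in dicts[0]:
        yield key, tuple(d[key] for d in dicts)


declared = {"status": "string", "history": "sequence"}  # hypothetical declared features
example = {"history": ["hello"]}  # "status" was dropped by --json_field='history'

try:
    dict(zip_dict(declared, example))
    missing = None
except KeyError as err:
    missing = err.args[0]  # the declared key absent from the example
```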

Also some more minor comments:

Review threads on datasets/cmu_hinglish_dog/cmu_hinglish_dog.py: resolved.
@lhoestq lhoestq (Member) left a comment

Thanks for the changes :)

I just fixed the feature type of the "translation" field to use the Translation type, and regenerated the dataset_infos.json

I also added more information in the dataset card

I think it's all good now :) thanks for the contribution !

@lhoestq (Member) commented Nov 15, 2021

The CI failure is unrelated to this PR and is fixed on master. Merging !

@lhoestq lhoestq merged commit 07abca2 into huggingface:master Nov 15, 2021
@Ishan-Kumar2 Ishan-Kumar2 deleted the CMUHinglishDoG branch November 15, 2021 11:36