
Add CMU Hinglish DoG Dataset for MT #3149

Merged · 9 commits · Nov 15, 2021

Conversation

Ishan-Kumar2 (Contributor)

Addresses part of #2841

Added the CMU Hinglish DoG Dataset as in GLUECoS. Added it as a separate dataset because, unlike the other GLUECoS tasks, it can't be evaluated with a BERT-like model.
It consists of a parallel corpus between Hinglish (code-mixed Hindi-English) and English, and can be used for machine translation between the two.

The data processing part is inspired by the GLUECoS repo here.
The dummy data part is not working properly; it shows

UnboundLocalError: local variable 'generator_splits' referenced before assignment

when I run without --auto_generate.

Please let me know how I can fix that.
Thanks

@lhoestq lhoestq (Member) left a comment

Hi ! I think the error happens because files are opened in _split_generators.

The dummy_data command should have printed something like

Your dataset seems to already open files in the method `_split_generators(...)`.
You might consider to instead only open files in the method `_generate_examples(...)` instead

To fix that, I think you can move all the logic you added in _split_generators (lines 93 to 119) into _generate_examples. In particular, try not to create new .txt files, but rather yield examples directly from the json files.

You can pass all the variables you need from _split_generators to _generate_examples. In particular, if you need to pass the paths to the directories that contain your data, you can do

        hi_dirs = {
            "train": os.path.join(data_dir_hi_en, "train"),
            "valid": os.path.join(data_dir_hi_en, "valid"),
            "test": os.path.join(data_dir_hi_en, "test"),
        }

        return [
            datasets.SplitGenerator(name=datasets.Split.TRAIN, gen_kwargs={"hi_dir": hi_dirs["train"], "en_dir": os.path.join(data_dir_en, "train")}),
            datasets.SplitGenerator(name=datasets.Split.TEST, gen_kwargs={"hi_dir": hi_dirs["test"], "en_dir": os.path.join(data_dir_en, "test")}),
            datasets.SplitGenerator(name=datasets.Split.VALIDATION, gen_kwargs={"hi_dir": hi_dirs["valid"], "en_dir": os.path.join(data_dir_en, "valid")}),
        ]

    def _generate_examples(self, hi_dir, en_dir):
        ...
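For illustration, _generate_examples could then yield translation pairs directly from the json files. This is only a hypothetical sketch: the "utterances" field name, the file pairing by filename, and the translation dict layout are assumptions for the example, not the actual CMU DoG schema.

```python
import json
import os
import tempfile


def _generate_examples(hi_dir, en_dir):
    """Pair Hinglish and English json files by name and yield one example per utterance."""
    for fname in sorted(os.listdir(hi_dir)):
        hi_path = os.path.join(hi_dir, fname)
        en_path = os.path.join(en_dir, fname)
        if not os.path.exists(en_path):
            continue  # skip dialogues missing on the English side
        with open(hi_path, encoding="utf-8") as f_hi, open(en_path, encoding="utf-8") as f_en:
            hi_doc = json.load(f_hi)
            en_doc = json.load(f_en)
        # yield directly from the loaded json, without writing intermediate .txt files
        for idx, (hi_utt, en_utt) in enumerate(zip(hi_doc["utterances"], en_doc["utterances"])):
            key = f"{fname}-{idx}"
            yield key, {"id": key, "translation": {"hi_en": hi_utt, "en": en_utt}}


# Tiny demo on synthetic data (the file contents are made up):
with tempfile.TemporaryDirectory() as hi_dir, tempfile.TemporaryDirectory() as en_dir:
    with open(os.path.join(hi_dir, "d1.json"), "w", encoding="utf-8") as f:
        json.dump({"utterances": ["kya haal hai?"]}, f)
    with open(os.path.join(en_dir, "d1.json"), "w", encoding="utf-8") as f:
        json.dump({"utterances": ["how are you?"]}, f)
    examples = list(_generate_examples(hi_dir, en_dir))
```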

I hope that helps !
Let me know if you have other questions or if I can help

@Ishan-Kumar2 (Contributor, Author)

Hi @lhoestq, thanks a lot for the help. I have moved that logic as suggested.
However, when running the dummy data script, I still face this issue:

Traceback (most recent call last):
  File "/home/ishan/anaconda3/bin/datasets-cli", line 8, in <module>
    sys.exit(main())
  File "/home/ishan/anaconda3/lib/python3.8/site-packages/datasets/commands/datasets_cli.py", line 33, in main
    service.run()
  File "/home/ishan/anaconda3/lib/python3.8/site-packages/datasets/commands/dummy_data.py", line 318, in run
    self._autogenerate_dummy_data(
  File "/home/ishan/anaconda3/lib/python3.8/site-packages/datasets/commands/dummy_data.py", line 363, in _autogenerate_dummy_data
    dataset_builder._prepare_split(split_generator)
  File "/home/ishan/anaconda3/lib/python3.8/site-packages/datasets/builder.py", line 1103, in _prepare_split
    example = self.info.features.encode_example(record)
  File "/home/ishan/anaconda3/lib/python3.8/site-packages/datasets/features/features.py", line 981, in encode_example
    return encode_nested_example(self, example)
  File "/home/ishan/anaconda3/lib/python3.8/site-packages/datasets/features/features.py", line 775, in encode_nested_example
    return {
  File "/home/ishan/anaconda3/lib/python3.8/site-packages/datasets/features/features.py", line 775, in <dictcomp>
    return {
  File "/home/ishan/anaconda3/lib/python3.8/site-packages/datasets/utils/py_utils.py", line 99, in zip_dict
    yield key, tuple(d[key] for d in dicts)
  File "/home/ishan/anaconda3/lib/python3.8/site-packages/datasets/utils/py_utils.py", line 99, in <genexpr>
    yield key, tuple(d[key] for d in dicts)
KeyError: 'status'

The KeyError is sometimes on a key other than 'status'. It happens when I run

datasets-cli dummy_data datasets/cmu_hinglish_dog --auto_generate --json_field='history'

I have tried removing unnecessary feature type definitions, but that didn't help. Please let me know if I am missing something, thanks!

@lhoestq lhoestq (Member) left a comment

Nice ! It looks much cleaner now :)

I think you don't need to pass --json_field='history'; otherwise all the other fields like "status" are ignored. Can you try again without this parameter?
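That would explain the traceback: feature encoding iterates over the declared feature keys and looks each one up in the yielded example, so any declared field stripped from the data raises a KeyError. A simplified stand-in for zip_dict (not the library's exact code) reproduces the failure mode:

```python
def zip_dict(*dicts):
    # Iterate the keys of the first dict (the declared features) and look
    # them up in every dict, like datasets.utils.py_utils.zip_dict does.
    for key in dicts[0]:
        yield key, tuple(d[key] for d in dicts)


declared = {"status": "string", "history": "sequence"}  # hypothetical declared features
example = {"history": ["hello"]}  # "status" was dropped by --json_field='history'

try:
    dict(zip_dict(declared, example))
    missing = None
except KeyError as err:
    missing = err.args[0]  # the declared key absent from the example
```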

Also some more minor comments:

Review threads on datasets/cmu_hinglish_dog/cmu_hinglish_dog.py: resolved.
@lhoestq lhoestq (Member) left a comment

Thanks for the changes :)

I just fixed the feature type of the "translation" field to use the Translation type, and regenerated the dataset_infos.json

I also added more information in the dataset card

I think it's all good now :) thanks for the contribution !

@lhoestq (Member) commented Nov 15, 2021

The CI failure is unrelated to this PR and is fixed on master. Merging !

@lhoestq lhoestq merged commit 07abca2 into huggingface:master Nov 15, 2021
@Ishan-Kumar2 Ishan-Kumar2 deleted the CMUHinglishDoG branch November 15, 2021 11:36