Add CMU Hinglish DoG Dataset for MT #3149
Hi! I think the error happens because some files are opened in `_split_generators`.
The dummy_data command should have printed something like
Your dataset seems to already open files in the method `_split_generators(...)`.
You might consider to instead only open files in the method `_generate_examples(...)` instead
To fix that, I think you can move all the logic you added in `_split_generators` (lines 93 to 119) into `_generate_examples` instead. In particular, try not to create new .txt files, but rather yield examples directly from the JSON files.
You can pass all the variables you need from `_split_generators` to `_generate_examples`. For example, if you need to pass the paths of the directories that contain your data, you can do
    # In _split_generators: only build the paths here, don't open any files.
    hi_dirs = {
        "train": os.path.join(data_dir_hi_en, "train"),
        "valid": os.path.join(data_dir_hi_en, "valid"),
        "test": os.path.join(data_dir_hi_en, "test"),
    }
    return [
        datasets.SplitGenerator(name=datasets.Split.TRAIN, gen_kwargs={"hi_dir": hi_dirs["train"], "en_dir": os.path.join(data_dir_en, "train")}),
        datasets.SplitGenerator(name=datasets.Split.VALIDATION, gen_kwargs={"hi_dir": hi_dirs["valid"], "en_dir": os.path.join(data_dir_en, "valid")}),
        datasets.SplitGenerator(name=datasets.Split.TEST, gen_kwargs={"hi_dir": hi_dirs["test"], "en_dir": os.path.join(data_dir_en, "test")}),
    ]

def _generate_examples(self, hi_dir, en_dir):
    # The keys of gen_kwargs arrive here as keyword arguments.
    ...
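To make the "yield directly from the JSON files" idea concrete, here is a minimal standalone sketch of what the body of `_generate_examples` could look like. It is written as a plain function so it runs on its own. The file layout is an assumption, not taken from the PR: files with the same name in both directories, each holding a JSON object with a "history" list whose items have a "text" string. The "en" and "hi_en" keys are also only illustrative.

```python
import json
import os


def generate_examples(hi_dir, en_dir):
    """Yield (key, example) pairs straight from the JSON files.

    Assumed (hypothetical) layout: both directories contain files with the
    same names, and each file is a JSON object with a "history" list whose
    items carry a "text" string.
    """
    for filename in sorted(os.listdir(hi_dir)):
        en_path = os.path.join(en_dir, filename)
        if not filename.endswith(".json") or not os.path.isfile(en_path):
            continue  # skip anything without an English counterpart
        # Files are opened here, at generation time, not in _split_generators.
        with open(os.path.join(hi_dir, filename), encoding="utf-8") as f:
            hi_turns = json.load(f)["history"]
        with open(en_path, encoding="utf-8") as f:
            en_turns = json.load(f)["history"]
        for i, (hi, en) in enumerate(zip(hi_turns, en_turns)):
            key = f"{filename}-{i}"  # unique key per example
            yield key, {"translation": {"en": en["text"], "hi_en": hi["text"]}}
```

No intermediate .txt files are written: each example is built in memory and yielded as soon as its two source files have been read.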
I hope that helps !
Let me know if you have other questions or if I can help
Hi @lhoestq, thanks a lot for the help. I have moved the part as suggested.
The key reported in this KeyError is not always 'status'; at times it is a different one.
I have tried removing the unnecessary feature type definitions, but that didn't help. Please let me know if I am missing something, thanks!
Nice! It looks much cleaner now :)
I think you don't have to pass `--json_field='history'`; otherwise it will ignore all the other fields like "status". Can you try again without passing this parameter?
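As a rough illustration of why selecting a single field can surface as a KeyError, the sketch below keeps only "history" from a record and then tries to read "status". This is not how the command is implemented; the record is made up, and only the "history" and "status" key names come from this thread.

```python
import json

# Hypothetical record; only the "history" and "status" keys come from this thread.
raw = '{"status": 1, "history": [{"text": "hello"}]}'
record = json.loads(raw)

# Keeping a single field drops every other top-level field.
selected = {"history": record["history"]}


def read_status(rec):
    """Return the "status" value, or a readable message if the key is missing."""
    try:
        return rec["status"]
    except KeyError as err:
        return f"KeyError: {err}"


print(read_status(record))    # 1
print(read_status(selected))  # KeyError: 'status'
```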
Also some more minor comments:
Thanks for the changes :)
I just fixed the feature type of the "translation" field to use the `Translation` type, and regenerated the dataset_infos.json
I also added more information in the dataset card
I think it's all good now :) thanks for the contribution !
The CI failure is unrelated to this PR and is fixed on master. Merging!
Addresses part of #2841
Added the CMU Hinglish DoG Dataset as in GLUECoS. I added it as a separate dataset because, unlike the other GLUECoS tasks, it can't be evaluated with a BERT-like model.
It consists of a parallel dataset between Hinglish (Hindi-English) and English, and can be used for Machine Translation between the two.
The data processing part is inspired by the GLUECoS repo here
The dummy data part is not working properly. It shows
UnboundLocalError: local variable 'generator_splits' referenced before assignment
when I run without `--auto_generate`. Please let me know how I can fix that.
Thanks