Add CCAligned Multilingual Dataset #1815
Hi! We already have some datasets that can have many, many possible configurations. For testing, only the configurations defined in the

So what I would do in your case is have something like

```python
import datasets


class CCAlignedConfig(datasets.BuilderConfig):
    def __init__(self, *args, documents_or_sentences=None, language_code=None, **kwargs):
        # Build the config name from the two parameters, e.g. "documents-ak_GH".
        super().__init__(
            *args,
            name=f"{documents_or_sentences}-{language_code}",
            **kwargs,
        )
        self.documents_or_sentences = documents_or_sentences
        self.language_code = language_code
```

And of course, feel free to change/rename things if you want to. In particular I think we can improve the name of the parameter
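To illustrate how a config built this way derives its name, here is a minimal sketch. It uses a small stand-in base class instead of `datasets.BuilderConfig` so that it runs without the library installed; the stand-in class and the example values are assumptions made for illustration, not part of the actual dataset script.

```python
class StandInBuilderConfig:
    """Hypothetical stand-in for datasets.BuilderConfig; it only stores a name."""

    def __init__(self, name="default", **kwargs):
        self.name = name


class CCAlignedConfig(StandInBuilderConfig):
    def __init__(self, *args, documents_or_sentences=None, language_code=None, **kwargs):
        # The config name combines the data type and the language code.
        super().__init__(*args, name=f"{documents_or_sentences}-{language_code}", **kwargs)
        self.documents_or_sentences = documents_or_sentences
        self.language_code = language_code


config = CCAlignedConfig(documents_or_sentences="documents", language_code="ak_GH")
print(config.name)  # documents-ak_GH
```

Because the name is derived from the two keyword arguments, every combination gets a distinct config name without having to list them all by hand.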
Hi @lhoestq, Thanks a lot! I don't know why I didn't think about that. :P
Hi @lhoestq, I have tested and added dummy files. Requesting your review. Also, does this mean BUILDER_CONFIGS is only needed while testing?
Hi @lhoestq, Any changes required on this one? Thanks,
Hi, sorry for the delay ^^'
That's awesome, thanks! Good job with the dataset config.
I left a few comments.
Also, could you try to reduce the size of the dummy data for documents-ak_GH, please? It's currently 3.8MB and it would be awesome to have something less than 20KB.
Hi @lhoestq, Sorry for the delay. I have added the changes from the review. For the ISO format language codes, I just selected the first two characters from the names, hoping those are correct. Let me know if you want me to verify :P Thanks for taking the time to add such a detailed review. I'll keep all these changes in mind the next time I'm adding a dataset. Thanks,
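The "first two characters" approach mentioned above can be sketched as follows. The helper name `short_language_code` is made up for this example, and the list only contains codes that appear in this thread. A two-character prefix is not guaranteed to be a valid ISO code for every language, so the sketch also checks for prefixes shared by several configs.

```python
from collections import Counter


def short_language_code(cc_aligned_code: str) -> str:
    """Return the first two characters of a CCAligned-style code such as 'ak_GH'."""
    return cc_aligned_code[:2]


# Codes mentioned in this thread; the real dataset has many more.
codes = ["ak_GH", "af_ZA", "my_MM", "my_MM_zaw"]
short = {code: short_language_code(code) for code in codes}

# Prefixes that occur more than once would be ambiguous as short codes.
counts = Counter(short.values())
collisions = {prefix for prefix, count in counts.items() if count > 1}
print(short)
print(collisions)
```

Here `my_MM` and `my_MM_zaw` both shorten to `my`, which is the kind of case that would need manual verification.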
Thanks for the changes :)
I added just a few comments in the dataset card to take into account the change to the Translation feature type.
Also it looks like the dummy_data.zip file for the ak_GH configuration is quite big (3MB), can you try to reduce its size ? Ideally it should be just a few KB like for the other configs.
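One way to shrink a dummy data archive is to keep only the first line or two of each data file before zipping it. The sketch below is an assumption about how that could be done with the standard library; the archive path, the tab-separated layout, and the helper names are hypothetical and may not match the real dummy data structure.

```python
import io
import zipfile


def shrink_dummy_file(raw_text: str, max_lines: int = 5) -> str:
    """Keep only the first `max_lines` lines of a dummy data file."""
    lines = raw_text.splitlines(keepends=True)
    return "".join(lines[:max_lines])


def build_dummy_zip(files: dict) -> bytes:
    """Pack {archive_path: text} into an in-memory zip and return its bytes."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for path, text in files.items():
            archive.writestr(path, text)
    return buffer.getvalue()


# Hypothetical file name and contents, for illustration only.
original = "".join(f"example.com\tsource {i}\ttarget {i}\n" for i in range(10_000))
small = shrink_dummy_file(original, max_lines=1)
payload = build_dummy_zip({"dummy_data/en_XX-ak_GH.tsv": small})
print(len(payload))
```

For document-level configs a single example can still be long, so the resulting size depends on the content of the line that is kept.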
An instance of `documents` type:

```
{'Domain': 'cjtaisangha.com', 'Source_Content': 'Activities|Search|Search|Home|News|News for education|Our Club|B.A. Students|PGD Students|M.A. Students|M.Phil Studensts|Ph.D. Students|List of CJ Students|Dhamma|MP3|MP-4|Activities|CJTS.Members|E-Books|Ceylon Journey News|SGS Libaray|book list|English|Thai|Shan|Myanmar|Download|Gallery|Contact us|Donate|Links|Skip to content|Home|News|News for education|Our Club|B.A. Students|PGD Students|M.A. Students|M.Phil Studensts|Ph.D. Students|List of CJ Students|Dhamma|MP3|MP-4|Activities|CJTS.Members|E-Books|Ceylon Journey News|SGS Libaray|book list|English|Thai|Shan|Myanmar|Download|Gallery|Contact us|Donate|Links|Activities|Sorry, this entry is only available in Shan.|(Shan) သွၵ်ႈႁႃ|Search for:|Search|English:|English|Shan|Buddha|Calender|December 2019|M|T|W|T|F|S|S|« Sep|1|2 3 4 5 6 7 8|9 10 11 12 13 14 15|16 17 18 19 20 21 22|23 24 25 26 27 28 29|30 31|Cjtaisangha’s facebook|(Shan) ပပ်ႉၸဝ်ႈၶူး Dr. မႁေႃသထႃလင်ၵႃရႃၽိဝမ်သ|(Shan)|Recent Posts|(Shan) ႁၢင်ႈၽၢင်မၢႆတွင်း ႁပ်ႉၸုမ်ႈၶူး M.A & P.G.D တီႈၸၼ်ႉၸွမ် ၵေႇလၼိယ လႄႈ ပၢင်ႁူပ်ႉထူပ်းလုၵ်ႈႁဵၼ်းသင်ႇၶတႆး မိူင်းသီႇႁူဝ်ႇ|(Shan) ပွင်ႇလႅင်းထိုင် ၽူႈႁၵ်ႉလိၵ်ႈလၢႆးပၢႆပႄႇႁဝ်းၶဝ်တင်းသဵင်ႈတီႈၶႃႈ ဢမ်ႇပေႃးႁိုင်ပေႃးၼၢၼ်း ဢၼ်တေပဵၼ်ပပ်ႉမၢႆတွင်း ပီၵွၼ်းၶမ်း (50) ပီတဵမ်|(Shan) ယွၼ်းမူႇလိၵ်ႈၶၢဝ်းတၢင်းသီႇႁူဝ်ႇ မၢႆ-21 ၊ ပီ – 2019 ထိုင် ၸဝ်ႈပီႈၸဝ်ႈၼွင်ႉႁဝ်းၶဝ်တင်းသဵင်ႈ|(Shan) လုၵ်ႈႁဵၼ်းသင်ႇၶတႆး လႆႈၸုမ်ႈ M.A တီႈၸၼ်ႉၸွမ် Buddhist & Pāli Unversity မိူင်းသီႇရိလင်းၵႃ၊ သီႇၸဝ်ႈ|Congraduration to you all|Recent Commens|Resent Ports|September 2019|August 2019|June 2019|February 2019|December 2018|November 2018|October 2018|September 2018|August 2018|July 2018|May 2018|April 2018|March 2018|December 2017|November 2017|October 2017|September 2017|August 2017|July 2017|June 2017|May 2017|(Shan) မူႇၵေႃ|Activities|Local News|News|Uncategorized|World News|Meta|Log in|Entries RSS|Comments RSS|WordPress.org|Visiter|Copyright 2015 © 2017 Cjtaisangha.com |Makutarama Temple 42/15 Reservoir Road, Dematagoda, Colombo 09, 
Sri Lanka Tel: 009 411 2662488 Email: Lankajourney@yahoo.co.uk |Ribosome by GalussoThemes.com|Powered by WordPress|', 'Source_URL': 'http://www.cjtaisangha.com/en/activities/', 'Target_Content': 'ၵၢၼ်တူင်ႉၼိုင်|Search|Search|ၼႃႈႁူဝ်ႁႅၵ်ႈ|ၶၢဝ်ႇၵူႈလွင်ႈ|ၶၢဝ်ႇလွင်ႈၵၢၼ်ႁဵၼ်း|ၽႂ်ပဵၼ်ၽႂ်ၼႂ်းႁဝ်းႁႃး|လုၵ်ႈႁဵၼ်းၸၼ်ႉ BA|လုၵ်ႈႁဵၼ်းၸၼ်ႉ PGD|လုၵ်ႈႁဵၼ်းၸၼ်ႉ MA|လုၵ်ႈႁဵၼ်းၸၼ်ႉ M.Phil|လုၵ်ႈႁဵၼ်းၸၼ်ႉ PhD|သဵၼ်ႈမၢႆလုၵ်ႈႁဵၼ်းၵူႈပီပီ|ထမ်ႇမ|MP3|MP-4|ၵၢၼ်တူင်ႉၼိုင်|ၸုမ်းၽူႈပွင်ၵၢၼ်|ႁွင်ႈပပ်ႉလိၵ်ႈ|ၶၢဝ်းတၢင်းသီႇႁူဝ်ႇ|ႁူင်းတူၺ်းလိၵ်ႈ SGS|သဵၼ်ႈမၢႆပပ်ႉ|ဢင်းၵိတ်ႉ|ၽၢႆႇထႆး|ၽၢႆႇတႆး|ၽၢႆႇမၢၼ်ႈ|တႃႇလုတ်ႇလူင်း|ၶႅပ်းႁၢင်ႈ|တီႈၵပ်းသိုပ်ႇ|ၵပ်းသိုပ်ႇလူႇတၢၼ်း|လိင်ႉ|Skip to content|ၼႃႈႁူဝ်ႁႅၵ်ႈ|ၶၢဝ်ႇၵူႈလွင်ႈ|ၶၢဝ်ႇလွင်ႈၵၢၼ်ႁဵၼ်း|ၽႂ်ပဵၼ်ၽႂ်ၼႂ်းႁဝ်းႁႃး|လုၵ်ႈႁဵၼ်းၸၼ်ႉ BA|လုၵ်ႈႁဵၼ်းၸၼ်ႉ PGD|လုၵ်ႈႁဵၼ်းၸၼ်ႉ MA|လုၵ်ႈႁဵၼ်းၸၼ်ႉ M.Phil|လုၵ်ႈႁဵၼ်းၸၼ်ႉ PhD|သဵၼ်ႈမၢႆလုၵ်ႈႁဵၼ်းၵူႈပီပီ|ထမ်ႇမ|MP3|MP-4|ၵၢၼ်တူင်ႉၼိုင်|ၸုမ်းၽူႈပွင်ၵၢၼ်|ႁွင်ႈပပ်ႉလိၵ်ႈ|ၶၢဝ်းတၢင်းသီႇႁူဝ်ႇ|ႁူင်းတူၺ်းလိၵ်ႈ SGS|သဵၼ်ႈမၢႆပပ်ႉ|ဢင်းၵိတ်ႉ|ၽၢႆႇထႆး|ၽၢႆႇတႆး|ၽၢႆႇမၢၼ်ႈ|တႃႇလုတ်ႇလူင်း|ၶႅပ်းႁၢင်ႈ|တီႈၵပ်းသိုပ်ႇ|ၵပ်းသိုပ်ႇလူႇတၢၼ်း|လိင်ႉ|ၵၢၼ်တူင်ႉၼိုင်|မိူဝ်ႈဝၼ်းထိ 19-20 May 2018 ၼၼ်ႉ ၸဝ်ႈသြႃႇဝိၸယႃၽိပႃလ ပူင်သွၼ်ပၼ် ၸဝ်ႈပီႈၸဝ်ႈၼွင်ႉ ဢၼ်မႃးႁဵၼ်းလိၵ်ႈ တီႈမိူင်းသီႇႁူဝ်ႇၼႆႉ၊ ၼင်ႇႁိုဝ် မိူဝ်းၼႃႈထႃႈပၢႆမႃး ပေႃႈတေမီးလွင်ႈ ယုမ်ႇတူဝ်ယုမ်ႇၸႂ်သေဢမ်ႇၵႃး ၼင်ႇႁိုဝ်ပေႃးတေ မီးလွင်ႈတူဝ်ႈတၼ်းလီ တွၼ်ႈတႃႇႁဵၼ်းလႆႈထိုင်ၸၼ်ႉသုင်သုင်ၼႆၼၼ်ႉလႄႈ ၸဝ်ႈသြႃႇဝိၸယႃၽိပႃလ ၸင်ႇလႆႈမီးၸႂ် မဵတ်ႉတႃႇယႂ်ႇၼိူဝ် ၸဝ်ႈပီႈၸဝ်ႈၼွင်ႉပေႃးတေမီးလွင်ႈတၢင်းႁူႉတၢင်းႁၼ် ၼိူဝ်လိၵ်ႈလၢႆးပၢႆပႄႇၼႆသေ ပူင်သွၼ်ပၼ် လွၵ်းလၢႆးတႅမ်ႈလိၵ်ႈ M.Phil တီႈဝတ်ႉမၵုတႃႇရႃႇမ၊ ႁွင်ႈတူၺ်းလိၵ်ႈ ၸဝ်ႈၵၢင်းသိူဝ် ၵႂႃႇၼႆယူႇယဝ်ႉ။|တေလႆႈႁဵတ်းႁိုဝ် ဝႆႉဝၢင်းတူၼ်ႈထႅဝ်၊ တေလႆႈႁဵတ်းႁိုဝ်ၶပ်ႉလိၵ်ႈမႅၼ်ႈမႅၼ်ႈၸွမ်းပိူင်၊ တေလႆႈႁဵတ်းႁိုဝ်သႂ်ႇ Footnote , တေႁဵတ်းႁိုဝ် သႂ်ႇၽိုၼ်ဢိင် တႄႇၵႂႃႇၸိူဝ်းၼႆႉ တေလႆႈဝႃႈ ပၼ်တၢင်းႁူႉႁၼ် တၢင်းၼမ်တၢင်းလၢႆယူႇ။ ပဵၼ်ဢၼ်လီမၢႆ လီတွင်းဝႆႉတႄႉတႄႉယူႇယဝ်ႉ။|ၶၢဝ်းတၢင်းသီႇႁူဝ်ႇ|မိူဝ်ႈဝၼ်းထိ 12.04.2018 ၸဝ်ႈသြႃႇဝိၸယႃၽိပႃလ ဢွၼ်ႁူဝ် ၸဝ်ႈပီႈၸဝ်ႈၼွင်ႉ လုၵ်ႈႁဵၼ်းသင်ႇၶတႆး ၊ မိူင်းသီႇႁူဝ်ႇ ၶပ်ႉပပ်ႉလိၵ်ႈ၊ ႁူင်းတူၺ်းလိၵ်ႈ ၸဝ်ႈၵၢင်းသိူဝ် ၶၢဝ်းတၢင်းတႄႇၶပ်ႉပပ်ႉလိၵ်ႈ 
တႄႇဝၼ်းထိ 10 – 20. 04 . 2018 ၼႆယူႇယဝ်ႉ။|ၸိူဝ်းပဵၼ်ၸဝ်ႈပီႈၸဝ်ႈၼွင်ႉ ၸိူဝ်းမီးၶၢဝ်းယၢမ်းတူဝ်ႈတၼ်းၸိူဝ်းၼၼ်ႉ လႄႈ မၢင်ၸိူဝ်းၵေႃႈ တတ်းဢဝ်ၶၢဝ်းယၢမ်း ၼႂ်းၵႄႈဢမ်ႇတၼ်းၼၼ်ႉသေ မႃးၸွႆႈမႃးထႅမ် မိူၼ်ၼင်ႇဝႃႈ ႁူမ်ႈၵၼ်ၵိၼ်ၸင်ႇဝၢၼ် ႁူမ်ႈၵၼ်ႁၢမ်ၸင်ႇမဝ် ၼႆၼၼ်ႉ ယဝ်ႉ။ တွၼ်ႈတႃႇၼိုင်ႈပီၼိုင်ႈပီၼႆႉ ဢၼ်ပဵၼ်ပပ်ႉလိၵ်ႈ ဢင်းၵိတ်ႉ၊ တႆး ၊ ထႆး၊ မၢၼ်ႈ တႄႇၵႂႃႇၸိူဝ်းၼႆႉၵေႃႈ တိူဝ်းၼမ် မႃးတိၵ်းꧦ ၵူႈပီပီလႄႈ လႆႈ Update ပၼ်သဵၼ်ႈမၢႆမၼ်ႈယူႇ ၵူႈပီပီၼႆယဝ်ႉ။ ဢၼ်ၼမ်လိူဝ်ၼႆႉတႄႉ တေပဵၼ်ဢင်းၵိတ်ႉ ယဝ်ႉၶႃႈ၊ ယွၼ်ႉပိူဝ်ႈဝႃႈ ဢင်းၵိတ်ႉၼႆႉ ပဵၼ်လၵ်းထၢၼ် ၵၢၼ်ႁဵၼ်း တွၼ်ႈတႃႇ လုၵ်ႈႁဵၼ်းသင်ႇၶတႆး ၸိူဝ်းဢၼ်မႃး ၶိုၼ်ႈႁဵၼ်း လဵပ်ႈႁဵၼ်း ပႆၸွမ်းၶၢဝ်းတၢင်းသီႇႁူဝ်ႇၼႆႉ|ၼႆလႄႈ ၸဝ်ႈၶူးလူင် Prof. Dr. ၶမ်းမၢႆ ထမ်မသႃမိ ၸင်ႇဢွၼ်ႁူဝ်တႄႇတင်ႈ မႃးပၼ်ႁဝ်းၶႃႈ ၸဝ်ႈပီႈၸဝ်ႈၼွင်ႉ ႁႂ်ႈပေႃးလႆႈတူၺ်းလႆႈ တွင်းလႆႈ ႁႃႈလႆႈၶေႃႈမုၼ်း ၵဵဝ်ႇလူၺ်ႈၵၢၼ်ႁဵၼ်း၊ ဢၼ်ပဵၼ် တၢင်းၸွႆႈထႅမ်လွင်ႈၵၢၼ်ႁဵၼ်းလႆႈ ငၢႆႈငၢႆႈၼႆၼၼ်ႉယူႇယဝ်ႉ၊ ၼႆလႄႈ ၸိူဝ်းပဵၼ်လုၵ်ႈၼွင်ႉၸဝ်ႈၶူးလူင်ၵေႃႈ ၸင်ႇလႆႈသိုပ်ႇ ထိင်းသိမ်း၊ သိုပ်ႇႁဵၼ်းၵၢၼ်ၵႂႃႇ ပၢၼ်သိုပ်ႇပၢၼ်ယူႇၶႃႈယဝ်ႉ။|မိူဝ်ႈဝၼ်းထိ 08.06.2017 ပၢင်ႁူပ်ႉထူပ်း ၸုမ်းဝႆႈၽြႃး ဢၼ်ၸဝ်ႈၶူး မုၼိဝရ (လၢႆးၶႃႈ) လႄႈ ၸဝ်ႈၶူး ၺႃၼဝရ (ၵျွၵ်းမႄး) ဢွၼ်ႁူဝ် တၵ်ႉၵႃသထႃး ဝဵင်းၵျွၵ်းမႄး၊ ဝဵင်းလၢႆးၶႃႈ လႄႈ သင်ႇၶတႆး မိူင်းသီႇႁူဝ်ႇ။|ယိူင်းၸူးလူႇတၢၼ်းပၼ် ၶိင်းသူၺ်ႇၵျွင်းဢၢႆႈၽူင်း ၼိုင်ႈလိူၼ်တဵမ် (AD. 1943-2017)|(ၽူႈဢၼ်လူႇတၢၼ်းယႂ်ႇလူင် ၼႂ်းၽႃႇသႃႇ၊ သႃႇသၼႃႇ၊ လႄႈ ၼႃႈယၵ ၵေႃလိၵ်ႈလၢႆးလႄႈ ၽင်ႈ|ငႄႈတႆး ဝဵင်းလိူဝ်ႇ) ဢၼ်သဵင်ႈၵႂႃႇ တဵမ်ၼိုင်ႈလိူၼ်သေ ၼၢႆးသူၺ်ႇၵျွင်းဢီႇသၢဝ် လႄႈ|လုၵ်ႈလၢၼ်တင်းသဵင်ႈ (မူႇၸေႈ၊ လႃႈသဵဝ်ႈ၊ တႃႈလိူဝ်ႇ၊ တႃႈၵုင်ႈ) လူႇတၢၼ်းၵၢပ်ႈသွမ်းဢေႃႈ။ (10/06/2017)|သွၵ်ႈႁႃ|Search for:|Search|ၽႃႇသႃႇတႆး:|English|Shan|ၿုၻ်ꩪၸဝ်ႈ|ပၵ်းယဵမ်ႈဝၼ်း|December 2019|M|T|W|T|F|S|S|« Sep|1|2 3 4 5 6 7 8|9 10 11 12 13 14 15|16 17 18 19 20 21 22|23 24 25 26 27 28 29|30 31|ၾဵတ်ႉၶၢဝ်းတၢင်းသီႇႁူဝ်ႇ|ပပ်ႉၸဝ်ႈၶူး Dr. 
မႁေႃသထႃလင်ၵႃရႃၽိဝမ်သ|ၶၢဝ်ႇမိူဝ်ႈလဵဝ်|ႁၢင်ႈၽၢင်မၢႆတွင်း ႁပ်ႉၸုမ်ႈၶူး M.A & P.G.D တီႈၸၼ်ႉၸွမ် ၵေႇလၼိယ လႄႈ ပၢင်ႁူပ်ႉထူပ်းလုၵ်ႈႁဵၼ်းသင်ႇၶတႆး မိူင်းသီႇႁူဝ်ႇ|ပွင်ႇလႅင်းထိုင် ၽူႈႁၵ်ႉလိၵ်ႈလၢႆးပၢႆပႄႇႁဝ်းၶဝ်တင်းသဵင်ႈတီႈၶႃႈ ဢမ်ႇပေႃးႁိုင်ပေႃးၼၢၼ်း ဢၼ်တေပဵၼ်ပပ်ႉမၢႆတွင်း ပီၵွၼ်းၶမ်း (50) ပီတဵမ်|ယွၼ်းမူႇလိၵ်ႈၶၢဝ်းတၢင်းသီႇႁူဝ်ႇ မၢႆ-21 ၊ ပီ – 2019 ထိုင် ၸဝ်ႈပီႈၸဝ်ႈၼွင်ႉႁဝ်းၶဝ်တင်းသဵင်ႈ|လုၵ်ႈႁဵၼ်းသင်ႇၶတႆး လႆႈၸုမ်ႈ M.A တီႈၸၼ်ႉၸွမ် Buddhist & Pāli Unversity မိူင်းသီႇရိလင်းၵႃ၊ သီႇၸဝ်ႈ|သဵၼ်ႈမၢႆၸဝ်ႈပီႈၸဝ်ႈၼွင်ႉ ၸိူဝ်းဢွင်ႇပူၼ်ႉၸၼ်ႉလိၵ်ႈၼႂ်းပီ 2018|ၶေႃႈပၼ်တၢင်းၶႆႈၸႂ်|ႁွင်ႈၶၢဝ်ႇ|September 2019|August 2019|June 2019|February 2019|December 2018|November 2018|October 2018|September 2018|August 2018|July 2018|May 2018|April 2018|March 2018|December 2017|November 2017|October 2017|September 2017|August 2017|July 2017|June 2017|May 2017|မူႇၵေႃ|ၵၢၼ်တူင်ႉၼိုင်|ၶၢဝ်ႇၼႂ်းမိူင်း|News|Uncategorized|ၶၢဝ်ႇၼွၵ်ႈမိူင်း|Meta|Log in|Entries RSS|Comments RSS|WordPress.org|တူဝ်ၼပ်ႉၵူၼ်းမႃးယဵမ်ႈ|Copyright 2015 © 2017 Cjtaisangha.com |Makutarama Temple 42/15 Reservoir Road, Dematagoda, Colombo 09, Sri Lanka Tel: 009 411 2662488 Email: Lankajourney@yahoo.co.uk |Ribosome by GalussoThemes.com|Powered by WordPress|\n', 'Target_URL': 'http://www.cjtaisangha.com/activities/'} |
```
Can you update this example now that we're using the Translation feature type, please?
Yes. I missed the README. Sorry.
An instance of `sentences` type:

```
{'LASER_similarity': 1.2734256982803345, 'Source_Sentence': '>>> PhD Students in 2018', 'Target_Sentence': '>>> လုၵ်ႈႁဵၼ်းၸၼ်ႉ Ph.D ၼႂ်းပီ 2018', 'from_english': True}
```
same here
- `Source_URL`: a `string` feature containing the source URL.
- `Source_Content`: a `string` feature containing the content on Source_URL.
- `Target_URL`: a `string` feature containing the target URL.
- `Target_Content`: a `string` feature containing the content on Target_URL.
The `Source_Content` and `Target_Content` fields are now replaced by a field `translation` with two subfields.
One subfield is `en_XX` and the other one is the other language code.
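As a rough illustration of that change, the sketch below converts a record in the old layout into the new one. The helper name `to_translation_record` and the sample values are made up for this example, and it assumes the source side is always English (`en_XX`), which the actual dataset script may handle differently.

```python
def to_translation_record(old_record: dict, language_code: str) -> dict:
    """Move Source_Content/Target_Content into a nested `translation` field."""
    # Copy every field except the two content fields being replaced.
    record = {
        key: value
        for key, value in old_record.items()
        if key not in ("Source_Content", "Target_Content")
    }
    # Assumption: the source side is English, keyed as "en_XX".
    record["translation"] = {
        "en_XX": old_record["Source_Content"],
        language_code: old_record["Target_Content"],
    }
    return record


old_record = {
    "Domain": "example.com",
    "Source_URL": "http://example.com/en/",
    "Source_Content": "Hello",
    "Target_URL": "http://example.com/",
    "Target_Content": "target text",
}
new_record = to_translation_record(old_record, "ak_GH")
print(new_record["translation"])
```

The same idea would apply to the `sentences` type, with `Source_Sentence` and `Target_Sentence` taking the place of the content fields.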
For `sentences` type:

- `Source_Sentence`: a `string` feature containing the source sentence.
- `Target_Sentence`: a `string` feature containing the target sentence.
same here
Hi @lhoestq, I have changed the README, and added a single example per config. Even one example is long enough to make the files heavy. Hope that isn't an issue. Thanks,
Thanks for the changes !
LGTM :)
Hi @lhoestq, Thanks for approving.
Hello,
I'm trying to add CCAligned Multilingual Dataset. This has the potential to close #1756.
This dataset has two types - Document-Pairs, and Sentence-Pairs.
The datasets are huge, so I won't be able to test all of them. At the same time, a user might only want to download one particular language and not all. To provide this feature, `load_dataset`'s `**config_kwargs` should allow some random keyword args, in this case `language_code`. This will be needed before the dataset is downloaded and extracted.

I'm expecting the usage to be something like `load_dataset('ccaligned_multilingual', 'documents', language_code='en_XX-af_ZA')`. Of course, at a later stage we can provide just two-character language codes. This also has an issue where one language has multiple files (`my_MM` and `my_MM_zaw` on the link), but before that the required functionality must be added to `load_dataset`.

It would be great if someone could either tell me an alternative way to do this, or point me to where changes need to be made, if any, apart from the `BuilderConfig` definition.

Additionally, I believe the tests will also have to be modified if this change is made, since it would not be possible to test for any random keyword arguments.

A decent way to go about this would be to provide all the options in a list/dictionary for `language_code` and use that to test the arguments. In essence, this is similar to the pre-trained checkpoint dictionary in `transformers`. That means writing dataset-specific tests, or adding something new to the dataset generation script to make it easier for everyone to add keyword arguments without having to worry about the tests.

Thanks,
Gunjan
Requesting @lhoestq / @yjernite to review.
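The list/dictionary idea from the description can be sketched like this. The dictionary contents, the helper names, and the error messages are assumptions for illustration; the real dataset would list every available language pair, and this is not how the `datasets` test suite is actually implemented.

```python
# Hypothetical registry of allowed options; only a couple of pairs are shown.
SUPPORTED_LANGUAGE_CODES = {
    "documents": ["en_XX-af_ZA", "en_XX-ak_GH"],
    "sentences": ["en_XX-af_ZA", "en_XX-ak_GH"],
}


def validate_config(dataset_type: str, language_code: str) -> str:
    """Return a config name if the combination is known, else raise ValueError."""
    if dataset_type not in SUPPORTED_LANGUAGE_CODES:
        raise ValueError(f"Unknown dataset type: {dataset_type!r}")
    if language_code not in SUPPORTED_LANGUAGE_CODES[dataset_type]:
        raise ValueError(f"Unknown language code: {language_code!r}")
    return f"{dataset_type}-{language_code}"


def configs_to_test() -> list:
    """Enumerate every known combination so tests can iterate over them."""
    return [
        validate_config(dataset_type, code)
        for dataset_type, codes in SUPPORTED_LANGUAGE_CODES.items()
        for code in codes
    ]


print(validate_config("documents", "en_XX-af_ZA"))
print(configs_to_test())
```

With a fixed registry like this, a test can loop over `configs_to_test()` instead of having to guess which keyword arguments are valid.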