Add CCAligned Multilingual Dataset #1815

Merged

Contributor

@gchhablani gchhablani commented Feb 3, 2021

Hello,

I'm trying to add CCAligned Multilingual Dataset. This has the potential to close #1756.

This dataset has two types - Document-Pairs, and Sentence-Pairs.

The datasets are huge, so I won't be able to test all of them. At the same time, a user might want to download only one particular language pair rather than all of them. To support this, load_dataset's **config_kwargs should accept arbitrary keyword arguments, in this case language_code. The value is needed before the dataset is downloaded and extracted.

I'm expecting the usage to be something like -
load_dataset('ccaligned_multilingual','documents',language_code='en_XX-af_ZA'). Of course, at a later stage we can accept plain two-character language codes. There is also an issue where one language has multiple files (my_MM and my_MM_zaw on the link), but the required functionality must be added to load_dataset first.

It would be great if someone could either tell me an alternative way to do this, or point me to where changes need to be made, if any, apart from the BuilderConfig definition.

Additionally, I believe the tests will also have to be modified if this change is made, since it would not be possible to test arbitrary keyword arguments.

A decent way to go about this would be to provide all the options for language_code in a list/dictionary and use that to test the arguments. In essence, this is similar to the pre-trained checkpoint dictionary in transformers. That would mean either writing dataset-specific tests or adding something new to the dataset generation script, so that everyone can add keyword arguments without having to worry about the tests.
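
A minimal sketch of the list/dictionary idea, assuming a hand-maintained mapping from dataset type to allowed language codes. The name `SUPPORTED_LANGUAGE_CODES`, the helper `check_language_code`, and the two entries are illustrative only; the real list would have to be taken from the dataset's download page.

```python
# Illustrative subset only (assumption): the real mapping would list every
# language pair published for each dataset type.
SUPPORTED_LANGUAGE_CODES = {
    "documents": ["en_XX-af_ZA", "en_XX-ak_GH"],
    "sentences": ["en_XX-af_ZA", "en_XX-ak_GH"],
}


def check_language_code(dataset_type, language_code):
    """Return language_code if it is listed for dataset_type, else raise ValueError."""
    if dataset_type not in SUPPORTED_LANGUAGE_CODES:
        raise ValueError(f"Unknown dataset type: {dataset_type!r}")
    if language_code not in SUPPORTED_LANGUAGE_CODES[dataset_type]:
        raise ValueError(
            f"Unsupported language_code {language_code!r} for {dataset_type!r}"
        )
    return language_code


print(check_language_code("documents", "en_XX-af_ZA"))
```

A test could then loop over the same mapping, so the set of tested arguments and the set of accepted arguments stay in one place.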

Thanks,
Gunjan

Requesting @lhoestq / @yjernite to review.

@lhoestq
Member

lhoestq commented Feb 4, 2021

Hi !

We already have some datasets that can have many many configurations possible.
To support that, you can subclass BuilderConfig and add as many additional parameters as you need.
This way users can load any language they want. For example the bible_para dataset is a dataset for translation and therefore users should be able to provide any language pair. You can check how the subclass of BuilderConfig is defined here.

For testing, only the configurations defined in the BUILDER_CONFIGS class attribute are used.
All other config combinations are not tested, but users can still use them. If a config doesn't already exist in BUILDER_CONFIGS, it is created on the fly.
For example in bible_para, only 6 configs are defined in BUILDER_CONFIGS.
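
The lookup described above can be pictured with a small, library-free sketch. This is not the `datasets` implementation; `PairConfig`, `resolve_config`, and the two predefined pairs are hypothetical stand-ins that only illustrate "use the predefined config if it exists, otherwise build one on the fly".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PairConfig:
    """Hypothetical stand-in for a BuilderConfig subclass with two extra fields."""
    name: str
    lang1: str
    lang2: str


# Only the configs listed here would be exercised by the tests.
BUILDER_CONFIGS = [
    PairConfig(name="de-en", lang1="de", lang2="en"),
    PairConfig(name="en-fr", lang1="en", lang2="fr"),
]


def resolve_config(lang1, lang2):
    """Return (config, predefined): reuse a listed config or create a new one."""
    name = f"{lang1}-{lang2}"
    for config in BUILDER_CONFIGS:
        if config.name == name:
            return config, True
    # Not predefined: still usable, just never covered by the tests.
    return PairConfig(name=name, lang1=lang1, lang2=lang2), False


print(resolve_config("de", "en"))
print(resolve_config("en", "sw"))
```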

So what I would do in your case is have something like

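import datasets


class CCAlignedConfig(datasets.BuilderConfig):
    """BuilderConfig with the two extra parameters this dataset needs."""

    def __init__(self, *args, documents_or_sentences=None, language_code=None, **kwargs):
        # Derive the config name from the two parameters, e.g. "documents-en_XX-af_ZA".
        super().__init__(
            *args,
            name=f"{documents_or_sentences}-{language_code}",
            **kwargs,
        )
        self.documents_or_sentences = documents_or_sentences
        self.language_code = language_code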

And of course, feel free to change/rename things if you want to. In particular I think we can improve the name of the parameter documents_or_sentences
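
To see what name such a config ends up with, here is a sketch that runs without the `datasets` package. The `BuilderConfig` class below is a deliberately minimal stand-in (an assumption: only `name` and `description` are modeled), not the real base class; the subclass body is the one suggested above.

```python
class BuilderConfig:
    """Minimal stand-in for datasets.BuilderConfig, for illustration only."""

    def __init__(self, name="default", description=None):
        self.name = name
        self.description = description


class CCAlignedConfig(BuilderConfig):
    def __init__(self, *args, documents_or_sentences=None, language_code=None, **kwargs):
        # The config name is derived from the two dataset-specific parameters.
        super().__init__(
            *args,
            name=f"{documents_or_sentences}-{language_code}",
            **kwargs,
        )
        self.documents_or_sentences = documents_or_sentences
        self.language_code = language_code


config = CCAlignedConfig(
    documents_or_sentences="documents",
    language_code="en_XX-af_ZA",
)
print(config.name)  # documents-en_XX-af_ZA
```

Because the name is built from both parameters, every (type, language pair) combination gets a distinct config name.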

@gchhablani
Contributor Author

Hi @lhoestq,

Thanks a lot! I don't know why I didn't think about that. :P
I'll make these changes and update.

@gchhablani gchhablani marked this pull request as ready for review February 4, 2021 19:30
@gchhablani
Contributor Author

Hi @lhoestq,

I have tested and added dummy files. Could you please review?

Also, does this mean BUILDER_CONFIGS is only needed while testing?

@gchhablani
Contributor Author

Hi @lhoestq,

Any changes required on this one?

Thanks,
Gunjan

Member

@lhoestq lhoestq left a comment

Hi, Sorry for the delay ^^'

That's awesome thanks ! Good job with the dataset config
I left a few comments.

Also could you try to reduce the size of the dummy data for documents-ak_GH please ? It's currently 3.8MB and it would be awesome to have something less than 20KB
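
One way to get under the size target is to keep only the first few rows of each file before zipping. The sketch below is an assumption-heavy illustration: the tab-separated row layout, the archive member path, and the helper name `build_dummy_zip` are all hypothetical, and the rows are synthetic.

```python
import io
import zipfile


def build_dummy_zip(lines, member_name, max_lines=5):
    """Zip only the first max_lines rows and return the archive as bytes."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        archive.writestr(member_name, "\n".join(lines[:max_lines]) + "\n")
    return buffer.getvalue()


# Synthetic rows standing in for a large data file (hypothetical layout).
rows = [
    f"example.com\thttp://example.com/en/{i}\tsource text {i}"
    f"\thttp://example.com/{i}\ttarget text {i}"
    for i in range(10_000)
]

payload = build_dummy_zip(rows, "dummy_data/en_XX-ak_GH.tsv")
print(len(payload) < 20 * 1024)
```

If individual rows are themselves very long, as with full-page documents, trimming the row count alone may not be enough and the row contents would need shortening as well.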

Review comments (resolved) on datasets/ccaligned_multilingual/README.md and datasets/ccaligned_multilingual/ccaligned_multilingual.py
@gchhablani
Contributor Author

Hi @lhoestq,

Sorry for the delay. I have made the changes from the review. For the ISO-format language codes, I just took the first two characters of the names, hoping those are correct. Let me know if you want me to verify :P
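
A quick sketch of the "first two characters" shortcut, using only locale names that appear in this thread. The helper names are hypothetical. It also shows why the result is worth verifying: two different files can map to the same short code, and any locale whose language part is longer than two letters would be truncated.

```python
from collections import defaultdict


def short_code(locale_name):
    """Take the first two characters of a locale name such as 'af_ZA'."""
    return locale_name[:2]


def find_collisions(locale_names):
    """Group locale names by short code and keep groups with more than one member."""
    groups = defaultdict(list)
    for name in locale_names:
        groups[short_code(name)].append(name)
    return {code: names for code, names in groups.items() if len(names) > 1}


locales = ["en_XX", "af_ZA", "ak_GH", "my_MM", "my_MM_zaw"]
print([short_code(name) for name in locales])
print(find_collisions(locales))
```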

Thanks for taking the time to add such a detailed review. I'll keep all these changes in mind the next time I'm adding a dataset.

Thanks,
Gunjan

Member

@lhoestq lhoestq left a comment

Thanks for the changes :)

I added just a few comments in the dataset card to take into account the change to the Translation feature type.

Also it looks like the dummy_data.zip file for the ak_GH configuration is quite big (3MB), can you try to reduce its size ? Ideally it should be just a few KB like for the other configs.

An instance of `documents` type:

```
{'Domain': 'cjtaisangha.com', 'Source_Content': 'Activities|Search|Search|Home|News|News for education|Our Club|B.A. Students|PGD Students|M.A. Students|M.Phil Studensts|Ph.D. Students|List of CJ Students|Dhamma|MP3|MP-4|Activities|CJTS.Members|E-Books|Ceylon Journey News|SGS Libaray|book list|English|Thai|Shan|Myanmar|Download|Gallery|Contact us|Donate|Links|Skip to content|Home|News|News for education|Our Club|B.A. Students|PGD Students|M.A. Students|M.Phil Studensts|Ph.D. Students|List of CJ Students|Dhamma|MP3|MP-4|Activities|CJTS.Members|E-Books|Ceylon Journey News|SGS Libaray|book list|English|Thai|Shan|Myanmar|Download|Gallery|Contact us|Donate|Links|Activities|Sorry, this entry is only available in Shan.|(Shan) သွၵ်ႈႁႃ|Search for:|Search|English:|English|Shan|Buddha|Calender|December 2019|M|T|W|T|F|S|S|« Sep|1|2 3 4 5 6 7 8|9 10 11 12 13 14 15|16 17 18 19 20 21 22|23 24 25 26 27 28 29|30 31|Cjtaisangha’s facebook|(Shan) ပပ်ႉၸဝ်ႈၶူး Dr. မႁေႃသထႃလင်ၵႃရႃၽိဝမ်သ|(Shan)|Recent Posts|(Shan) ႁၢင်ႈၽၢင်မၢႆတွင်း ႁပ်ႉၸုမ်ႈၶူး M.A & P.G.D တီႈၸၼ်ႉၸွမ် ၵေႇလၼိယ လႄႈ ပၢင်ႁူပ်ႉထူပ်းလုၵ်ႈႁဵၼ်းသင်ႇၶတႆး မိူင်းသီႇႁူဝ်ႇ|(Shan) ပွင်ႇလႅင်းထိုင် ၽူႈႁၵ်ႉလိၵ်ႈလၢႆးပၢႆပႄႇႁဝ်းၶဝ်တင်းသဵင်ႈတီႈၶႃႈ ဢမ်ႇပေႃးႁိုင်ပေႃးၼၢၼ်း ဢၼ်တေပဵၼ်ပပ်ႉမၢႆတွင်း ပီၵွၼ်းၶမ်း (50) ပီတဵမ်|(Shan) ယွၼ်းမူႇလိၵ်ႈၶၢဝ်းတၢင်းသီႇႁူဝ်ႇ မၢႆ-21 ၊ ပီ – 2019 ထိုင် ၸဝ်ႈပီႈၸဝ်ႈၼွင်ႉႁဝ်းၶဝ်တင်းသဵင်ႈ|(Shan) လုၵ်ႈႁဵၼ်းသင်ႇၶတႆး လႆႈၸုမ်ႈ M.A တီႈၸၼ်ႉၸွမ် Buddhist & Pāli Unversity မိူင်းသီႇရိလင်းၵႃ၊ သီႇၸဝ်ႈ|Congraduration to you all|Recent Commens|Resent Ports|September 2019|August 2019|June 2019|February 2019|December 2018|November 2018|October 2018|September 2018|August 2018|July 2018|May 2018|April 2018|March 2018|December 2017|November 2017|October 2017|September 2017|August 2017|July 2017|June 2017|May 2017|(Shan) မူႇၵေႃ|Activities|Local News|News|Uncategorized|World News|Meta|Log in|Entries RSS|Comments RSS|WordPress.org|Visiter|Copyright 2015 © 2017 Cjtaisangha.com |Makutarama Temple 42/15 Reservoir Road, Dematagoda, Colombo 09, 
Sri Lanka Tel: 009 411 2662488 Email: Lankajourney@yahoo.co.uk |Ribosome by GalussoThemes.com|Powered by WordPress|', 'Source_URL': 'http://www.cjtaisangha.com/en/activities/', 'Target_Content': 'ၵၢၼ်တူင်ႉၼိုင်|Search|Search|ၼႃႈႁူဝ်ႁႅၵ်ႈ|ၶၢဝ်ႇၵူႈလွင်ႈ|ၶၢဝ်ႇလွင်ႈၵၢၼ်ႁဵၼ်း|ၽႂ်ပဵၼ်ၽႂ်ၼႂ်းႁဝ်းႁႃး|လုၵ်ႈႁဵၼ်းၸၼ်ႉ BA|လုၵ်ႈႁဵၼ်းၸၼ်ႉ PGD|လုၵ်ႈႁဵၼ်းၸၼ်ႉ MA|လုၵ်ႈႁဵၼ်းၸၼ်ႉ M.Phil|လုၵ်ႈႁဵၼ်းၸၼ်ႉ PhD|သဵၼ်ႈမၢႆလုၵ်ႈႁဵၼ်းၵူႈပီပီ|ထမ်ႇမ|MP3|MP-4|ၵၢၼ်တူင်ႉၼိုင်|ၸုမ်းၽူႈပွင်ၵၢၼ်|ႁွင်ႈပပ်ႉလိၵ်ႈ|ၶၢဝ်းတၢင်းသီႇႁူဝ်ႇ|ႁူင်းတူၺ်းလိၵ်ႈ SGS|သဵၼ်ႈမၢႆပပ်ႉ|ဢင်းၵိတ်ႉ|ၽၢႆႇထႆး|ၽၢႆႇတႆး|ၽၢႆႇမၢၼ်ႈ|တႃႇလုတ်ႇလူင်း|ၶႅပ်းႁၢင်ႈ|တီႈၵပ်းသိုပ်ႇ|ၵပ်းသိုပ်ႇလူႇတၢၼ်း|လိင်ႉ|Skip to content|ၼႃႈႁူဝ်ႁႅၵ်ႈ|ၶၢဝ်ႇၵူႈလွင်ႈ|ၶၢဝ်ႇလွင်ႈၵၢၼ်ႁဵၼ်း|ၽႂ်ပဵၼ်ၽႂ်ၼႂ်းႁဝ်းႁႃး|လုၵ်ႈႁဵၼ်းၸၼ်ႉ BA|လုၵ်ႈႁဵၼ်းၸၼ်ႉ PGD|လုၵ်ႈႁဵၼ်းၸၼ်ႉ MA|လုၵ်ႈႁဵၼ်းၸၼ်ႉ M.Phil|လုၵ်ႈႁဵၼ်းၸၼ်ႉ PhD|သဵၼ်ႈမၢႆလုၵ်ႈႁဵၼ်းၵူႈပီပီ|ထမ်ႇမ|MP3|MP-4|ၵၢၼ်တူင်ႉၼိုင်|ၸုမ်းၽူႈပွင်ၵၢၼ်|ႁွင်ႈပပ်ႉလိၵ်ႈ|ၶၢဝ်းတၢင်းသီႇႁူဝ်ႇ|ႁူင်းတူၺ်းလိၵ်ႈ SGS|သဵၼ်ႈမၢႆပပ်ႉ|ဢင်းၵိတ်ႉ|ၽၢႆႇထႆး|ၽၢႆႇတႆး|ၽၢႆႇမၢၼ်ႈ|တႃႇလုတ်ႇလူင်း|ၶႅပ်းႁၢင်ႈ|တီႈၵပ်းသိုပ်ႇ|ၵပ်းသိုပ်ႇလူႇတၢၼ်း|လိင်ႉ|ၵၢၼ်တူင်ႉၼိုင်|မိူဝ်ႈဝၼ်းထိ 19-20 May 2018 ၼၼ်ႉ ၸဝ်ႈသြႃႇဝိၸယႃၽိပႃလ ပူင်သွၼ်ပၼ် ၸဝ်ႈပီႈၸဝ်ႈၼွင်ႉ ဢၼ်မႃးႁဵၼ်းလိၵ်ႈ တီႈမိူင်းသီႇႁူဝ်ႇၼႆႉ၊ ၼင်ႇႁိုဝ် မိူဝ်းၼႃႈထႃႈပၢႆမႃး ပေႃႈတေမီးလွင်ႈ ယုမ်ႇတူဝ်ယုမ်ႇၸႂ်သေဢမ်ႇၵႃး ၼင်ႇႁိုဝ်ပေႃးတေ မီးလွင်ႈတူဝ်ႈတၼ်းလီ တွၼ်ႈတႃႇႁဵၼ်းလႆႈထိုင်ၸၼ်ႉသုင်သုင်ၼႆၼၼ်ႉလႄႈ ၸဝ်ႈသြႃႇဝိၸယႃၽိပႃလ ၸင်ႇလႆႈမီးၸႂ် မဵတ်ႉတႃႇယႂ်ႇၼိူဝ် ၸဝ်ႈပီႈၸဝ်ႈၼွင်ႉပေႃးတေမီးလွင်ႈတၢင်းႁူႉတၢင်းႁၼ် ၼိူဝ်လိၵ်ႈလၢႆးပၢႆပႄႇၼႆသေ ပူင်သွၼ်ပၼ် လွၵ်းလၢႆးတႅမ်ႈလိၵ်ႈ M.Phil တီႈဝတ်ႉမၵုတႃႇရႃႇမ၊ ႁွင်ႈတူၺ်းလိၵ်ႈ ၸဝ်ႈၵၢင်းသိူဝ် ၵႂႃႇၼႆယူႇယဝ်ႉ။|တေလႆႈႁဵတ်းႁိုဝ် ဝႆႉဝၢင်းတူၼ်ႈထႅဝ်၊ တေလႆႈႁဵတ်းႁိုဝ်ၶပ်ႉလိၵ်ႈမႅၼ်ႈမႅၼ်ႈၸွမ်းပိူင်၊ တေလႆႈႁဵတ်းႁိုဝ်သႂ်ႇ Footnote , တေႁဵတ်းႁိုဝ် သႂ်ႇၽိုၼ်ဢိင် တႄႇၵႂႃႇၸိူဝ်းၼႆႉ တေလႆႈဝႃႈ ပၼ်တၢင်းႁူႉႁၼ် တၢင်းၼမ်တၢင်းလၢႆယူႇ။ ပဵၼ်ဢၼ်လီမၢႆ လီတွင်းဝႆႉတႄႉတႄႉယူႇယဝ်ႉ။|ၶၢဝ်းတၢင်းသီႇႁူဝ်ႇ|မိူဝ်ႈဝၼ်းထိ 12.04.2018 ၸဝ်ႈသြႃႇဝိၸယႃၽိပႃလ ဢွၼ်ႁူဝ် ၸဝ်ႈပီႈၸဝ်ႈၼွင်ႉ လုၵ်ႈႁဵၼ်းသင်ႇၶတႆး ၊ မိူင်းသီႇႁူဝ်ႇ ၶပ်ႉပပ်ႉလိၵ်ႈ၊ ႁူင်းတူၺ်းလိၵ်ႈ ၸဝ်ႈၵၢင်းသိူဝ် ၶၢဝ်းတၢင်းတႄႇၶပ်ႉပပ်ႉလိၵ်ႈ 
တႄႇဝၼ်းထိ 10 – 20. 04 . 2018 ၼႆယူႇယဝ်ႉ။|ၸိူဝ်းပဵၼ်ၸဝ်ႈပီႈၸဝ်ႈၼွင်ႉ ၸိူဝ်းမီးၶၢဝ်းယၢမ်းတူဝ်ႈတၼ်းၸိူဝ်းၼၼ်ႉ လႄႈ မၢင်ၸိူဝ်းၵေႃႈ တတ်းဢဝ်ၶၢဝ်းယၢမ်း ၼႂ်းၵႄႈဢမ်ႇတၼ်းၼၼ်ႉသေ မႃးၸွႆႈမႃးထႅမ် မိူၼ်ၼင်ႇဝႃႈ ႁူမ်ႈၵၼ်ၵိၼ်ၸင်ႇဝၢၼ် ႁူမ်ႈၵၼ်ႁၢမ်ၸင်ႇမဝ် ၼႆၼၼ်ႉ ယဝ်ႉ။ တွၼ်ႈတႃႇၼိုင်ႈပီၼိုင်ႈပီၼႆႉ ဢၼ်ပဵၼ်ပပ်ႉလိၵ်ႈ ဢင်းၵိတ်ႉ၊ တႆး ၊ ထႆး၊ မၢၼ်ႈ တႄႇၵႂႃႇၸိူဝ်းၼႆႉၵေႃႈ တိူဝ်းၼမ် မႃးတိၵ်းꧦ ၵူႈပီပီလႄႈ လႆႈ Update ပၼ်သဵၼ်ႈမၢႆမၼ်ႈယူႇ ၵူႈပီပီၼႆယဝ်ႉ။ ဢၼ်ၼမ်လိူဝ်ၼႆႉတႄႉ တေပဵၼ်ဢင်းၵိတ်ႉ ယဝ်ႉၶႃႈ၊ ယွၼ်ႉပိူဝ်ႈဝႃႈ ဢင်းၵိတ်ႉၼႆႉ ပဵၼ်လၵ်းထၢၼ် ၵၢၼ်ႁဵၼ်း တွၼ်ႈတႃႇ လုၵ်ႈႁဵၼ်းသင်ႇၶတႆး ၸိူဝ်းဢၼ်မႃး ၶိုၼ်ႈႁဵၼ်း လဵပ်ႈႁဵၼ်း ပႆၸွမ်းၶၢဝ်းတၢင်းသီႇႁူဝ်ႇၼႆႉ|ၼႆလႄႈ ၸဝ်ႈၶူးလူင် Prof. Dr. ၶမ်းမၢႆ ထမ်မသႃမိ ၸင်ႇဢွၼ်ႁူဝ်တႄႇတင်ႈ မႃးပၼ်ႁဝ်းၶႃႈ ၸဝ်ႈပီႈၸဝ်ႈၼွင်ႉ ႁႂ်ႈပေႃးလႆႈတူၺ်းလႆႈ တွင်းလႆႈ ႁႃႈလႆႈၶေႃႈမုၼ်း ၵဵဝ်ႇလူၺ်ႈၵၢၼ်ႁဵၼ်း၊ ဢၼ်ပဵၼ် တၢင်းၸွႆႈထႅမ်လွင်ႈၵၢၼ်ႁဵၼ်းလႆႈ ငၢႆႈငၢႆႈၼႆၼၼ်ႉယူႇယဝ်ႉ၊ ၼႆလႄႈ ၸိူဝ်းပဵၼ်လုၵ်ႈၼွင်ႉၸဝ်ႈၶူးလူင်ၵေႃႈ ၸင်ႇလႆႈသိုပ်ႇ ထိင်းသိမ်း၊ သိုပ်ႇႁဵၼ်းၵၢၼ်ၵႂႃႇ ပၢၼ်သိုပ်ႇပၢၼ်ယူႇၶႃႈယဝ်ႉ။|မိူဝ်ႈဝၼ်းထိ 08.06.2017 ပၢင်ႁူပ်ႉထူပ်း ၸုမ်းဝႆႈၽြႃး ဢၼ်ၸဝ်ႈၶူး မုၼိဝရ (လၢႆးၶႃႈ) လႄႈ ၸဝ်ႈၶူး ၺႃၼဝရ (ၵျွၵ်းမႄး) ဢွၼ်ႁူဝ် တၵ်ႉၵႃသထႃး ဝဵင်းၵျွၵ်းမႄး၊ ဝဵင်းလၢႆးၶႃႈ လႄႈ သင်ႇၶတႆး မိူင်းသီႇႁူဝ်ႇ။|ယိူင်းၸူးလူႇတၢၼ်းပၼ် ၶိင်းသူၺ်ႇၵျွင်းဢၢႆႈၽူင်း ၼိုင်ႈလိူၼ်တဵမ် (AD. 1943-2017)|(ၽူႈဢၼ်လူႇတၢၼ်းယႂ်ႇလူင် ၼႂ်းၽႃႇသႃႇ၊ သႃႇသၼႃႇ၊ လႄႈ ၼႃႈယၵ ၵေႃလိၵ်ႈလၢႆးလႄႈ ၽင်ႈ|ငႄႈတႆး ဝဵင်းလိူဝ်ႇ) ဢၼ်သဵင်ႈၵႂႃႇ တဵမ်ၼိုင်ႈလိူၼ်သေ ၼၢႆးသူၺ်ႇၵျွင်းဢီႇသၢဝ် လႄႈ|လုၵ်ႈလၢၼ်တင်းသဵင်ႈ (မူႇၸေႈ၊ လႃႈသဵဝ်ႈ၊ တႃႈလိူဝ်ႇ၊ တႃႈၵုင်ႈ) လူႇတၢၼ်းၵၢပ်ႈသွမ်းဢေႃႈ။ (10/06/2017)|သွၵ်ႈႁႃ|Search for:|Search|ၽႃႇသႃႇတႆး:|English|Shan|ၿုၻ်ꩪၸဝ်ႈ|ပၵ်းယဵမ်ႈဝၼ်း|December 2019|M|T|W|T|F|S|S|« Sep|1|2 3 4 5 6 7 8|9 10 11 12 13 14 15|16 17 18 19 20 21 22|23 24 25 26 27 28 29|30 31|ၾဵတ်ႉၶၢဝ်းတၢင်းသီႇႁူဝ်ႇ|ပပ်ႉၸဝ်ႈၶူး Dr. 
မႁေႃသထႃလင်ၵႃရႃၽိဝမ်သ|ၶၢဝ်ႇမိူဝ်ႈလဵဝ်|ႁၢင်ႈၽၢင်မၢႆတွင်း ႁပ်ႉၸုမ်ႈၶူး M.A & P.G.D တီႈၸၼ်ႉၸွမ် ၵေႇလၼိယ လႄႈ ပၢင်ႁူပ်ႉထူပ်းလုၵ်ႈႁဵၼ်းသင်ႇၶတႆး မိူင်းသီႇႁူဝ်ႇ|ပွင်ႇလႅင်းထိုင် ၽူႈႁၵ်ႉလိၵ်ႈလၢႆးပၢႆပႄႇႁဝ်းၶဝ်တင်းသဵင်ႈတီႈၶႃႈ ဢမ်ႇပေႃးႁိုင်ပေႃးၼၢၼ်း ဢၼ်တေပဵၼ်ပပ်ႉမၢႆတွင်း ပီၵွၼ်းၶမ်း (50) ပီတဵမ်|ယွၼ်းမူႇလိၵ်ႈၶၢဝ်းတၢင်းသီႇႁူဝ်ႇ မၢႆ-21 ၊ ပီ – 2019 ထိုင် ၸဝ်ႈပီႈၸဝ်ႈၼွင်ႉႁဝ်းၶဝ်တင်းသဵင်ႈ|လုၵ်ႈႁဵၼ်းသင်ႇၶတႆး လႆႈၸုမ်ႈ M.A တီႈၸၼ်ႉၸွမ် Buddhist & Pāli Unversity မိူင်းသီႇရိလင်းၵႃ၊ သီႇၸဝ်ႈ|သဵၼ်ႈမၢႆၸဝ်ႈပီႈၸဝ်ႈၼွင်ႉ ၸိူဝ်းဢွင်ႇပူၼ်ႉၸၼ်ႉလိၵ်ႈၼႂ်းပီ 2018|ၶေႃႈပၼ်တၢင်းၶႆႈၸႂ်|ႁွင်ႈၶၢဝ်ႇ|September 2019|August 2019|June 2019|February 2019|December 2018|November 2018|October 2018|September 2018|August 2018|July 2018|May 2018|April 2018|March 2018|December 2017|November 2017|October 2017|September 2017|August 2017|July 2017|June 2017|May 2017|မူႇၵေႃ|ၵၢၼ်တူင်ႉၼိုင်|ၶၢဝ်ႇၼႂ်းမိူင်း|News|Uncategorized|ၶၢဝ်ႇၼွၵ်ႈမိူင်း|Meta|Log in|Entries RSS|Comments RSS|WordPress.org|တူဝ်ၼပ်ႉၵူၼ်းမႃးယဵမ်ႈ|Copyright 2015 © 2017 Cjtaisangha.com |Makutarama Temple 42/15 Reservoir Road, Dematagoda, Colombo 09, Sri Lanka Tel: 009 411 2662488 Email: Lankajourney@yahoo.co.uk |Ribosome by GalussoThemes.com|Powered by WordPress|\n', 'Target_URL': 'http://www.cjtaisangha.com/activities/'}
```

Member

Can you update this example now that we're using the Translation feature type please ?

Contributor Author

Yes. I missed the README. Sorry.

An instance of `sentences` type:

```
{'LASER_similarity': 1.2734256982803345, 'Source_Sentence': '>>> PhD Students in 2018', 'Target_Sentence': '>>> လုၵ်ႈႁဵၼ်းၸၼ်ႉ Ph.D ၼႂ်းပီ 2018', 'from_english': True}
```

Member

same here

- `Source_URL`: a `string` feature containing the source URL.
- `Source_Content`: a `string` feature containing the content on Source_URL.
- `Target_URL`: a `string` feature containing the target URL.
- `Target_Content`: a `string` feature containing the content on Target_URL.
Member

The Source_Content and Target_Content fields are now replaced by a field translation with two subfields.
One subfield is en_XX and the other one is the other language code.
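
As an illustration of that change, the sketch below converts a record from the old two-field layout to a single `translation` field. Only the `translation` field and the `en_XX` subfield come from the comment above; the helper name, the sample values, and the assumption that the source side is always English are hypothetical.

```python
def to_translation_format(old_record, source_key, target_key, target_code):
    """Replace the two content fields with one `translation` dict keyed by language code."""
    new_record = {
        key: value
        for key, value in old_record.items()
        if key not in (source_key, target_key)
    }
    new_record["translation"] = {
        "en_XX": old_record[source_key],  # assumption: the source side is English
        target_code: old_record[target_key],
    }
    return new_record


# Made-up sample values in the old `documents` layout.
old = {
    "Domain": "example.com",
    "Source_URL": "http://example.com/en/",
    "Source_Content": "Hello",
    "Target_URL": "http://example.com/",
    "Target_Content": "Hallo",
}

new = to_translation_format(old, "Source_Content", "Target_Content", "af_ZA")
print(new["translation"])
```

The same helper would apply to the `sentences` layout by passing `Source_Sentence` and `Target_Sentence` as the keys.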

For `sentences` type:

- `Source_Sentence`: a `string` feature containing the source sentence.
- `Target_Sentence`: a `string` feature containing the target sentence.
Member

same here

@gchhablani
Contributor Author

Hi @lhoestq,

I have changed the README, and added a single example per config. Even one example is long enough to make the files heavy. Hope that isn't an issue.

Thanks,
Gunjan

Member

@lhoestq lhoestq left a comment

Thanks for the changes !
LGTM :)

@lhoestq lhoestq merged commit cf3ce1f into huggingface:master Mar 1, 2021
@gchhablani
Contributor Author

gchhablani commented Mar 1, 2021

Hi @lhoestq,

Thanks for approving.

Successfully merging this pull request may close these issues.

Ccaligned multilingual translation dataset
2 participants