Create dataset loader for Indo_MultiModal_CC_12M #307

Open
SamuelCahyawijaya opened this issue Oct 2, 2022 · 1 comment

@SamuelCahyawijaya
Member

NusaCatalogue: https://indonlp.github.io/nusa-catalogue/card.html?id_mm_cc_12m

Dataset: id_mm_cc_12m
Description: Conceptual 12M (CC12M) is a dataset of 12 million image-text pairs intended for vision-and-language pre-training. Its data collection pipeline is a relaxed version of the one used for Conceptual Captions 3M (CC3M). Indo_MultiModal_CC_12M is the Indonesian-language version.
License: The dataset may be freely used for any purpose, although acknowledgement of Google LLC ("Google") as the data source would be appreciated. The dataset is provided "AS IS" without any warranty, express or implied. Google disclaims all liability for any damages, direct or indirect, resulting from the use of the dataset.
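
As a starting point for the loader, here is a minimal sketch of reading image-text pairs. It assumes the data arrives as a two-column, tab-separated file (image URL, then caption); this issue does not state the actual file layout of Indo_MultiModal_CC_12M, so the layout, the function name `iter_image_text_pairs`, and the field names `id`, `image_url`, and `caption` are all hypothetical and would need to be adapted to the real files.

```python
import csv
import io
from typing import Dict, Iterator, TextIO


def iter_image_text_pairs(handle: TextIO) -> Iterator[Dict[str, str]]:
    """Yield one example per well-formed row of a tab-separated file.

    Assumed (hypothetical) layout: <image_url> TAB <caption>.
    """
    # QUOTE_NONE keeps quotation marks inside captions as literal text.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for idx, row in enumerate(reader):
        if len(row) != 2:
            # Skip rows that do not have exactly two columns.
            continue
        image_url, caption = row
        # The id is the row index in the file, counting skipped rows too.
        yield {"id": str(idx), "image_url": image_url, "caption": caption}


# Tiny in-memory sample standing in for the real file (made-up rows).
sample = io.StringIO(
    "https://example.com/a.jpg\tseekor kucing di atas sofa\n"
    "malformed row without a tab\n"
    "https://example.com/b.jpg\tpemandangan gunung saat matahari terbit\n"
)
examples = list(iter_image_text_pairs(sample))
```

A generator like this could be called from the example-generation step of a dataset loader, so that the 12 million pairs are streamed rather than held in memory at once.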
@acul3
Contributor

acul3 commented Oct 4, 2022

#self-assign
