Ccaligned multilingual translation dataset #1756

Closed
flozi00 opened this issue Jan 20, 2021 · 0 comments · Fixed by #1815
Labels
dataset request: Requesting to add a new dataset

Comments

flozi00 commented Jan 20, 2021

Adding a Dataset

  • Name: CCAligned
  • Description: short description of the dataset (or link to social media or blog post)
  • CCAligned consists of parallel or comparable web-document pairs in 137 languages aligned with English. These web-document pairs were constructed by performing language identification on raw web documents and ensuring that the corresponding language codes appeared in the URLs of the web documents. This pattern-matching approach yielded more than 100 million aligned documents paired with English. Because each English document was often aligned to multiple documents in different target languages, we can join on English documents to obtain aligned documents that directly pair two non-English documents (e.g., Arabic-French).
  • Paper: link to the dataset paper if available
  • https://www.aclweb.org/anthology/2020.emnlp-main.480.pdf
  • Data: link to the Github repository or current dataset location
  • http://www.statmt.org/cc-aligned/
  • Motivation: what are some good reasons to have this dataset
  • The authors say it is a high-quality dataset.
  • It is pretty large and includes many language pairs. It could be interesting to train mT5 on this task.
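
The "join on English documents" step from the description can be sketched roughly as below. This is only an illustration of the idea, not the authors' pipeline or the actual CCAligned file format: the `(english_doc, lang, doc)` tuple layout, the function name `pivot_through_english`, and the sample identifiers are all assumptions made up for this example.

```python
from collections import defaultdict
from itertools import combinations


def pivot_through_english(aligned):
    """Pair non-English documents that share the same English document.

    `aligned` is an iterable of (english_doc, lang, doc) tuples. This layout
    is a hypothetical stand-in, not the real CCAligned format.
    """
    # Group every non-English document under the English document it aligns to.
    by_english = defaultdict(list)
    for english_doc, lang, doc in aligned:
        by_english[english_doc].append((lang, doc))

    pairs = []
    for docs in by_english.values():
        # Every pair of documents aligned to the same English page is a
        # candidate direct pair, as long as the two languages differ.
        for (lang_a, doc_a), (lang_b, doc_b) in combinations(sorted(docs), 2):
            if lang_a != lang_b:
                pairs.append(((lang_a, doc_a), (lang_b, doc_b)))
    return pairs


# Made-up sample: one English page has Arabic and French counterparts,
# another has only a French counterpart and therefore yields no pair.
sample = [
    ("en/page1", "ar", "ar/page1"),
    ("en/page1", "fr", "fr/page1"),
    ("en/page2", "fr", "fr/page2"),
]

for pair in pivot_through_english(sample):
    print(pair)
```

With the sample above, only the Arabic and French pages that align to the same English page end up paired, which is the Arabic-French case mentioned in the description.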

Instructions to add a new dataset can be found here.

flozi00 added the dataset request label on Jan 20, 2021