w11wo/nlp-datasets

NLP Datasets

A collection of various NLP datasets, mainly for Indonesia-related languages. The datasets are split into two groups: pre-training corpora and fine-tuning datasets.


Pre-training Corpora

English IMDb movie review dataset, translated to Javanese using the multilingual MarianMT Transformer model Helsinki-NLP/opus-mt-en-mul.

Split (47.5 MB):

  • 25,000 Train
  • 25,000 Test
  • 50,000 Unsupervised
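
As a rough illustration of how such a translation could be produced, here is a minimal sketch using the Hugging Face transformers library. It is not the script used to build this dataset. The `>>jav<<` target-language token, the helper names `add_target_token` and `translate`, and the batching choices are all assumptions; multilingual OPUS-MT models generally expect a token of this form at the start of each source sentence, so check the model card for the exact language code.

```python
from typing import List

MODEL_NAME = "Helsinki-NLP/opus-mt-en-mul"
# Assumption: ">>jav<<" selects Javanese as the target language.
TARGET_LANG_TOKEN = ">>jav<<"


def add_target_token(sentences: List[str], token: str = TARGET_LANG_TOKEN) -> List[str]:
    """Prefix each sentence with the target-language token (hypothetical helper)."""
    return [f"{token} {sentence.strip()}" for sentence in sentences]


def translate(sentences: List[str]) -> List[str]:
    """Translate English sentences with MarianMT (hypothetical helper)."""
    # Imported lazily so add_target_token works without transformers installed.
    from transformers import MarianMTModel, MarianTokenizer

    tokenizer = MarianTokenizer.from_pretrained(MODEL_NAME)
    model = MarianMTModel.from_pretrained(MODEL_NAME)
    batch = tokenizer(
        add_target_token(sentences),
        return_tensors="pt",
        padding=True,
        truncation=True,
    )
    generated = model.generate(**batch)
    return tokenizer.batch_decode(generated, skip_special_tokens=True)
```

Calling `translate(["This movie was great."])` downloads the model on first use and returns a list with one translated string. Long reviews would likely need to be split into sentences first, since inputs beyond the model's maximum length are truncated.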

Javanese Wikipedia documents from Wikidump. Collected in December 2020.

Split (319 MB):

  • 80,067 Unsupervised

Javanese Wikipedia documents from Wikidump. Collected in June 2021.

Split (342 MB):

  • 84,507 Unsupervised

Sundanese Wikipedia documents from Wikidump. Collected in June 2021.

Split (279 MB):

  • 68,920 Unsupervised

Minangkabau Wikipedia documents from Wikidump. Collected in June 2021.

Split (911 MB):

  • 229,869 Unsupervised

"OSCAR or Open Super-large Crawled Aggregated coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture."

Split^:

  • Indonesian (16 GB)
  • Javanese (583 KB)
  • Sundanese (141 KB)
  • Minangkabau (310 KB)

*external resource

^deduplicated


"This corpus comprises of monolingual data for 100+ languages [...] This was constructed using the urls and paragraph indices provided by the CC-Net repository by processing January-December 2018 Commoncrawl snapshots."

Split:

  • Indonesian (36 GB)
  • Javanese (37 MB)
  • Sundanese (15 MB)

*external resource


"A colossal, cleaned version of Common Crawl's web crawl corpus. Based on Common Crawl dataset."

Split:

  • Indonesian (242 GB)
  • Javanese (876 MB)
  • Sundanese (464 MB)

*external resource


Fine-tuning Datasets

O. V. Putra, F. M. Wasmanson, T. Harmini, and S. N. Utama, “Sundanese twitter dataset for emotion classification,” virtual, Nov. 2020.

Split (235 KB):

  • 2,518 Train
  • 12 Test

*external resource


TBA
