A collection of NLP datasets, mainly for Indonesian and the regional languages of Indonesia. The datasets fall into two groups: pre-training corpora and fine-tuning datasets.
English IMDb movie review dataset translated to Javanese using the multilingual MarianMT Transformer model Helsinki-NLP/opus-mt-en-mul.
Split (47.5 MB):
- 25,000 Train
- 25,000 Test
- 50,000 Unsupervised
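The translation step described above can be sketched roughly as follows. This is a minimal illustration, not the pipeline actually used to build the dataset: it assumes the Hugging Face `transformers` library, and it assumes `>>jav<<` is the target-language token for Javanese (multilingual-target OPUS-MT models expect such a prefix; the valid tokens can be checked via `tokenizer.supported_language_codes`). The function names are hypothetical.

```python
from typing import List


def add_target_token(texts: List[str], lang_token: str = ">>jav<<") -> List[str]:
    """Prefix each sentence with the target-language token.

    The default token is an assumption; verify it against
    tokenizer.supported_language_codes before relying on it.
    """
    return [f"{lang_token} {text}" for text in texts]


def translate(texts: List[str], model_name: str = "Helsinki-NLP/opus-mt-en-mul") -> List[str]:
    """Translate English sentences with a MarianMT model (downloads weights on first use)."""
    # Imported lazily so add_target_token works without transformers installed.
    from transformers import MarianMTModel, MarianTokenizer

    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)
    batch = tokenizer(
        add_target_token(texts),
        return_tensors="pt",
        padding=True,
        truncation=True,
    )
    generated = model.generate(**batch)
    return tokenizer.batch_decode(generated, skip_special_tokens=True)
```

Calling `translate(["This movie was great."])` would return a list with one translated string; long reviews would likely need to be split into sentences first, since inputs are truncated to the model's maximum length.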
Javanese Wikipedia documents from Wikidump. Collected in December 2020.
Split (319 MB):
- 80,067 Unsupervised
Javanese Wikipedia documents from Wikidump. Collected in June 2021.
Split (342 MB):
- 84,507 Unsupervised
Sundanese Wikipedia documents from Wikidump. Collected in June 2021.
Split (279 MB):
- 68,920 Unsupervised
Minangkabau Wikipedia documents from Wikidump. Collected in June 2021.
Split (911 MB):
- 229,869 Unsupervised
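For readers who want to fetch comparable Wikipedia dumps themselves, the sketch below builds a download URL from a Wikipedia language code. It assumes the usual layout of dumps.wikimedia.org (database name is the language code with hyphens replaced by underscores, followed by `wiki`) and uses the `latest` snapshot, so document counts will differ from the December 2020 and June 2021 figures above. The helper name and the code table are illustrative, not part of this collection.

```python
# Wikipedia language codes for the three languages listed above.
WIKI_CODES = {"Javanese": "jv", "Sundanese": "su", "Minangkabau": "min"}


def dump_url(lang_code: str) -> str:
    """Return the assumed URL of the latest article dump for a Wikipedia language code."""
    # Database names use underscores where language codes use hyphens.
    db_name = lang_code.replace("-", "_") + "wiki"
    return (
        f"https://dumps.wikimedia.org/{db_name}/latest/"
        f"{db_name}-latest-pages-articles.xml.bz2"
    )


if __name__ == "__main__":
    for language, code in WIKI_CODES.items():
        print(language, dump_url(code))
```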
"OSCAR or Open Super-large Crawled Aggregated coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture."
Split^:
- Indonesian (16 GB)
- Javanese (583 KB)
- Sundanese (141 KB)
- Minangkabau (310 KB)
*external resource
^deduplicated
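One possible way to load the deduplicated OSCAR subsets listed above is through the Hugging Face `datasets` library. The sketch assumes the `oscar` dataset there, with configuration names of the form `unshuffled_deduplicated_<code>`; whether it loads as written depends on the installed `datasets` version, and the helper names are hypothetical.

```python
# Language codes assumed for the four OSCAR subsets listed above.
OSCAR_CODES = {
    "Indonesian": "id",
    "Javanese": "jv",
    "Sundanese": "su",
    "Minangkabau": "min",
}


def oscar_config(lang_code: str, deduplicated: bool = True) -> str:
    """Build the OSCAR configuration name for a language code."""
    variant = "deduplicated" if deduplicated else "original"
    return f"unshuffled_{variant}_{lang_code}"


def load_oscar(lang_code: str, streaming: bool = True):
    """Load one OSCAR subset; streaming avoids downloading the 16 GB Indonesian split up front."""
    # Imported lazily so oscar_config can be used without datasets installed.
    from datasets import load_dataset

    return load_dataset(
        "oscar", oscar_config(lang_code), split="train", streaming=streaming
    )
```

For example, `load_oscar(OSCAR_CODES["Javanese"])` would request the `unshuffled_deduplicated_jv` configuration.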
"This corpus comprises of monolingual data for 100+ languages [...] This was constructed using the urls and paragraph indices provided by the CC-Net repository by processing January-December 2018 Commoncrawl snapshots."
Split:
- Indonesian (36 GB)
- Javanese (37 MB)
- Sundanese (15 MB)
*external resource
"A colossal, cleaned version of Common Crawl's web crawl corpus. Based on Common Crawl dataset."
Split:
- Indonesian (242 GB)
- Javanese (876 MB)
- Sundanese (464 MB)
*external resource
O. V. Putra, F. M. Wasmanson, T. Harmini, and S. N. Utama, “Sundanese twitter dataset for emotion classification,” virtual, Nov. 2020.
Split (235 KB):
- 2,518 Train
- 12 Test
*external resource
- Wikipedia documents from other regional languages
  - Minangkabau (min)
  - Banyumasan/Basa Banyumasan (map-bms)
  - Acehnese (ace)
  - Gorontalo/Bahasa Hulontalo (gor)
  - Balinese (ban)
  - Banjar (bjn)
  - Madurese (mad)
  - Toba Batak Language (bbc)
- Online news
  - Solopos Jagad Jawa (Javanese)
  - BewaraJabar Warta Sunda (Sundanese)
- Classification Datasets
  - Indonesian
  - Javanese
  - Sundanese