Javanese Text

[Dataset Download]

Extracting Javanese text from available data. The data used comes from:

Process

  1. Detect the language of each line with CLD3
  2. Score each sentence with a KenLM language model built from Javanese Wikipedia
  3. Deduplicate with a simple awk command
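
The three steps above can be sketched in Python as a single filtering pass. This is a minimal illustration, not the code used to build the dataset: the function name `filter_javanese`, the `min_score` threshold, and the two callables are hypothetical. In practice `detect_language` would wrap a CLD3 call and `score_sentence` would wrap a KenLM model; here they are passed in so the sketch runs without either library. The README does not say how the KenLM scores are used, so the cut-off is an assumption.

```python
from typing import Callable, Iterable, Iterator


def filter_javanese(
    lines: Iterable[str],
    detect_language: Callable[[str], str],
    score_sentence: Callable[[str], float],
    min_score: float,
) -> Iterator[str]:
    """Yield unique lines detected as Javanese that score at least min_score.

    detect_language: hypothetical wrapper returning a language code ("jv" for Javanese).
    score_sentence: hypothetical wrapper returning a language-model score
                    (higher means more fluent, e.g. a log-probability).
    min_score: assumed cut-off; the source does not specify one.
    """
    seen = set()  # exact-match dedup, keeping the first occurrence of each line
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip empty lines
        if detect_language(line) != "jv":
            continue  # step 1: language identification
        if score_sentence(line) < min_score:
            continue  # step 2: language-model quality filter
        if line in seen:
            continue  # step 3: deduplication
        seen.add(line)
        yield line
```

The `seen` set performs exact-match deduplication that keeps the first occurrence, which is the behaviour of a typical awk one-liner such as `awk '!seen[$0]++'`. The README does not state which awk command was actually used.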

Citations

@inproceedings{wenzek2020ccnet,
  title={CCNet: Extracting High Quality Monolingual Datasets from Web Crawl Data},
  author={Wenzek, Guillaume and Lachaux, Marie-Anne and Conneau, Alexis and Chaudhary, Vishrav and Guzm{\'a}n, Francisco and Joulin, Armand and Grave, {\'E}douard},
  booktitle={Proceedings of The 12th Language Resources and Evaluation Conference},
  pages={4003--4012},
  year={2020}
}