
Javanese Text

[Dataset Download]

This project extracts Javanese text from existing data. The data comes from:

CC100
data.statmt.org
Wikipedia

Process

Detect the language with CLD3
Score each sentence with KenLM against Javanese Wikipedia
Deduplicate with a simple awk script
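
The deduplication step can be illustrated with a minimal Python sketch. The exact awk command is not given here, so this assumes exact-match, line-level deduplication that keeps the first occurrence of each line, which is what the common `awk '!seen[$0]++'` idiom does. The function name `dedupe_lines` is hypothetical and used only for illustration.

```python
from typing import Iterable, Iterator


def dedupe_lines(lines: Iterable[str]) -> Iterator[str]:
    """Yield each distinct line once, in order of first appearance.

    Assumption: duplicates are exact string matches after the trailing
    newline is removed (awk's $0 also excludes the record separator).
    """
    seen = set()
    for line in lines:
        key = line.rstrip("\n")
        if key not in seen:
            seen.add(key)
            yield key


# Small usage example with made-up sentences.
sample = ["Sugeng enjing\n", "Matur nuwun\n", "Sugeng enjing\n"]
print(list(dedupe_lines(sample)))  # ['Sugeng enjing', 'Matur nuwun']
```

Like the awk idiom, this keeps every distinct line in memory, so for very large corpora storing a hash of each line instead of the line itself may be a reasonable trade-off.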

Citations

@inproceedings{wenzek2020ccnet,
  title={CCNet: Extracting High Quality Monolingual Datasets from Web Crawl Data},
  author={Wenzek, Guillaume and Lachaux, Marie-Anne and Conneau, Alexis and Chaudhary, Vishrav and Guzm{\'a}n, Francisco and Joulin, Armand and Grave, {\'E}douard},
  booktitle={Proceedings of The 12th Language Resources and Evaluation Conference},
  pages={4003--4012},
  year={2020}
}
