Small Japanese-English Subtitle Corpus. Sentences are extracted from JESC: Japanese-English Subtitle Corpus and filtered to lengths of 4 to 16 words.
Both the Japanese and English sentences are tokenized with StanfordNLP (v0.2.0).
All texts are encoded in UTF-8. The sentence separator is '\n' and the word separator is ' '.
Additionally, all tokenized data can be downloaded from here.
File | #sentences | #words | #vocabulary |
---|---|---|---|
train.en | 100,000 | 809,353 | 29,682 |
train.ja | 100,000 | 808,157 | 46,471 |
dev.en | 1,000 | 8,025 | 1,827 |
dev.ja | 1,000 | 8,163 | 2,340 |
test.en | 1,000 | 8,057 | 1,805 |
test.ja | 1,000 | 8,084 | 2,306 |
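Because the files are line-aligned (one tokenized sentence per line, words separated by a single space), loading them as parallel pairs is straightforward. Below is a minimal sketch; the `load_parallel` helper and the demo sentences are illustrative, not part of the corpus.

```python
from pathlib import Path
import tempfile

def load_parallel(src_path, tgt_path):
    """Read two line-aligned files: '\n' separates sentences, ' ' separates words."""
    with open(src_path, encoding="utf-8") as fs, open(tgt_path, encoding="utf-8") as ft:
        src = [line.rstrip("\n").split(" ") for line in fs]
        tgt = [line.rstrip("\n").split(" ") for line in ft]
    assert len(src) == len(tgt), "source and target files must have the same number of lines"
    return list(zip(src, tgt))

# Tiny demo with made-up sentences (not actual corpus lines).
tmp = Path(tempfile.mkdtemp())
(tmp / "dev.en").write_text("thank you very much\nsee you later\n", encoding="utf-8")
(tmp / "dev.ja").write_text("ありがとう ござい ます\nまた ね\n", encoding="utf-8")
pairs = load_parallel(tmp / "dev.en", tmp / "dev.ja")
```

The same split on ' ' also reproduces the #words and #vocabulary columns above when run over the real files.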
This repo is inspired by small_parallel_enja.