small_parallel_enja: 50k En/Ja Parallel Corpus for Testing SMT Methods

This directory includes a small parallel corpus for the English-Japanese translation task. The data are extracted from the TANAKA Corpus by restricting sentence length to 4 to 16 words.

English sentences are tokenized using the Stanford Tokenizer and lowercased. Japanese sentences are tokenized using KyTea.

All texts are encoded in UTF-8. The sentence separator is '\n' and the word separator is ' '.
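The file format described above can be read with a few lines of Python. This is a minimal sketch, not part of the original corpus tooling: the function names `parse_sentences` and `load_parallel` are hypothetical, and it assumes that line i of the English file is aligned with line i of the Japanese file.

```python
from pathlib import Path


def parse_sentences(text):
    """Split corpus text into sentences (on '\n') and tokens (on ' ')."""
    # Drop the single trailing newline that ends the last sentence, if any.
    if text.endswith("\n"):
        text = text[:-1]
    if not text:
        return []
    # Split on the literal separators named in the README, not on
    # arbitrary whitespace, and ignore empty strings from doubled spaces.
    return [[tok for tok in line.split(" ") if tok] for line in text.split("\n")]


def load_parallel(en_path, ja_path):
    """Return a list of (english_tokens, japanese_tokens) pairs.

    Assumption: both files have the same number of lines and are
    aligned line by line.
    """
    en = parse_sentences(Path(en_path).read_text(encoding="utf-8"))
    ja = parse_sentences(Path(ja_path).read_text(encoding="utf-8"))
    if len(en) != len(ja):
        raise ValueError(f"line count mismatch: {len(en)} vs {len(ja)}")
    return list(zip(en, ja))
```

For example, `load_parallel("train.en", "train.ja")` would return one token-list pair per sentence, assuming the files are in the current directory.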

Attention: some English words are tokenized differently from the Stanford Tokenizer's output, e.g., "don't" -> "don" "'t", which may have come from preprocessing errors. Take care when using this dataset for token-level evaluation.
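If token-level comparability matters, one option is to rewrite such pairs before evaluation. The sketch below is an assumption-laden illustration, not part of the dataset: `normalize_nt` is a hypothetical helper that turns a token ending in "n" followed by "'t" into the Penn Treebank style split ("do" "n't"). Only the "don't" case is confirmed by the note above; whether other contractions are affected in the same way is not stated.

```python
def normalize_nt(tokens):
    """Rewrite pairs like ["don", "'t"] as ["do", "n't"].

    Assumption: every "<word ending in n>" + "'t" pair is a negative
    contraction that was split in the wrong place.
    """
    out = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        if nxt == "'t" and len(tok) > 1 and tok.endswith("n"):
            # Move the trailing "n" from the first token to the second.
            out.append(tok[:-1])
            out.append("n't")
            i += 2
        else:
            out.append(tok)
            i += 1
    return out
```

Apply the same normalization to both system output and references so that token-level scores are computed over consistent tokens.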

Corpus Statistics

| File | #sentences | #words | #vocabulary |
|:--|--:|--:|--:|
| train.en | 50,000 | 391,047 | 6,634 |
| - train.en.000 | 10,000 | 78,049 | 3,447 |
| - train.en.001 | 10,000 | 78,223 | 3,418 |
| - train.en.002 | 10,000 | 78,427 | 3,430 |
| - train.en.003 | 10,000 | 78,118 | 3,402 |
| - train.en.004 | 10,000 | 78,230 | 3,405 |
| train.ja | 50,000 | 565,618 | 8,774 |
| - train.ja.000 | 10,000 | 113,209 | 4,181 |
| - train.ja.001 | 10,000 | 112,852 | 4,102 |
| - train.ja.002 | 10,000 | 113,044 | 4,105 |
| - train.ja.003 | 10,000 | 113,346 | 4,183 |
| - train.ja.004 | 10,000 | 113,167 | 4,174 |
| dev.en | 500 | 3,931 | 816 |
| dev.ja | 500 | 5,668 | 894 |
| test.en | 500 | 3,998 | 839 |
| test.ja | 500 | 5,635 | 884 |
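The three statistics columns can be recomputed from any of the files with a short script. This is a sketch under assumptions: `corpus_stats` is a hypothetical name, and "#vocabulary" is taken to mean the number of distinct token types, which the README does not state explicitly.

```python
from collections import Counter


def corpus_stats(text):
    """Return (#sentences, #words, #vocabulary) for one corpus file's text.

    Assumption: vocabulary size is the count of distinct tokens.
    """
    if text.endswith("\n"):
        text = text[:-1]
    lines = text.split("\n") if text else []
    counts = Counter()
    for line in lines:
        # Tokens are separated by a single space; skip empty strings.
        counts.update(tok for tok in line.split(" ") if tok)
    return len(lines), sum(counts.values()), len(counts)


if __name__ == "__main__":
    import sys

    # Usage: python stats.py train.en train.ja ...
    for path in sys.argv[1:]:
        with open(path, encoding="utf-8") as f:
            n_sent, n_words, n_vocab = corpus_stats(f.read())
        print(f"{path}\t{n_sent}\t{n_words}\t{n_vocab}")
```

If the counting assumptions match those used for the table, running this on `train.en` should reproduce the first row; a mismatch would suggest a different definition of vocabulary or word.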