Tatoeba Challenge Data - Wikimedia data

This is part of the Tatoeba Translation Challenge data set. The following monolingual data sets are extracted from CirrusSearch dumps of Wikimedia projects, including:

  • Wikipedia
  • Wikibooks
  • Wikinews
  • Wikiquote
  • Wikisource

All data sets are UTF-8 plain text with one sentence per line. Two downloads are provided: a deduplicated, shuffled version, and a complete version in which empty lines mark document boundaries. Simple pre-processing (Unicode character normalisation and filtering based on language identification) has been applied to reduce noise. The extraction scripts are part of OPUS-MT.
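The file layout described above can be consumed with a few lines of Python. The following is a minimal sketch, not the OPUS-MT extraction code: the function names `read_documents` and `dedup_and_shuffle` are hypothetical, and the use of NFC as the normalisation form is an assumption, since the exact form applied to the released data is not stated here.

```python
import random
import unicodedata
from typing import Iterable, Iterator, List


def read_documents(lines: Iterable[str]) -> Iterator[List[str]]:
    """Group sentence lines into documents, splitting on empty lines."""
    doc: List[str] = []
    for raw in lines:
        sentence = raw.rstrip("\n")
        if sentence.strip():
            doc.append(sentence)
        elif doc:
            # An empty line closes the current document.
            yield doc
            doc = []
    if doc:
        # The last document may not be followed by an empty line.
        yield doc


def dedup_and_shuffle(docs: Iterable[List[str]], seed: int = 0) -> List[str]:
    """Flatten documents into unique sentences and shuffle them.

    NFC normalisation is an assumption made for this sketch.
    """
    seen = set()
    unique: List[str] = []
    for doc in docs:
        for sentence in doc:
            key = unicodedata.normalize("NFC", sentence)
            if key not in seen:
                seen.add(key)
                unique.append(key)
    # A fixed seed keeps the shuffle reproducible.
    random.Random(seed).shuffle(unique)
    return unique
```

To use it on a decompressed copy of the complete download, pass an open file handle, for example `read_documents(open(path, encoding="utf-8"))`, where `path` is whatever local file name you chose. Shuffling discards document order, which is why the deduplicated download carries no document boundaries.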