- Please check the release section for the latest version, also available at the Center for Computational Linguistics
- A complete description of this resource is available here: A Corpus of Native, Non-native and Translated Texts, LREC, 2016, PDF
- For the raw corpus, please check the dataset available here
- For the experiments presented in the ACL 2016 paper, please check the dataset available here
- For the experiments presented in the LREC 2016 paper, please check the dataset available here
- This is a monolingual English corpus of native, non-native and (human) translated texts extracted from the European Parliament. The translated texts from different source languages represent a subset of the Haifa Corpus of Translationese. We preserved the same annotation style and included an ID and the EU state that each member of the European Parliament represents.
- We hope this dataset will facilitate a unified comparative study of translations and language produced by highly fluent non-native speakers, two closely-related phenomena that have only been studied in isolation so far.
- For updates, please check the official repository
If you use this work in your research, please cite:
@InProceedings{enntt-corpus,
author = {Sergiu Nisioi and Ella Rabinovich and Liviu P. Dinu and Shuly Wintner},
title = {A Corpus of Native, Non-native and Translated Texts},
booktitle = {Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016)},
year = {2016},
month = {may},
date = {23-28},
location = {Portoro\v{z}, Slovenia},
publisher = {European Language Resources Association (ELRA)},
isbn = {978-2-9517408-9-1},
language = {english}
}
- *.tok files contain the actual text, either uttered in English by natives and non-natives or translated into English from other languages
- *.dat files contain the annotations corresponding to each line in the *.tok files.
- NAME - speaker's name as it appears in the written session
- LANGUAGE - original language in which the sentence was uttered
- SESSION_ID - the name of the corresponding protocol source file
- SEQ_SPEAKER_ID - sequential number of the speaker within a session
- STATE - the EU state represented by the MEP
- MEPID - the ID used by the Europarl website to display the MEP's online image
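
Since each line of a .dat file annotates the line with the same number in the matching .tok file, the two files can be read in parallel. The following is a minimal sketch, not an official loader. It assumes that every annotation line stores its fields as `KEY="value"` pairs, which should be checked against the actual files. The function names and file names are hypothetical.

```python
import re

# Assumption: each .dat line stores its fields as KEY="value" pairs,
# e.g. LANGUAGE="EN" STATE="Ireland". Adjust the pattern if the
# actual files use a different layout.
ATTRIBUTE = re.compile(r'(\w+)="([^"]*)"')


def parse_annotation(line):
    """Return the KEY="value" pairs found on one .dat line as a dict."""
    return dict(ATTRIBUTE.findall(line))


def read_aligned(tok_path, dat_path):
    """Pair every sentence in a .tok file with the annotation found
    on the same line number of the corresponding .dat file."""
    with open(tok_path, encoding="utf-8") as tok:
        sentences = [line.rstrip("\n") for line in tok]
    with open(dat_path, encoding="utf-8") as dat:
        annotations = [line.rstrip("\n") for line in dat]

    # The files are line-aligned, so a length mismatch means they do
    # not belong together or one of them is truncated.
    if len(sentences) != len(annotations):
        raise ValueError(
            f"{tok_path} has {len(sentences)} lines, "
            f"but {dat_path} has {len(annotations)}"
        )
    return [
        (sentence, parse_annotation(annotation))
        for sentence, annotation in zip(sentences, annotations)
    ]
```

Under the same assumption, the pairs can be filtered on any annotation field, for example `[text for text, meta in read_aligned("sample.tok", "sample.dat") if meta.get("LANGUAGE") == "FR"]` to keep only sentences originally uttered in French. The language code `FR` is itself an assumption about how the LANGUAGE field is encoded.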
For more details about this dataset, contact sergiu.nisioi at gmail com or ellarabi at csweb dot haifa dot ac dot il