The WikiANN dataset, presented by Pan et al. (2017), is a semi-automatically annotated named entity recognition dataset built from Turkish Wikipedia articles.
It is annotated with three entity types: Person, Location, and Organization.
WikiANN | Training | Validation | Test |
---|---|---|---|
Location | 9679 | 5014 | 4914 |
Organization | 7970 | 4129 | 4154 |
Person | 8833 | 4374 | 4519 |
Total words | 149786 | 75930 | 75731 |
The dataset is provided in BIO format, one token per line, with each token followed by its tag:
Bu O
törene O
Usher B-ORG
, O
Mariah B-PER
Carey I-PER
, O
Lionel B-PER
Richie I-PER
, O
Jennifer B-PER
Hudson I-PER
, O
Brooke B-PER
Shields I-PER
gibi O
ünlüler O
katıldı O
. O
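The format above can be read with a few lines of Python. The following is a minimal sketch, not the loader used by the dataset authors. It assumes that token and tag are separated by whitespace and that sentences are separated by blank lines; neither is stated explicitly here. The function names `read_bio` and `extract_entities` are hypothetical.

```python
from typing import Iterable, List, Tuple

Token = Tuple[str, str]  # (word, tag)


def read_bio(lines: Iterable[str]) -> List[List[Token]]:
    """Group 'word TAG' lines into sentences.

    Assumption: a blank line marks a sentence boundary.
    """
    sentences: List[List[Token]] = []
    current: List[Token] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            if current:
                sentences.append(current)
                current = []
            continue
        parts = line.rsplit(maxsplit=1)  # the tag is the last whitespace-separated field
        if len(parts) != 2:
            continue  # skip malformed lines that lack a tag
        current.append((parts[0], parts[1]))
    if current:
        sentences.append(current)
    return sentences


def extract_entities(sentence: List[Token]) -> List[Tuple[str, str]]:
    """Collect (entity text, entity type) pairs from one BIO-tagged sentence."""
    entities: List[Tuple[str, str]] = []
    words: List[str] = []
    label = None
    for word, tag in sentence:
        if tag.startswith("B-"):
            # A B- tag always starts a new entity, closing any open one.
            if words:
                entities.append((" ".join(words), label))
            words, label = [word], tag[2:]
        elif tag.startswith("I-") and words and tag[2:] == label:
            # An I- tag of the same type continues the open entity.
            words.append(word)
        else:
            # O tags (and stray I- tags, which are ignored) close any open entity.
            if words:
                entities.append((" ".join(words), label))
            words, label = [], None
    if words:
        entities.append((" ".join(words), label))
    return entities


# First tokens of the sample sentence shown above.
sample = """Bu O
törene O
Usher B-ORG
, O
Mariah B-PER
Carey I-PER
, O
"""

sentences = read_bio(sample.splitlines())
entities = extract_entities(sentences[0])
print(entities)
```

To process a whole split, pass an open file handle to `read_bio` in place of `sample.splitlines()` and call `extract_entities` on each sentence.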
@inproceedings{pan-etal-2017-cross,
    title = "Cross-lingual Name Tagging and Linking for 282 Languages",
    author = "Pan, Xiaoman and
      Zhang, Boliang and
      May, Jonathan and
      Nothman, Joel and
      Knight, Kevin and
      Ji, Heng",
    booktitle = "Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = jul,
    year = "2017",
    address = "Vancouver, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/P17-1178",
    doi = "10.18653/v1/P17-1178",
    pages = "1946--1958"
}
Uploaded and documented by Ali Safaya: alisafaya gmail com