Turkish Twitter dataset for Offensive Speech Identification in Social Media.
The original dataset of OffensEval2020 is presented in five languages. We share only the Turkish subset.
This dataset was collected from Twitter, where the tweets are annotated for offensive speech with offensive or non-offensive labels (Çöltekin, 2020).
There is no validation split provided in the original source of this dataset. Hence, we create our own split from the original training split.
For details on the annotation guidelines and inter-annotator agreement rates see the original paper Çöltekin, 2020.
We share the dataset using .jsonlines
format with UTF-8 encoding.
{ "text" : "buralara değil yaz günü, kışın bile kar yağmıyor", "label" : "not-offensive" }
Offenseval | |
---|---|
Avg. #words | 8.5 |
#Classes | 2 |
Training | 28,000 |
Validation | 3277 |
Test | 3515 |
Total | 34792 |
@inproceedings{coltekin-2020-corpus,
title = "A Corpus of {T}urkish Offensive Language on Social Media",
author = {{\c{C}}{\"o}ltekin, {\c{C}}a{\u{g}}r{\i}},
booktitle = "Proceedings of the 12th Language Resources and Evaluation Conference",
month = may,
year = "2020",
address = "Marseille, France",
publisher = "European Language Resources Association",
url = "https://aclanthology.org/2020.lrec-1.758",
pages = "6174--6184"
}
Uploaded and documented by Ali Safaya: alisafaya gmail com