Skip to content

Latest commit

 

History

History
53 lines (37 loc) · 1.91 KB

TDD-C-202109-CC-002.md

File metadata and controls

53 lines (37 loc) · 1.91 KB

Turkish Twitter dataset for Offensive Speech Identification in Social Media.

The original dataset of OffensEval2020 is presented in five languages. We share only the Turkish subset.

Dataset Curation

This dataset was collected from Twitter, where the tweets are annotated for offensive speech with offensive or non-offensive labels (Çöltekin, 2020).

There is no validation split provided in the original source of this dataset. Hence, we create our own split from the original training split.

Annotation Quality

For details on the annotation guidelines and inter-annotator agreement rates see the original paper Çöltekin, 2020.

Dataset Format

We share the dataset using .jsonlines format with UTF-8 encoding.

{ "text" : "buralara değil yaz günü, kışın bile kar yağmıyor", "label" : "not-offensive" }

Dataset Statistics

Offenseval
Avg. #words 8.5
#Classes 2
Training 28,000
Validation 3277
Test 3515
Total 34792

Citation

@inproceedings{coltekin-2020-corpus,
    title = "A Corpus of {T}urkish Offensive Language on Social Media",
    author = {{\c{C}}{\"o}ltekin, {\c{C}}a{\u{g}}r{\i}},
    booktitle = "Proceedings of the 12th Language Resources and Evaluation Conference",
    month = may,
    year = "2020",
    address = "Marseille, France",
    publisher = "European Language Resources Association",
    url = "https://aclanthology.org/2020.lrec-1.758",
    pages = "6174--6184"
}

Contact

Uploaded and documented by Ali Safaya: alisafaya gmail com