Turkish Twitter dataset for Offensive Speech Identification in Social Media.

The original OffensEval 2020 dataset covers five languages. We share only the Turkish subset.

Dataset Curation

This dataset was collected from Twitter. Each tweet is annotated as either offensive or non-offensive (Çöltekin, 2020).

The original source of this dataset provides no validation split, so we created our own from the original training split.
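The exact procedure used to carve out the validation split is not described here. As an illustration only, the following minimal sketch shows one common approach, a seeded random hold-out. The function name `hold_out_split`, the seed, and the toy records are assumptions made for this example and do not come from the original source.

```python
import random

def hold_out_split(examples, n_validation, seed=0):
    """Shuffle a copy of `examples` and hold out `n_validation` items.

    Hypothetical helper: a plain seeded random hold-out, not necessarily
    the procedure that produced the released validation split.
    """
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    validation = shuffled[:n_validation]
    train = shuffled[n_validation:]
    return train, validation

# Toy records standing in for the original training split.
toy = [{"text": f"tweet {i}", "label": "not-offensive"} for i in range(10)]
train, validation = hold_out_split(toy, n_validation=2)
```

A fixed seed keeps the split reproducible. A stratified split that preserves the label ratio would be a reasonable alternative when the classes are imbalanced.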

Annotation Quality

For details on the annotation guidelines and inter-annotator agreement rates, see the original paper (Çöltekin, 2020).

Dataset Format

We share the dataset in JSON Lines (.jsonlines) format with UTF-8 encoding. Each line is a JSON object with a "text" field and a "label" field:

{ "text" : "buralara değil yaz günü, kışın bile kar yağmıyor", "label" : "not-offensive" }
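A minimal sketch of reading this format with the Python standard library is shown here. It assumes one JSON object per line with the "text" and "label" keys from the example record. The helper name `read_jsonl` and the file name mentioned in the comment are hypothetical.

```python
import io
import json
from collections import Counter

def read_jsonl(stream):
    """Yield one dict per non-empty line of a JSON Lines stream."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)

# In practice the stream would be a file opened with UTF-8 encoding, e.g.
# open("train.jsonlines", encoding="utf-8")  (hypothetical file name).
# An in-memory stream holding the example record keeps this self-contained.
sample = io.StringIO(
    '{ "text" : "buralara değil yaz günü, kışın bile kar yağmıyor", '
    '"label" : "not-offensive" }\n'
)

records = list(read_jsonl(sample))

# Count how many records carry each label.
label_counts = Counter(record["label"] for record in records)
```

Opening the file with an explicit `encoding="utf-8"` matters here because the tweets contain Turkish characters such as ğ, ı, and ş.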

Dataset Statistics

	OffensEval
Avg. #words	8.5
#Classes	2
Training	28,000
Validation	3,277
Test	3,515
Total	34,792

Citation

@inproceedings{coltekin-2020-corpus,
    title = "A Corpus of {T}urkish Offensive Language on Social Media",
    author = {{\c{C}}{\"o}ltekin, {\c{C}}a{\u{g}}r{\i}},
    booktitle = "Proceedings of the 12th Language Resources and Evaluation Conference",
    month = may,
    year = "2020",
    address = "Marseille, France",
    publisher = "European Language Resources Association",
    url = "https://aclanthology.org/2020.lrec-1.758",
    pages = "6174--6184"
}

Contact

Uploaded and documented by Ali Safaya: alisafaya gmail com
