
Detecting toxicity in news articles - a study of news in the Bulgarian language

Motivation

This repository hosts the feature generation and experiment evaluation system. The generated results are used in research presented at the RANLP'19 conference.

Reproducing experiments

If you plan to build on this research or to reproduce the experiments, please check the Quick Start guide in docs.

Dataset

The dataset contains 221 articles manually labelled by Krasimir Gadjokov between 2011 and 2017, as well as 96 non-toxic articles fetched from credible Bulgarian news outlets in 2019. To incorporate additional features, we use the Google translation API, so each article is available in both English and Bulgarian.
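
As a rough illustration of the translation step, the sketch below uses the `google-cloud-translate` Python client; the package choice, the `translate_article` helper, and the credentials setup are illustrative assumptions rather than the exact tooling used to build the dataset.

```python
# Minimal sketch: translating a Bulgarian article to English with the
# Google Cloud Translation API (v2 client). Assumes the
# `google-cloud-translate` package is installed and that
# GOOGLE_APPLICATION_CREDENTIALS points at a valid service account key.
from google.cloud import translate_v2 as translate

def translate_article(title: str, text: str) -> dict:
    """Return English translations of a Bulgarian title/text pair."""
    client = translate.Client()
    result = {}
    for field, value in (("title", title), ("text", text)):
        response = client.translate(
            value, source_language="bg", target_language="en"
        )
        result[field] = response["translatedText"]
    return result

if __name__ == "__main__":
    sample = translate_article("Заглавие на статия", "Текст на статията ...")
    print(sample["title"], sample["text"], sep="\n")
```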

Toxicity categories are as follows (examples are in Bulgarian):

| Category | Example |
|----------|---------|
| fake news (фалшиви новини) | click here |
| defamation (клевета) | click here |
| sensation (сензация) | click here |
| hate speech (реч на омраза) | click here |
| delusion (заблуда) | click here |
| conspiracy (конспирация) | click here |
| anti-democratic (анти-демократичен) | click here |
| pro-authoritarian (про-авторитарен) | click here |
| non-toxic (нетоксичен) | click here |

Labels' source of truth: https://mediascan.gadjokov.com/

The dataset can be downloaded from here.

Detailed information about the dataset can be found in docs.

Features

We have generated the following feature sets for both English and Bulgarian; the Title and Text columns give the number of features each set contributes (a rough extraction sketch follows the table):

| Language  | Feature set | Title | Text |
|-----------|-------------|-------|------|
| Bulgarian | BERT        | 768   | 768  |
| Bulgarian | LSA         | 15    | 200  |
| Bulgarian | Stylometry  | 19    | 6    |
| Bulgarian | XLM         | 1024  | 1024 |
| English   | BERT        | 768   | 768  |
| English   | ELMO        | 1024  | 1024 |
| English   | NELA        | 129   | 129  |
| English   | USE         | 512   | 512  |
| -         | Media       | 6     | -    |
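
For illustration, the sketch below shows how a 768-dimensional BERT feature vector for a title or article body might be obtained. It assumes the Hugging Face `transformers` library and a multilingual BERT checkpoint; the `bert_features` helper is hypothetical and may differ from the project's actual pipeline.

```python
# Minimal sketch: extracting a 768-dimensional BERT embedding for a piece
# of text (title or body) via mean pooling over token representations.
# Assumes the Hugging Face `transformers` library and a BERT-base model.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "bert-base-multilingual-cased"  # assumption: any BERT-base checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)
model.eval()

def bert_features(text: str) -> torch.Tensor:
    """Return a 768-dimensional mean-pooled BERT representation of `text`."""
    inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, seq_len, 768)
    return hidden.mean(dim=1).squeeze(0)            # (768,)

# Example: one vector for the title, one for the article text.
title_vec = bert_features("Заглавие на статия")
text_vec = bert_features("Пълният текст на статията ...")
print(title_vec.shape, text_vec.shape)  # torch.Size([768]) torch.Size([768])
```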

Experiments

We have conducted experiments by combining different feature sets, as well as introducing a meta classification step, where the meta classifier is trained on the posterior probabilities produced by the other experiments. For each experiment setup we use a fine-tuned LogisticRegression model. The reported results are averaged over a 5-fold experiment split (a rough sketch of this protocol follows the results table).

| Language  | Feature set              | Accuracy | F1-macro |
|-----------|--------------------------|----------|----------|
| -         | Baseline                 | 30.30    | 05.17    |
| Bulgarian | BERT(title), BERT(text)  | 47.69    | 32.58    |
| Bulgarian | XLM(title), XLM(text)    | 38.50    | 24.58    |
| Bulgarian | Styl(title), Styl(text)  | 31.89    | 08.51    |
| Bulgarian | LSA(title), LSA(text)    | 55.59    | 42.11    |
| Bulgarian | Bulgarian combined       | 39.43    | 24.38    |
| English   | USE(title), USE(text)    | 53.70    | 40.68    |
| English   | NELA(title), NELA(text)  | 36.36    | 23.04    |
| English   | BERT(title), BERT(text)  | 52.05    | 39.78    |
| English   | ELMO(title), ELMO(text)  | 54.60    | 40.95    |
| English   | English combined         | 42.04    | 15.64    |
| -         | Media meta               | 42.04    | 15.64    |
| -         | All combined             | 38.16    | 26.04    |
| -         | Meta classifier          | 59.06    | 39.70    |
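
The protocol above could be approximated with scikit-learn as sketched below; the `evaluate` and `meta_classifier_scores` helpers, the `feature_sets` placeholder, and all hyperparameters are illustrative assumptions, not the exact configuration behind the table.

```python
# Minimal sketch of the evaluation protocol: a LogisticRegression per
# feature set, 5-fold out-of-fold posterior probabilities, and a meta
# classifier trained on the concatenated posteriors. The feature matrices
# in `feature_sets` and the labels `y` are placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict, cross_val_score

def evaluate(X: np.ndarray, y: np.ndarray):
    """Accuracy and macro-F1 of a LogisticRegression, averaged over 5 folds."""
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    clf = LogisticRegression(max_iter=1000)
    acc = cross_val_score(clf, X, y, cv=cv, scoring="accuracy").mean()
    f1 = cross_val_score(clf, X, y, cv=cv, scoring="f1_macro").mean()
    return acc, f1

def meta_classifier_scores(feature_sets: dict, y: np.ndarray):
    """Stack out-of-fold posterior probabilities and evaluate a meta model."""
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    posteriors = []
    for name, X in feature_sets.items():
        clf = LogisticRegression(max_iter=1000)
        # Out-of-fold class probabilities for this feature set.
        proba = cross_val_predict(clf, X, y, cv=cv, method="predict_proba")
        posteriors.append(proba)
    meta_X = np.hstack(posteriors)
    return evaluate(meta_X, y)
```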

References

Please cite [1] if you find the resources in this repository useful.

[1] Y. Dinkov, I. Koychev, P. Nakov. Detecting Toxicity in News Articles: Application to Bulgarian. arXiv preprint arXiv:1908.09785, 2019.

@article{dinkov2019detecting,
  title={Detecting Toxicity in News Articles: Application to Bulgarian},
  author={Dinkov, Yoan and Koychev, Ivan and Nakov, Preslav},
  journal={arXiv preprint arXiv:1908.09785},
  year={2019}
}

Acknowledgements

This research is part of the Tanbih project, which aims to limit the effect of "fake news", propaganda and media bias by making users aware of what they are reading. The project is developed in collaboration between the Qatar Computing Research Institute (QCRI), HBKU and the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL).

This research is also partially supported by Project UNITe BG05M2OP001-1.001-0004 funded by the OP "Science and Education for Smart Growth" and the EU via the ESI Funds.