An annotated corpus of anonymized WhatsApp messages from public groups in Brazilian Portuguese (PT-BR), for automatic detection of textual misinformation and malicious users. For details on how the corpus was built and evaluated, see our paper published at the ICEIS 2021 conference:
Cabral, Lucas, et al. "Fakewhastapp. br: NLP and machine learning techniques for misinformation detection in brazilian portuguese whatsapp messages." Proceedings of the 23rd International Conference on Enterprise Information Systems, ICEIS. 2021.
If you use our corpus, please cite the corresponding paper. For further discussion and experiments, see my master's thesis (in Portuguese): https://repositorio.ufc.br/handle/riufc/63379
The data collected during the 2018 Brazilian presidential elections is located at:
data/2018/fakeWhatsApp.BR_2018.csv
The data is stored in a CSV file, where each line is a message sent in a public group. The dictionary of variables is the following:
id
: unique ID of a user
date
: day of the year on which the message was sent
ddi
: international dialing code (DDI)
country
: country associated with the DDI
country_iso3
: ISO3 (three-letter) code of the country
ddd
: regional Brazilian telephone code (DDD)
state
: Brazilian state
midia
: boolean variable indicating whether the message is a media file (1) or not (0)
url
: boolean variable indicating whether the message contains a URL (1) or not (0)
characters
: number of characters in the message text
words
: number of words in the message text
viral
: boolean variable indicating whether a message with exactly the same text and more than 5 words appears in the corpus (1) or not (0). The viral messages were the ones manually labelled.
shares
: number of times a message with exactly the same text appears in the corpus
text
: textual content of the message
misinformation
: manually assigned label indicating whether the message contains misinformation (1) or not (0). The value -1 means that the message was not labelled.
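As an illustration of how the dictionary above might be used, the following minimal sketch reads the corpus and keeps only the manually labelled messages. It assumes a comma-separated file with a header row that uses the column names listed above; the helper names `load_labelled` and `label_counts` are hypothetical and not part of this repository.

```python
import csv
from collections import Counter


def load_labelled(source):
    """Return the rows whose `misinformation` label is 0 or 1.

    `source` is any iterable of CSV lines, such as an open file.
    Assumption: comma-separated values with a header row that uses the
    column names from the data dictionary.
    """
    labelled = []
    for row in csv.DictReader(source):
        # Go through float first in case the label is stored as "1.0".
        label = int(float(row["misinformation"]))
        if label == -1:  # -1 marks messages that were not labelled
            continue
        row["misinformation"] = label
        labelled.append(row)
    return labelled


def label_counts(rows):
    """Count how many labelled rows fall into each class."""
    return Counter(row["misinformation"] for row in rows)
```

With the repository checked out, the file can be opened with `open("data/2018/fakeWhatsApp.BR_2018.csv", newline="", encoding="utf-8")` and passed to `load_labelled`. The UTF-8 encoding is an assumption and may need adjusting.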
- 1 - parser.ipynb
  This notebook parses the data collected in WhatsApp groups, converting it from free-text format to structured data in a CSV table.
- 2 - labeling and anonymization.ipynb
  In this notebook we transfer the labels manually annotated on the viral messages to the entire corpus and remove personal data, such as phone numbers, present in the text.
- 3 - exploratory analysis.ipynb
  Exploration and visualization of the data set.
- 4 - compare corpora.ipynb
  Comparison with a Twitter fake news corpus to demonstrate the need for a corpus of WhatsApp texts.
- 5 - misinformation detection ml.ipynb
  Experiments with classical machine learning models to classify textual misinformation.
- 6 - deep learning char level cnn.ipynb
  Experiments with a character-level convolutional neural network to classify textual misinformation.
- 7 - user features.ipynb
  Exploiting user features to detect misinformation.
- 8 - user classification.ipynb
  Experiments classifying users as superspreaders.
- 9 - automatic dataset expansion.ipynb
  Experiments with automatic expansion of the dataset using cosine similarity.
- 10 - user credibility.ipynb
  Modeling user credibility to improve misinformation detection.
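
To give a rough idea of what expanding a dataset with cosine similarity can look like, here is a minimal sketch that copies a label from the most similar labelled message to an unlabelled one. It is not the code from notebook 9: the whitespace bag-of-words vectors, the `0.8` threshold, and the function names are all assumptions made for illustration.

```python
import math
from collections import Counter


def bow(text):
    """Bag-of-words vector: lowercased, whitespace-separated token counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def propagate_labels(labelled, unlabelled, threshold=0.8):
    """Assign each unlabelled text the label of its nearest labelled text.

    `labelled` is a list of (text, label) pairs and `unlabelled` a list of
    texts. A text is only labelled when its best similarity reaches
    `threshold` (a hypothetical value, to be tuned on held-out data).
    """
    vectors = [(bow(text), label) for text, label in labelled]
    expanded = []
    for text in unlabelled:
        vec = bow(text)
        best_score, best_label = 0.0, None
        for ref_vec, label in vectors:
            score = cosine(vec, ref_vec)
            if score > best_score:
                best_score, best_label = score, label
        if best_label is not None and best_score >= threshold:
            expanded.append((text, best_label))
    return expanded
```

A high threshold keeps the expanded set close to near-duplicates of labelled messages, while a lower one adds more data at the cost of noisier labels.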