Code for paper "Author Profiling for Abuse Detection", in Proceedings of the 27th International Conference on Computational Linguistics (COLING) 2018
If you use this code, please cite our paper:
@inproceedings{mishra-etal-2018-author,
title = "Author Profiling for Abuse Detection",
author = "Mishra, Pushkar and
Del Tredici, Marco and
Yannakoudakis, Helen and
Shutova, Ekaterina",
booktitle = "Proceedings of the 27th International Conference on Computational Linguistics",
month = aug,
year = "2018",
address = "Santa Fe, New Mexico, USA",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/C18-1093",
pages = "1088--1098",
}
Python 3.5+ is required to run the code. Install the dependencies with
pip install -r requirements.txt
and then download the NLTK punkt tokenizer data with
python -m nltk.downloader punkt
The dataset is provided in TwitterData/twitter_data_waseem_hovy.csv as a list of [tweet ID, annotation] pairs. Before running the code, use a Twitter API client (twitter_access.py employs Tweepy) to retrieve the tweets for the given tweet IDs, then replace the dataset file with a file of the same name that contains [tweet ID, tweet, annotation] triples.
twitter_access.py also contains functions to retrieve the follower-following relationships amongst the authors of the tweets (listed in resources/authors.txt). Once the relationships have been retrieved, use Node2vec (see resources/node2vec) to produce an embedding for each author and store the embeddings in a file named authors.emb in the resources directory.
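As a rough illustration of the dataset replacement step, the sketch below turns [tweet ID, annotation] pairs into [tweet ID, tweet, annotation] triples. It is not part of the repository. It assumes the CSV is comma-separated with no header row and with columns in the order shown, and that the retrieved tweets are available in a hypothetical dict named fetched that maps tweet ID strings to tweet text. Tweets that could not be retrieved are skipped.

```python
import csv


def build_triples(pairs_path, fetched, out_path):
    """Write [tweet ID, tweet, annotation] rows for every ID whose text is known.

    `fetched` is a hypothetical dict {tweet_id_string: tweet_text}; how it is
    filled (for example with Tweepy) is left to the caller.
    """
    kept, skipped = 0, 0
    with open(pairs_path, newline="", encoding="utf-8") as src, \
            open(out_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.writer(dst)
        for row in csv.reader(src):
            if len(row) < 2:
                continue  # ignore blank or malformed lines
            tweet_id, annotation = row[0].strip(), row[1].strip()
            text = fetched.get(tweet_id)
            if text is None:
                skipped += 1  # tweet could not be retrieved
                continue
            writer.writerow([tweet_id, text, annotation])
            kept += 1
    return kept, skipped
```

Writing to a separate output path and renaming it afterwards keeps the original pairs file intact until the result has been checked.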
To run the best method (LR + AUTH):
python twitter_model.py -c 16202 -m lna
To run the other methods:
- AUTH:
python twitter_model.py -c 16202 -m a
- LR:
python twitter_model.py -c 16202 -m ln
- WS:
python twitter_model.py -c 16202 -m ws
- HS:
python twitter_model.py -c 16202 -m hs
- WS + AUTH:
python twitter_model.py -c 16202 -m wsa
- HS + AUTH:
python twitter_model.py -c 16202 -m hsa
For the HS and WS based methods, adding the -ft flag to the command ensures that the pre-trained deep neural models from the Models directory are not used and that all training happens from scratch. This requires the pre-trained GloVe embeddings to be downloaded from http://nlp.stanford.edu/data/glove.twitter.27B.zip, unzipped and placed in the resources directory before execution.
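For reference, GloVe text files store one token per line followed by its space-separated vector components. The sketch below reads such a file into a dict. It is an illustration only, not the loader used by twitter_model.py; the function name load_glove and the optional vocabulary filter are assumptions made for this example.

```python
def load_glove(path, vocabulary=None):
    """Parse a GloVe text file into a dict {token: list of floats}.

    If `vocabulary` (a set of tokens) is given, only those tokens are kept,
    which keeps memory use low for large embedding files.
    """
    vectors = {}
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            parts = line.rstrip("\n").split(" ")
            word, values = parts[0], parts[1:]
            if not values:
                continue  # skip empty or malformed lines
            if vocabulary is not None and word not in vocabulary:
                continue
            vectors[word] = [float(v) for v in values]
    return vectors
```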
An overview of the complete training-testing flow is as follows:
1. For each tweet in the dataset, its author's identity is obtained using functions available in the twitter_access.py file. For each author, information about which other authors from the dataset follow them on Twitter is also obtained in order to create a community graph where nodes are authors and edges denote follow relationships.
2. Node2vec is applied to the community graph to generate embeddings for the nodes, i.e., the authors. These author embeddings are saved to the authors.emb file in the resources directory.
3. The dataset is randomly split into a train set and a test set.
4. Tweets in the train set are used to produce an n-gram count based model or a deep neural model, depending on the method being used.
5. A feature extractor is instantiated that uses the model from step 4 along with the author embeddings from step 2 to convert tweets to feature vectors.
6. An LR/GBDT classifier is trained using the feature vectors extracted for the tweets in the train set. A part of the train set is held out as validation data to prevent over-fitting.
7. The trained classifier is made to predict classes for tweets in the test set, and precision, recall and F1 are calculated.
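The community graph, author embedding and feature extraction steps above could look roughly like the sketch below. It is not the repository's code and rests on several assumptions: that Node2vec reads a whitespace-separated edge list of integer node IDs and writes a word2vec-style text file (a "count dim" header line, then one "id v1 ... vd" line per node), that the author embedding is simply concatenated to the text features, and that authors without an embedding get a zero vector. All function names are hypothetical.

```python
def write_edge_list(follows, path):
    """Write one 'follower_id followed_id' line per (follower, followed) pair.

    Authors are mapped to consecutive integers; the mapping is returned so
    that embeddings can be matched back to authors later.
    """
    ids = {}
    with open(path, "w", encoding="utf-8") as out:
        for follower, followed in follows:
            for author in (follower, followed):
                ids.setdefault(author, len(ids))
            out.write("{} {}\n".format(ids[follower], ids[followed]))
    return ids


def read_embeddings(path):
    """Read a word2vec-style text file into {node_id_string: list of floats}."""
    embeddings = {}
    with open(path, encoding="utf-8") as handle:
        _count, dim = (int(x) for x in handle.readline().split())
        for line in handle:
            parts = line.split()
            if len(parts) != dim + 1:
                continue  # skip blank or malformed lines
            embeddings[parts[0]] = [float(v) for v in parts[1:]]
    return embeddings


def tweet_features(text_vector, node_id, embeddings, dim):
    """Concatenate text features with the author's embedding (zeros if unknown)."""
    return list(text_vector) + embeddings.get(node_id, [0.0] * dim)
```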
In the 10-fold CV, steps 3-7 are run 10 times (each time with a different set of tweets as the test set) and the final precision, recall and F1 are calculated by averaging results from across the 10 runs.
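The averaging in the 10-fold CV can be sketched as follows. This is a minimal illustration, not the evaluation code in twitter_model.py: it scores a single positive class, uses unstratified random folds, and takes a hypothetical train_and_predict callback standing in for the model building, feature extraction and classifier training. How the repository forms folds and aggregates scores over classes is not specified here.

```python
import random


def k_fold_indices(n_items, k=10, seed=0):
    """Shuffle item indices and deal them into k disjoint folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def precision_recall_f1(gold, predicted, positive):
    """Compute precision, recall and F1 for one class label."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def cross_validate(items, labels, train_and_predict, positive, k=10, seed=0):
    """Run k train/test rounds and average precision, recall and F1."""
    totals = [0.0, 0.0, 0.0]
    for test_idx in k_fold_indices(len(items), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(items)) if i not in held_out]
        # train_and_predict is a hypothetical callback: it trains on the
        # first two arguments and returns predictions for the third.
        predicted = train_and_predict(
            [items[i] for i in train_idx],
            [labels[i] for i in train_idx],
            [items[i] for i in test_idx])
        gold = [labels[i] for i in test_idx]
        scores = precision_recall_f1(gold, predicted, positive)
        totals = [t + s for t, s in zip(totals, scores)]
    return tuple(t / k for t in totals)
```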