Contributions
- We conduct a comprehensive analysis of the role of preprocessing techniques in affective tasks (sentiment analysis, emotion classification, and sarcasm detection), employing different training models over the ten classification datasets listed below.
- We perform a comparative analysis of the classification accuracy of word embedding models when preprocessing is applied at the training phase (training corpora), at the downstream task phase (classification datasets), or at both.
- We evaluate the performance of our best preprocessed word vector model against state-of-the-art pretrained word embedding models.
List of Preprocessing Factors (a code sketch follows the list):
- Punctuation (Punc)
- Spelling correction (Spell)
- Stemming (Stem)
- Stopword removal (Stop)
- Negation (Neg)
- Part-of-speech tagging (POS)
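
The snippet below sketches how these factors can be implemented with NLTK. It is a minimal illustration, not the paper's exact pipeline: the negation-marking rule (appending `_NEG` to tokens that follow a cue word) and the cue-word list are assumptions, and spelling correction is only noted in a comment since it is typically delegated to a dedicated library (e.g., pyspellchecker).

```python
# Minimal sketch of the preprocessing factors using NLTK.
# The negation rule and cue list are illustrative assumptions,
# not the paper's exact implementation.
import string

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

STEMMER = PorterStemmer()
STOPWORDS = set(stopwords.words("english"))
NEGATORS = {"not", "no", "never", "n't"}  # assumed negation cue list

def remove_punctuation(tokens):  # Punc
    return [t for t in tokens if t not in string.punctuation]

# Spell: spelling correction is usually delegated to a library such as
# pyspellchecker; omitted here to keep the sketch dependency-light.

def stem(tokens):  # Stem
    return [STEMMER.stem(t) for t in tokens]

def remove_stopwords(tokens):  # Stop
    return [t for t in tokens if t.lower() not in STOPWORDS]

def mark_negation(tokens):  # Neg: tag tokens that follow a negation cue
    out, negated = [], False
    for t in tokens:
        if t.lower() in NEGATORS:
            negated = True
            out.append(t)
        elif t in string.punctuation:
            negated = False  # a clause boundary ends the negated scope
            out.append(t)
        else:
            out.append(t + "_NEG" if negated else t)
    return out

def tag_pos(tokens):  # POS: append each token's part-of-speech tag
    return [f"{t}_{tag}" for t, tag in nltk.pos_tag(tokens)]

tokens = nltk.word_tokenize("This movie wasn't good, not at all!")
print(mark_negation(remove_punctuation(tokens)))
```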
Training Models (a usage sketch follows the list):
- Word2Vec (Skip-gram)
- Word2Vec (CBOW)
- BERT (Feature-based Approach)
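
A rough sketch of how these three models are commonly trained or used as feature extractors: gensim for Word2Vec and Hugging Face transformers for frozen BERT features. The hyperparameters below are placeholders, not the paper's settings.

```python
# Sketch: Word2Vec (Skip-gram / CBOW) via gensim, and feature-based BERT
# via Hugging Face transformers. Hyperparameters are placeholders.
import torch
from gensim.models import Word2Vec
from transformers import BertModel, BertTokenizer

sentences = [["this", "movie", "was", "great"],
             ["the", "plot", "felt", "flat"]]

# sg=1 selects Skip-gram; sg=0 selects CBOW.
skipgram = Word2Vec(sentences, vector_size=300, window=5, min_count=1, sg=1)
cbow = Word2Vec(sentences, vector_size=300, window=5, min_count=1, sg=0)
print(skipgram.wv["movie"].shape)  # (300,)

# Feature-based approach: run a frozen BERT encoder and use its hidden
# states as contextual word features (no fine-tuning).
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased").eval()
with torch.no_grad():
    batch = tokenizer("this movie was great", return_tensors="pt")
    features = bert(**batch).last_hidden_state  # shape: (1, seq_len, 768)
print(features.shape)
```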
Datasets used for Training Word Embeddings:
- News: https://www.kaggle.com/snapcrack/all-the-news
- Wikipedia: https://www.kaggle.com/jkkphys/english-wikipedia-articles-20170820-sqlite
Datasets used in Downstream Classification Tasks (an evaluation sketch follows the list):
a) Sentiment Analysis:
- IMDB: http://ai.stanford.edu/~amaas/data/sentiment/
- SemEval-2016: http://alt.qcri.org/semeval2016/task4/index.php
- Airlines: https://www.kaggle.com/crowdflower/twitter-airline-sentiment
- SST-5: https://nlp.stanford.edu/sentiment/index.html
b) Emotion Detection:
- SSEC: http://www.romanklinger.de/ssec/
- ISEAR: https://github.com/PoorvaRane/Emotion-Detector/blob/master/ISEAR.csv
- Alm: http://people.rc.rit.edu/~coagla/affectdata/index.html
c) Sarcasm Detection:
- Onion: https://github.com/rishabhmisra/News-Headlines-Dataset-For-Sarcasm-Detection
- IAC: https://nlds.soe.ucsc.edu/sarcasm2
- Reddit: https://nlp.cs.princeton.edu/SARC/0.0/
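
One plausible evaluation setup for these downstream datasets is sketched below: average the word vectors of each document and fit a linear classifier, reporting accuracy. This is a common baseline recipe assumed here for illustration; the paper's own classifier and evaluation protocol may differ.

```python
# Sketch: downstream evaluation by averaging word vectors per document and
# fitting a linear classifier. A common baseline, assumed for illustration;
# not necessarily the paper's exact protocol.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def doc_vector(tokens, wv, dim=300):
    """Average the embeddings of in-vocabulary tokens (zeros if none)."""
    vecs = [wv[t] for t in tokens if t in wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def evaluate(docs, labels, wv):
    """docs: list of token lists; labels: class labels; wv: keyed vectors."""
    X = np.stack([doc_vector(d, wv) for d in docs])
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, labels, test_size=0.2, random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return accuracy_score(y_te, clf.predict(X_te))
```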
Reference: Nastaran Babanejad, Ameeta Agrawal, Aijun An, and Manos Papagelis. A Comprehensive Analysis of Preprocessing for Word Representation Learning in Affective Tasks. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL 2020).