Contributions
- We conduct a comprehensive analysis of the role of preprocessing techniques in affective tasks (sentiment analysis, emotion classification, and sarcasm detection), employing different training models over the ten classification datasets listed below.
- We perform a comparative analysis of the classification accuracy of word embedding models when preprocessing is applied at the training phase (training corpora), at the downstream task phase (classification datasets), or at both.
- We evaluate the performance of our best preprocessed word vector model against state-of-the-art pretrained word embedding models.
List of Preprocessing Factors (a code sketch follows the list):
- Punctuation (Punc)
- Spelling correction (Spell)
- Stemming (Stem)
- Stopword removal (Stop)
- Negation (Neg)
- Part-of-speech tagging (POS)
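
The snippet below sketches how these factors can be implemented with NLTK. It is a minimal illustration, not the paper's exact pipeline: the negation-marking rule (appending `_NEG` to tokens that follow a cue word) and the cue-word list are assumptions, and spelling correction is only noted in a comment since it is typically delegated to a dedicated library (e.g., pyspellchecker).

```python
# Minimal sketch of the preprocessing factors using NLTK.
# The negation rule and cue list are illustrative assumptions,
# not the paper's exact implementation.
import string

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

STEMMER = PorterStemmer()
STOPWORDS = set(stopwords.words("english"))
NEGATORS = {"not", "no", "never", "n't"}  # assumed negation cue list

def remove_punctuation(tokens):  # Punc
    return [t for t in tokens if t not in string.punctuation]

# Spell: spelling correction is usually delegated to a library such as
# pyspellchecker; omitted here to keep the sketch dependency-light.

def stem(tokens):  # Stem
    return [STEMMER.stem(t) for t in tokens]

def remove_stopwords(tokens):  # Stop
    return [t for t in tokens if t.lower() not in STOPWORDS]

def mark_negation(tokens):  # Neg: tag tokens that follow a negation cue
    out, negated = [], False
    for t in tokens:
        if t.lower() in NEGATORS:
            negated = True
            out.append(t)
        elif t in string.punctuation:
            negated = False  # a clause boundary ends the negated scope
            out.append(t)
        else:
            out.append(t + "_NEG" if negated else t)
    return out

def tag_pos(tokens):  # POS: append each token's part-of-speech tag
    return [f"{t}_{tag}" for t, tag in nltk.pos_tag(tokens)]

tokens = nltk.word_tokenize("This movie wasn't good, not at all!")
print(mark_negation(remove_punctuation(tokens)))
```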
Training Models (a usage sketch follows the list):
- Word2Vec (Skip-gram)
- Word2Vec (CBOW)
- BERT (Feature-based Approach)
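
A rough sketch of how these three models are commonly trained or used as feature extractors: gensim for Word2Vec and Hugging Face transformers for frozen BERT features. The hyperparameters below are placeholders, not the paper's settings.

```python
# Sketch: Word2Vec (Skip-gram / CBOW) via gensim, and feature-based BERT
# via Hugging Face transformers. Hyperparameters are placeholders.
import torch
from gensim.models import Word2Vec
from transformers import BertModel, BertTokenizer

sentences = [["this", "movie", "was", "great"],
             ["the", "plot", "felt", "flat"]]

# sg=1 selects Skip-gram; sg=0 selects CBOW.
skipgram = Word2Vec(sentences, vector_size=300, window=5, min_count=1, sg=1)
cbow = Word2Vec(sentences, vector_size=300, window=5, min_count=1, sg=0)
print(skipgram.wv["movie"].shape)  # (300,)

# Feature-based approach: run a frozen BERT encoder and use its hidden
# states as contextual word features (no fine-tuning).
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased").eval()
with torch.no_grad():
    batch = tokenizer("this movie was great", return_tensors="pt")
    features = bert(**batch).last_hidden_state  # shape: (1, seq_len, 768)
print(features.shape)
```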
Datasets used for Training Word Embeddings:
- News: https://www.kaggle.com/snapcrack/all-the-news
- Wikipedia: https://www.kaggle.com/jkkphys/english-wikipedia-articles-20170820-sqlite
Datasets used in Downstream Classification Tasks (an evaluation sketch follows the list):
a) Sentiment Analysis:
- IMDB: http://ai.stanford.edu/~amaas/data/sentiment/
- SemEval-2016: http://alt.qcri.org/semeval2016/task4/index.php
- Airlines: https://www.kaggle.com/crowdflower/twitter-airline-sentiment
- SST-5: https://nlp.stanford.edu/sentiment/index.html
b) Emotion Detection:
- SSEC: http://www.romanklinger.de/ssec/
- ISEAR: https://github.com/PoorvaRane/Emotion-Detector/blob/master/ISEAR.csv
- Alm: http://people.rc.rit.edu/~coagla/affectdata/index.html
c) Sarcasm Detection:
- Onion: https://github.com/rishabhmisra/News-Headlines-Dataset-For-Sarcasm-Detection
- IAC: https://nlds.soe.ucsc.edu/sarcasm2
- Reddit: https://nlp.cs.princeton.edu/SARC/0.0/
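
One plausible evaluation setup for these downstream datasets is sketched below: average the word vectors of each document and fit a linear classifier, reporting accuracy. This is a common baseline recipe assumed here for illustration; the paper's own classifier and evaluation protocol may differ.

```python
# Sketch: downstream evaluation by averaging word vectors per document and
# fitting a linear classifier. A common baseline, assumed for illustration;
# not necessarily the paper's exact protocol.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def doc_vector(tokens, wv, dim=300):
    """Average the embeddings of in-vocabulary tokens (zeros if none)."""
    vecs = [wv[t] for t in tokens if t in wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def evaluate(docs, labels, wv):
    """docs: list of token lists; labels: class labels; wv: keyed vectors."""
    X = np.stack([doc_vector(d, wv) for d in docs])
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, labels, test_size=0.2, random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return accuracy_score(y_te, clf.predict(X_te))
```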
Reference: Nastaran Babanejad, Ameeta Agrawal, Aijun An, and Manos Papagelis. A Comprehensive Analysis of Preprocessing for Word Representation Learning in Affective Tasks. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL 2020).