A study of the impact of text cleaning on various text embedding methodologies

Sentiment Analysis

Overview

Sentiment analysis is a subfield of machine learning that identifies emotions within textual data. By employing sophisticated algorithms, it classifies text as positive, negative, or neutral, enabling valuable insights across industries. From enhancing customer experiences to gauging public opinion, sentiment analysis shapes decision-making in our data-driven world.

There are many off-the-shelf solutions for working with text data, especially for the English language. Unfortunately, those options are much more limited for Portuguese. To explore some possible solutions, this study applied a series of methodologies, both in data preprocessing and in text embedding, as described in the methodology section.

Objectives

Measure the impact of different text preprocessing methodologies for the Portuguese language, such as stop-word removal, lemmatization, and stemming, on different types of word embedding, from the simplest ones like bag of words to transformers like BERT.

Technologies Used

  • python 3.9.16
  • nltk 3.8.1
  • spacy 3.6.1
  • pandas 1.5.3
  • numpy 1.23.5
  • sklearn 1.2.2
  • lightgbm 4.0.0
  • matplotlib 3.7.1
  • seaborn 0.12.2
  • gensim 4.3.1
  • torch 2.0.1+cu118
  • transformers 4.32.1
  • sentence_transformers 2.2.2

About the Data

For this study, we used the B2W-Reviews01 dataset, an open corpus of product reviews. It contains more than 130k e-commerce customer reviews collected from the americanas.com website between January and May 2018. [1]

Methodology

This project was divided into two stages: Preprocessing and Vectorization.

For preprocessing, we applied several text-cleaning methodologies. We started with basic cleaning: lowercasing all words and removing punctuation, accents, and special characters. After that, we applied different methodologies for text normalization: for stemming, we used the nltk package, while for lemmatization, we used spacy. Finally, we combined some of these steps, resulting in six columns:

  • review_text_clean: lowercased, with punctuation, accents, and special characters removed
  • review_text_clean_stop: review_text_clean with stop words removed
  • review_text_clean_stem: review_text_clean stemmed
  • review_text_clean_stop_stem: review_text_clean with stop words removed and stemmed
  • review_text_clean_lemma: review_text_clean lemmatized
  • review_text_clean_stop_lemma: review_text_clean with stop words removed and lemmatized
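The basic cleaning step can be sketched with the Python standard library alone (the function name `clean_text` is illustrative, not the notebook's actual code):

```python
import re
import unicodedata

def clean_text(text: str) -> str:
    """Lowercase, strip accents, and drop punctuation/special characters."""
    text = text.lower()
    # Decompose accented characters (e.g. "a" + combining accent)
    # and discard the combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Keep only ASCII letters, digits, and whitespace.
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    # Collapse runs of whitespace.
    return re.sub(r"\s+", " ", text).strip()

print(clean_text("Ótimo produto, chegou rápido!!!"))  # otimo produto chegou rapido
```

From there, stemming could use nltk's Portuguese RSLP stemmer and lemmatization a spacy Portuguese pipeline, as the study does.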

This process can be reproduced using the Text Preprocessing notebook.

As can be seen in Figure 1, stemming was the action that most reduced the text vocabulary size:


Figure 1 - Vocabulary size
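The vocabulary sizes compared in Figure 1 can be computed with a simple sketch like this, assuming whitespace-tokenized, already-cleaned text:

```python
def vocab_size(texts):
    """Count distinct whitespace-separated tokens across a corpus."""
    vocab = set()
    for text in texts:
        vocab.update(text.split())
    return len(vocab)

corpus = ["otimo produto recomendo", "produto ruim nao recomendo"]
print(vocab_size(corpus))  # 5
```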

After text preprocessing, several text vectorization (embedding) methods were tested in each of the six text columns. First, we used sklearn to implement Bag of Words and TF-IDF. Afterward, we used gensim to implement Word2Vec (CBOW and Skip-gram), FastText, and Doc2Vec (DBOW and DM). Finally, two Portuguese fine-tuned pre-trained models were implemented: BERT neuralmind/bert-base-portuguese-cased[2] and Sentence Transformer rufimelo/bert-large-portuguese-cased-sts[3], both available on the HuggingFace website.
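A minimal sketch of the sklearn side of this stage, shown with TF-IDF on a toy corpus (the gensim and HuggingFace models follow the same fit-then-encode pattern, each producing one vector per review):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "otimo produto recomendo",
    "produto ruim nao recomendo",
    "entrega rapida produto otimo",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)  # sparse matrix: one row per review
print(X.shape)  # (3, vocabulary size)
```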

Most embedding models produce a vector for each word; in those cases, the word vectors were averaged (mean pooling), resulting in a single (1, n) vector per review.
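The pooling step can be sketched with numpy; the random word vectors here are stand-ins for a model's actual embeddings:

```python
import numpy as np

def mean_pool(word_vectors: list) -> np.ndarray:
    """Average per-word vectors into a single (1, n) document vector."""
    return np.stack(word_vectors).mean(axis=0, keepdims=True)

rng = np.random.default_rng(0)
words = [rng.standard_normal(300) for _ in range(4)]  # 4 words, 300-dim each
doc_vector = mean_pool(words)
print(doc_vector.shape)  # (1, 300)
```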

For text classification, we chose lightgbm due to its good accuracy, robustness, and speed. This part of the implementation is available in the Vectorization notebook.

Results and Conclusions

In this section, we compare all results on the test dataset.

In Table 1, we can see the score of each model for each text preprocessing method:


Table 1 - Overall Score

Comparing all preprocessing methods, lemmatization brought the best result on average (review_text_clean_lemma, ROC 0.97125). Surprisingly, removing stop words did more harm than good in all cases.

The next chart compares all vectorization methods:


Figure 2 - Vectorization methods comparison

Bag of Words, TF-IDF, and Word2Vec models showed similar results. FastText performed slightly worse but gave very consistent results, appearing largely indifferent to the choice of text preprocessing. Surprisingly, Doc2Vec performed worse than the others. Finally, the BERT sentence transformer obtained the best result, but with wide variation across the preprocessing methods.

In Figure 3, we rank the top 10 results. The BERT sentence transformer gave the two best results, followed in third place by TF-IDF with stemming. Comparing first place with tenth, there is less than one percentage point of difference, i.e., from 0.98493 to 0.97689.


Figure 3 - Top 10 best approaches

Finally, in Figure 4, we can compare the results of each model applied to each preprocessing method:


Figure 4 - Score comparison

As noted earlier, the BERT sentence transformer got the best result. Its best performance came from just the clean text (review_text_clean), without any additional preprocessing.

An item that must be taken into account is processing time, and here BERT performed worst (even using a GPU) compared to the other models. To process the six text variants, BERT spent 02:05:43 and obtained a maximum ROC of 0.984702, while TF-IDF spent 00:10:06 on the same amount of data and obtained a maximum ROC of 0.97907. This trade-off needs to be weighed when implementing it in production.
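Wall-clock comparisons like the one above can be collected with `time.perf_counter`; a sketch, with the lambda standing in for any of the embedding pipelines (hypothetical placeholder):

```python
import time

def timed(fn, *args):
    """Return (result, elapsed seconds) for a single call."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start

# Example: timing a stand-in tokenization pass over a toy corpus.
corpus = ["otimo produto chegou rapido"] * 1000
tokens, elapsed = timed(lambda texts: [t.split() for t in texts], corpus)
print(f"processed {len(tokens)} reviews in {elapsed:.4f}s")
```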

Future Improvements

Despite the relative performance differences between models, the results obtained were very good. Frederico Souza and João Filho also obtained good results using TF-IDF and Logistic Regression [4]. This is probably due to the quantity and quality of the available data. A way to check whether these results are consistent is to apply this same code to other datasets [5][6][7].

References
