Background

Natural language processing (NLP) is a subfield of computer science and artificial intelligence that uses machine learning to enable computers to understand and communicate with human language¹.

In this repository, we'll implement the following NLP tasks:

Sentiment Analysis
Entity Extraction
Text Summarization

Dataset

We'll use a publicly available dataset of Amazon food reviews.

The data dictionary is as follows:

Column Name	Description	Data Type
Id	Row ID	int64
ProductId	Unique identifier for Product	object
UserId	Unique identifier for User	object
ProfileName	Profile name of the user	object
HelpfulnessNumerator	Number of users who found the review helpful	int64
HelpfulnessDenominator	Number of users who indicated whether they found the review helpful or not	int64
Score	Rating between 1 and 5	int64
Time	Timestamp for the review	int64
Summary	Brief summary of the review	object
Text	Full review	object
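
As a rough sketch of how the data dictionary maps onto a pandas DataFrame, the snippet below parses two made-up rows that follow the column layout above. The sample rows, the `load_reviews` helper, and the `ReviewDate` column are illustrative assumptions, as is the reading of `Time` as Unix seconds.

```python
import io

import pandas as pd

# Two made-up rows that follow the column layout above (illustrative only,
# not taken from the real dataset).
SAMPLE_CSV = """Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
1,P001,U001,alice,1,1,5,1303862400,Great snack,These chips are great.
2,P002,U002,bob,0,2,2,1346976000,Too salty,Far too salty for me.
"""


def load_reviews(source) -> pd.DataFrame:
    """Read reviews from a CSV path or file-like object (hypothetical helper)."""
    df = pd.read_csv(source)
    # ASSUMPTION: Time holds Unix timestamps in seconds.
    df["ReviewDate"] = pd.to_datetime(df["Time"], unit="s")
    return df


reviews = load_reviews(io.StringIO(SAMPLE_CSV))
print(reviews.dtypes)
```

With the real data, `load_reviews` would be called with the path to the downloaded CSV file instead of the in-memory sample.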

EDA

See the notebook in the jupyter_notebooks directory. The main finding concerns the class balance of the review Score:

As the score distribution plot shows, a score of 5 is by far the most frequent.
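
The class balance can be checked with a simple frequency count. This is a minimal sketch on a toy `Score` column. The values are invented for illustration and are not the real distribution, and `score_distribution` is a hypothetical helper name.

```python
import pandas as pd


def score_distribution(scores: pd.Series) -> pd.DataFrame:
    """Return the count and relative share of each score, ordered by score."""
    counts = scores.value_counts().sort_index()
    return pd.DataFrame({"count": counts, "share": counts / counts.sum()})


# Toy scores, invented for illustration.
toy_scores = pd.Series([5, 5, 5, 5, 4, 3, 1, 5, 2, 5], name="Score")
dist = score_distribution(toy_scores)
print(dist)
```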

Preprocessing Pipeline

1. Balancing Data

As noted in the EDA, there is a class imbalance in the Score of the reviews, so we'll address it by:

Mapping the score from 1-5 to 0-2 (bad, neutral, and good respectively)
Removing duplicate reviews
Downsampling the category with the most reviews
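
The three steps above could be sketched with pandas as follows. Several details are assumptions rather than the repository's actual choices: the cut-offs in `map_score` (1-2 bad, 3 neutral, 4-5 good), treating reviews with identical `Text` as duplicates, downsampling the largest class to the size of the second largest, and the `Sentiment` column name.

```python
import pandas as pd


def map_score(score: int) -> int:
    """Map a 1-5 score to 0 (bad), 1 (neutral) or 2 (good).

    ASSUMPTION: the exact cut-offs are not stated in the text.
    """
    if score <= 2:
        return 0
    if score == 3:
        return 1
    return 2


def balance_reviews(df: pd.DataFrame, seed: int = 42) -> pd.DataFrame:
    out = df.copy()
    # Step 1: map the score to a three-class sentiment label.
    out["Sentiment"] = out["Score"].map(map_score)
    # Step 2: drop duplicate reviews (ASSUMPTION: same Text means duplicate).
    out = out.drop_duplicates(subset=["Text"])
    # Step 3: downsample the largest class to the size of the second largest.
    # This sketch expects at least two classes to be present.
    counts = out["Sentiment"].value_counts()
    largest = counts.idxmax()
    target = int(counts.drop(largest).max())
    majority = out[out["Sentiment"] == largest].sample(n=target, random_state=seed)
    rest = out[out["Sentiment"] != largest]
    return pd.concat([majority, rest]).sort_index()


# Toy reviews, invented for illustration.
toy_reviews = pd.DataFrame({
    "Score": [5, 5, 5, 5, 5, 5, 4, 1, 2, 3, 3],
    "Text": ["great a", "great b", "great c", "great d", "great e", "great a",
             "good f", "bad g", "bad h", "meh i", "meh j"],
})
balanced = balance_reviews(toy_reviews)
print(balanced["Sentiment"].value_counts())
```

Fixing `random_state` keeps the downsampled subset reproducible between runs.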

2. Text Cleaning

In this step we'll remove text that doesn't convey any meaningful information, such as:

HTML tags
URLs
Excessive whitespace

Note that at this point we're not removing any punctuation, numbers, or special symbols, so the text stays human-readable until the tokenization step.
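
A minimal sketch of this cleaning step, using only regular expressions from the standard library. The patterns and the `clean_text` name are assumptions for illustration. The tag pattern is deliberately simple and would also strip literal text such as `a < b > c`.

```python
import re

HTML_TAG = re.compile(r"<[^>]+>")           # e.g. <br />, <a href="...">
URL = re.compile(r"(?:https?://|www\.)\S+")  # http(s) links and bare www. links
WHITESPACE = re.compile(r"\s+")


def clean_text(text: str) -> str:
    """Remove HTML tags and URLs, then collapse runs of whitespace."""
    # Tags go first so that a URL directly followed by a tag is not glued to it.
    text = HTML_TAG.sub(" ", text)
    text = URL.sub(" ", text)
    # Punctuation, numbers and symbols are left untouched on purpose.
    return WHITESPACE.sub(" ", text).strip()


raw = "I love these chips!<br /><br />See   http://example.com/chips for  more.  Price: $3.99"
cleaned = clean_text(raw)
print(cleaned)
```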

3. Tokenization

We'll use the spaCy library to perform:

Tokenization
Stop word and punctuation removal
Lemmatization
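
The steps above could look roughly like this with spaCy. This is a sketch under assumptions: the `en_core_web_sm` model, lowercasing the lemmas, and the helper names are illustrative choices, not necessarily what the repository uses. The model has to be installed separately (for example with `python -m spacy download en_core_web_sm`).

```python
from typing import Iterable, List


def doc_to_lemmas(doc) -> List[str]:
    """Keep the lowercased lemma of every token that is not a stop word,
    punctuation, or whitespace."""
    return [
        tok.lemma_.lower()
        for tok in doc
        if not (tok.is_stop or tok.is_punct or tok.is_space)
    ]


def tokenize_reviews(texts: Iterable[str], model: str = "en_core_web_sm") -> List[List[str]]:
    """Run spaCy over the texts and return one list of lemmas per text."""
    import spacy  # imported here so doc_to_lemmas stays usable without spaCy

    # The parser and NER components are not needed for lemmatization.
    nlp = spacy.load(model, disable=["parser", "ner"])
    return [doc_to_lemmas(doc) for doc in nlp.pipe(texts)]
```

Calling `tokenize_reviews(["These chips were amazing!"])` returns a list containing one list of lemmas. The exact lemmas depend on the model version.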

Modelling

Sentiment Analysis

In this section we'll try different approaches to sentiment analysis, that is, predicting whether a review conveys positive, neutral, or negative sentiment. We'll implement and compare the following models:

Bag of words model with Count Vectorizer
TF-IDF
LSTM
Other pre-trained models
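
To illustrate how the first two approaches differ only in the vectorizer, here is a small scikit-learn sketch. The choice of `LogisticRegression` as classifier, the toy texts, and the label encoding (0 bad, 1 neutral, 2 good) are assumptions for illustration. The text above does not say which classifier is paired with each vectorizer.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


def build_models() -> dict:
    """Return one pipeline per vectorizer, sharing the same classifier type."""
    return {
        # Bag of words: raw token counts per document.
        "bag_of_words": make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000)),
        # TF-IDF: counts reweighted so that ubiquitous terms matter less.
        "tfidf": make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000)),
    }


# Toy training data, invented for illustration (0 = bad, 1 = neutral, 2 = good).
texts = [
    "terrible taste awful smell", "bad stale product",
    "okay average nothing special", "fine decent ordinary",
    "great delicious love", "excellent tasty wonderful",
]
labels = [0, 0, 1, 1, 2, 2]

models = build_models()
predictions = {}
for name, model in models.items():
    model.fit(texts, labels)
    predictions[name] = int(model.predict(["delicious and wonderful"])[0])

print(predictions)
```

On real data the two pipelines would be compared on a held-out split rather than on a single example.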

Work In Progress

Finalize selecting all the model evaluation metrics
Modelling with TF-IDF
Modelling with Pre-trained models

References

¹ https://www.ibm.com/topics/natural-language-processing
