Yelp Ratings Classification using Deep Learning

Problem Statement

As of 2020, 63.6% of consumers were likely to browse online reviews before visiting a restaurant, and 90% of respondents said that their buying decisions are strongly influenced by positive star ratings (Sun, 2022). This points to a need for effective machine learning algorithms that can analyse and categorise the large volume of online reviews and ratings, particularly restaurant reviews, to help customers make informed decisions and improve their dining experience.

Dataset Sources

The URL: https://www.yelp.com/dataset
Format: JSONL

Note: The file extracted from the downloaded zip archive must be renamed with a '.zip' extension (or the extension of another compressed file format) before the actual data can be extracted.

Data Preprocessing Steps

Stripping extra whitespace
Conversion to lowercase
Punctuation removal
Stop word removal
Duplicate removal
URL removal
Lemmatization
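
The steps listed above can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the notebook's actual code: the stop-word list is a tiny hypothetical one, the function names are made up for this example, URLs are removed first so that punctuation removal does not mangle them, and lemmatization is left out because it needs an external NLP library.

```python
import re
import string

# Hypothetical, deliberately tiny stop-word list; a real pipeline would use a
# fuller list from an NLP library.
STOP_WORDS = {"the", "a", "an", "is", "was", "and", "to", "of", "it", "this"}

URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_review(text):
    """Apply the listed cleaning steps (except lemmatization) to one review."""
    text = URL_PATTERN.sub(" ", text)    # remove URLs before touching punctuation
    text = text.lower()                  # convert to lowercase
    text = text.translate(PUNCT_TABLE)   # remove punctuation
    tokens = [t for t in text.split() if t not in STOP_WORDS]  # drop stop words
    return " ".join(tokens)              # split/join also collapses extra whitespace


def clean_corpus(reviews):
    """Clean every review, then drop exact duplicates while keeping order."""
    cleaned = [clean_review(r) for r in reviews]
    return list(dict.fromkeys(cleaned))


sample = [
    "  The food was GREAT!!!  ",
    "the food was great",
    "Service is slow, see https://example.com/menu for details.",
]
print(clean_corpus(sample))
# ['food great', 'service slow see for details']
```

The first two sample reviews collapse to the same cleaned string, so only one of them survives duplicate removal.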

Exploratory Data Analysis (EDA)

The figure above shows the top 10 words that appear in negative review text. Customers often mention "customer_service" when they are about to give a bad review, which is expected because customer service is one of the main touch points between a customer and the restaurant. Bigrams such as "go_back", "come_back", and "first_time" are mentioned as well, but they are probably preceded by negation words such as "not" or "don't".

In the corresponding image for positive reviews, bigrams such as "highly recommended", "love place", and "staff friendly" appear, which is common in positive restaurant reviews. Hence, the model is expected to classify positive reviews more accurately than negative reviews.
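
As a rough illustration of how such bigram counts can be produced, the sketch below counts adjacent word pairs with `collections.Counter`. The function name and the three toy reviews are made up for this example; they are not taken from the dataset, and the notebook may build its bigrams differently.

```python
from collections import Counter


def top_bigrams(reviews, n=10):
    """Count adjacent word pairs across already-cleaned reviews."""
    counts = Counter()
    for review in reviews:
        tokens = review.split()
        # zip pairs each token with its successor to form bigrams
        counts.update("_".join(pair) for pair in zip(tokens, tokens[1:]))
    return counts.most_common(n)


# Toy negative reviews, invented for illustration only.
negative_reviews = [
    "never go back customer service terrible",
    "customer service rude not come back",
    "first time here customer service slow",
]
print(top_bigrams(negative_reviews, n=1))
# [('customer_service', 3)]
```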

Based on the sentiment distribution graph, most instances are positive reviews, while neutral reviews are the fewest. The sentiments are generated by categorising ratings above 3 as positive, ratings below 3 as negative, and a rating of exactly 3 as neutral.
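
The rating-to-sentiment rule described above can be written as a small function. This is a minimal sketch; the function name and the sample ratings are illustrative assumptions.

```python
from collections import Counter


def rating_to_sentiment(stars):
    """Map a 1-5 star rating to a sentiment label."""
    if stars > 3:
        return "positive"
    if stars < 3:
        return "negative"
    return "neutral"


# Invented ratings, used only to show the mapping.
ratings = [5, 4, 5, 1, 3, 2, 5, 4]
distribution = Counter(rating_to_sentiment(r) for r in ratings)
print(distribution)
# Counter({'positive': 5, 'negative': 2, 'neutral': 1})
```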

Model Selected

The models selected to perform text classification are as follows:

Logistic Regression
Light Gradient Boosting Machine (LightGBM)
CNN-LSTM Deep Learning Model with Word2Vec Embedding
Extreme Gradient Boosting (XGBoost)

I am responsible for developing the CNN-LSTM model, while my teammates are each responsible for one of the remaining models.

Model Architecture

The figure above shows the architecture of the proposed CNN-LSTM model. It starts with an embedding layer that holds the Word2Vec embeddings fitted on the training set.
Dropout and batch normalization layers are used to reduce overfitting.
A convolutional layer extracts the key local features of the embedded text, which are then passed to the LSTM layer.
The LSTM layer is bidirectional so that the model can capture dependencies in both directions of the review text, since the meaning of a word can depend on the words before and after it.
Before the fully connected part (a single Dense layer), the LSTM outputs are reduced with max pooling, which lowers the dimensionality and keeps the model efficient.
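
To make the data flow through these layers concrete, the sketch below traces per-sample tensor shapes through the stack in plain Python. All sizes (sequence length 200, embedding size 100, 64 filters, kernel size 5, 32 LSTM units) are hypothetical, since the actual values are not stated here. It also assumes an unpadded convolution, a bidirectional LSTM that returns the full sequence and concatenates both directions, and a 5-way softmax output. Dropout and batch normalization are omitted because they do not change shapes.

```python
def conv1d_output_length(seq_len, kernel_size, stride=1):
    """Output length of a 1D convolution without padding."""
    return (seq_len - kernel_size) // stride + 1


def global_max_pool(sequence):
    """Keep the largest value of each feature across all time steps."""
    return [max(feature) for feature in zip(*sequence)]


def trace_shapes(seq_len, embed_dim, filters, kernel_size, lstm_units, num_classes):
    """Return (layer name, per-sample output shape) for each stage."""
    conv_len = conv1d_output_length(seq_len, kernel_size)
    return [
        ("embedding (Word2Vec)", (seq_len, embed_dim)),
        ("conv1d", (conv_len, filters)),
        # Forward and backward LSTM states are concatenated (assumption).
        ("bidirectional lstm", (conv_len, 2 * lstm_units)),
        ("global max pooling", (2 * lstm_units,)),
        ("dense (softmax)", (num_classes,)),
    ]


# Hypothetical sizes, chosen only for illustration.
for name, shape in trace_shapes(200, 100, 64, 5, 32, 5):
    print(f"{name:22s} {shape}")

# Pooling a toy sequence of 3 time steps with 2 features each.
print(global_max_pool([[0.1, 0.9], [0.7, 0.2], [0.3, 0.4]]))
# [0.7, 0.9]
```

Max pooling collapses the time dimension, so the Dense layer sees one fixed-size vector per review regardless of the review's length.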

Training/Testing Splits

The dataset is sampled down to 400,000 (400K) instances due to machine constraints.
The sampled data is split into training and testing sets at a ratio of 8:2.
The training set is further partitioned 9:1, with 10% of it used as the validation set while fitting the model.
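
The arithmetic of these splits can be checked with a short standard-library sketch. The function name and the random seed are assumptions for this example; the notebook may use a library helper and stratified sampling instead.

```python
import random


def split_indices(n_samples, test_ratio=0.2, val_ratio=0.1, seed=42):
    """Shuffle row indices, hold out a test set, then carve validation from train."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)

    n_test = round(n_samples * test_ratio)
    test, train_full = indices[:n_test], indices[n_test:]

    # The validation share is taken from the training portion, not the whole set.
    n_val = round(len(train_full) * val_ratio)
    val, train = train_full[:n_val], train_full[n_val:]
    return train, val, test


train, val, test = split_indices(400_000)
print(len(train), len(val), len(test))
# 288000 32000 80000
```

With 400K samples, the 8:2 split leaves 320K rows for training, of which 32K are then held out for validation.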

Baseline Performance

The figure above shows the accuracy in predicting each rating class (ranging from 1 to 5).
As predicted, the model correctly classifies most positive reviews (rating of 5) but performs poorly on the other classes. This may be due to the ambiguity between ratings 1 and 2, and between ratings 4 and 5, because the ratings in each pair carry similar sentiment.
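
Per-class accuracy of this kind can be computed as the share of reviews of each true rating that the model predicts correctly. The sketch below is one way to do so; the function name and the toy labels are invented for illustration and do not reproduce the reported results.

```python
from collections import defaultdict


def per_class_accuracy(y_true, y_pred):
    """For each true class, the fraction of its samples predicted correctly."""
    total = defaultdict(int)
    correct = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    return {label: correct[label] / total[label] for label in sorted(total)}


# Toy labels, made up to mimic a model that favours rating 5.
y_true = [1, 1, 2, 2, 3, 4, 4, 5, 5, 5]
y_pred = [1, 2, 1, 2, 3, 5, 5, 5, 5, 5]
print(per_class_accuracy(y_true, y_pred))
# {1: 0.5, 2: 0.5, 3: 1.0, 4: 0.0, 5: 1.0}
```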

Hyperparameter Tuning

Hyperparameter tuning is performed so that the model runs with a well-performing set of hyperparameters.
Grid Search is used for my model because it scans every combination in the given parameter grid.
Its main disadvantage is that it takes a lot of time.
An alternative approach is Random Search, but because it samples only a subset of the combinations, it may miss the best one.
The parameter grid passed to the tuner is shown above.
CNN filters is the number of filters in the convolutional layer.
Hidden units is the number of neurons in the fully connected (Dense) layer.
Dropout rate is the fraction of input units dropped by the Dropout layer.
Learning rate controls the step size used by the optimizer (Adam) when updating the model.
Batch size is the number of samples processed in one iteration of the fitting process.
An EarlyStopping callback controls the number of epochs used to train the models, within a range of 1 to 20.
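
The exhaustive nature of Grid Search can be sketched in a few lines. The grid values below are hypothetical (the actual grid is only shown in the figure), and `toy_evaluate` is a placeholder standing in for "train the model and return its validation accuracy"; it is not a real training routine.

```python
from itertools import product

# Hypothetical grid; the real values are not reproduced here.
param_grid = {
    "cnn_filters": [32, 64],
    "hidden_units": [64, 128],
    "dropout_rate": [0.2, 0.5],
    "learning_rate": [1e-3, 1e-4],
    "batch_size": [64, 128],
}


def iter_grid(grid):
    """Yield every combination of the grid as a dict of parameters."""
    keys = list(grid)
    for values in product(*(grid[key] for key in keys)):
        yield dict(zip(keys, values))


def grid_search(grid, evaluate):
    """Evaluate every combination and return the best one with its score."""
    best_params, best_score = None, float("-inf")
    for params in iter_grid(grid):
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_evaluate(params):
    """Placeholder score; a real run would fit the model and validate it."""
    return -abs(params["dropout_rate"] - 0.5) - abs(params["cnn_filters"] - 64) / 100


combos = list(iter_grid(param_grid))
best_params, best_score = grid_search(param_grid, toy_evaluate)
print(len(combos), best_params)
```

Five parameters with two values each already give 2^5 = 32 full training runs, which is why Grid Search becomes slow as the grid grows.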

Evaluation of Tuned Model

Based on the graph above, the accuracy for most classes increased substantially compared to the baseline model.
The extreme classes, ratings 1 and 5, reach around 80% accuracy, which is expected because they are less ambiguous and therefore easier to classify.
For ratings 2 and 4, the ambiguity is higher: a rating-2 review can read like a rating-1 review, and a rating-4 review like a rating-5 review, so the model finds it difficult to tell them apart.
Rating 3 is difficult for the same reason as ratings 2 and 4.
Class imbalance may not be the main problem, because rating 1 achieved around 80% accuracy despite being a minority class.
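
One way to check whether errors really fall on neighbouring ratings is a confusion matrix. The sketch below builds one in plain Python and measures the share of errors that are off by exactly one star. The function names and toy labels are assumptions for illustration, not the model's actual predictions.

```python
def confusion_matrix(y_true, y_pred, labels):
    """Rows are true ratings, columns are predicted ratings."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for truth, pred in zip(y_true, y_pred):
        matrix[index[truth]][index[pred]] += 1
    return matrix


def adjacent_error_share(matrix):
    """Fraction of misclassifications that land on a neighbouring class."""
    errors = adjacent = 0
    for i, row in enumerate(matrix):
        for j, count in enumerate(row):
            if i != j:
                errors += count
                if abs(i - j) == 1:
                    adjacent += count
    return adjacent / errors if errors else 0.0


# Toy labels, invented to show confusion between neighbouring ratings.
labels = [1, 2, 3, 4, 5]
y_true = [1, 1, 2, 2, 3, 3, 4, 4, 5, 5]
y_pred = [1, 1, 1, 2, 2, 4, 5, 4, 5, 5]

matrix = confusion_matrix(y_true, y_pred, labels)
for row in matrix:
    print(row)
print(adjacent_error_share(matrix))
# 1.0
```

A share close to 1.0 would support the view that ambiguity between neighbouring ratings, rather than class imbalance, drives most errors.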

Conclusion & Recommendations

In text classification, ambiguity in the text is the main challenge to tackle, as it strongly affects the model's performance.
As discussed in the analysis, performance suffers because the model cannot distinguish rating 2 from rating 1, or rating 4 from rating 5.
Therefore, the ambiguity could be reduced by keeping the features that are unique to each class. For instance, the least common words in each rating that also appear in other ratings could be filtered out so that the difference between the classes becomes larger.
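
The recommendation above is open to several readings. The sketch below shows one possible interpretation: within each rating, drop words that are rare in that rating (at most `max_count` occurrences) and that also occur in some other rating. The function name, the threshold, and the toy reviews are all assumptions; this approach has not been evaluated in this project.

```python
from collections import Counter


def filter_overlapping_rare_words(reviews_by_rating, max_count=1):
    """Drop words that are rare in a rating and shared with another rating."""
    counts = {
        rating: Counter(word for text in texts for word in text.split())
        for rating, texts in reviews_by_rating.items()
    }
    filtered = {}
    for rating, texts in reviews_by_rating.items():
        # Vocabulary of every other rating.
        other_vocab = set()
        for other, counter in counts.items():
            if other != rating:
                other_vocab.update(counter)
        drop = {
            word
            for word, n in counts[rating].items()
            if n <= max_count and word in other_vocab
        }
        filtered[rating] = [
            " ".join(w for w in text.split() if w not in drop) for text in texts
        ]
    return filtered


# Toy cleaned reviews for two neighbouring ratings, invented for illustration.
toy = {
    1: ["terrible food terrible service", "terrible wait okay"],
    2: ["okay food slow service", "slow slow wait"],
}
print(filter_overlapping_rare_words(toy))
# {1: ['terrible terrible', 'terrible'], 2: ['slow', 'slow slow']}
```

In the toy data, the shared low-frequency words disappear and each rating keeps only its distinctive vocabulary.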

Credits

My teammates:

@bipolarpineapple (Cheryl)
Lau Joe Ying
Cheong Jun Xian
