An assignment project that analyzes review text and builds a machine learning model to classify reviews into 5 rating classes. A CNN-LSTM model with Word2Vec embeddings is developed to classify the text.

HkFromMY/review-classification

Yelp Ratings Classification using Deep Learning

Problem Statement

As of 2020, 63.6% of consumers are likely to browse online reviews before visiting a restaurant, and 90% of respondents said that their buying decisions are greatly influenced by positive star ratings (Sun, 2022). This indicates a pressing need for effective machine learning algorithms that analyse and categorise the large volume of online reviews and ratings. This project focuses on restaurant reviews, to help customers make informed decisions and improve their dining experiences.

Dataset Sources

Note: The file extracted from the zip archive needs the '.zip' extension (or another compressed-file extension) added again before it can be extracted to get the real data.

image

Data Preprocessing Steps

  1. Stripping extra whitespace
  2. Conversion to lowercase
  3. Punctuation removal
  4. Stop words removal
  5. Duplicates removal
  6. URLs removal
  7. Lemmatization
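
The steps above can be sketched with the standard library alone. This is a minimal illustration, not the project's actual pipeline: the stop-word list is a tiny hypothetical sample, the function names are my own, and lemmatization is left out because it needs an external library such as NLTK or spaCy. The sketch removes URLs before punctuation (an ordering assumption), since stripping punctuation first would break the URL pattern.

```python
import re
import string

# Tiny illustrative stop-word list (assumption); a real pipeline would use a fuller one.
STOP_WORDS = {"a", "an", "and", "the", "is", "was", "to", "of", "in", "it", "at"}

URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_review(text):
    """Apply URL removal, lowercasing, punctuation and stop-word removal."""
    text = URL_PATTERN.sub(" ", text)    # remove URLs while they are still intact
    text = text.lower()
    text = text.translate(PUNCT_TABLE)   # drop ASCII punctuation
    tokens = text.split()                # split() also collapses extra whitespace
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return " ".join(tokens)


def preprocess(reviews):
    """Clean every review and drop empty results and duplicates, keeping order."""
    seen = set()
    cleaned = []
    for review in reviews:
        text = clean_review(review)
        if text and text not in seen:
            seen.add(text)
            cleaned.append(text)
    return cleaned
```

Here duplicates are removed after cleaning, so two reviews that differ only in case, punctuation, or spacing count as the same review.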

Exploratory Data Analysis (EDA)

[Image: top 10 terms in negative review text]

  • The figure above shows the 10 most frequent terms in negative review text. Customers mention "customer_service" when they are about to give a bad review, which is expected because customer service is one of the main touch points for a customer. Bigrams such as "go_back", "come_back", and "first_time" are mentioned as well, but they are probably preceded by negation words such as "not" or "don't".
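
Terms such as "customer_service" are bigrams joined with an underscore. A minimal sketch of how such counts could be produced from tokenised reviews is shown below; the function name and the input format (a list of token lists) are assumptions for illustration.

```python
from collections import Counter


def top_bigrams(token_lists, k=10):
    """Count adjacent word pairs across reviews and return the k most common."""
    counts = Counter()
    for tokens in token_lists:
        # zip(tokens, tokens[1:]) yields each pair of neighbouring tokens
        counts.update("_".join(pair) for pair in zip(tokens, tokens[1:]))
    return counts.most_common(k)
```

Because stop words and negations are usually removed before counting, a bigram like "go_back" can come from "would not go back", which matches the observation above.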

[Image: top bigrams in positive review text]

  • In the image above, positive bigrams such as "highly recommended", "love place", and "staff friendly" appear, which is common in positive restaurant reviews. Because these phrases are distinctive, the model should classify positive reviews correctly more often than negative reviews.

[Image: distribution of review sentiments]

  • Based on the graph above, most instances are positive reviews, and the fewest are neutral. The sentiments are generated by categorising ratings above 3 as positive, ratings below 3 as negative, and a rating of 3 as neutral.
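
The rating-to-sentiment rule described above can be written as a small function. The function name is hypothetical; the thresholds follow the text.

```python
def rating_to_sentiment(rating):
    """Map a 1-5 star rating to a sentiment label."""
    if rating > 3:
        return "positive"
    if rating < 3:
        return "negative"
    return "neutral"
```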

Model Selected

The models selected to perform text classification are as follows:

  1. Logistic Regression
  2. Light Gradient Boosting Machine (LightGBM)
  3. CNN-LSTM Deep Learning Model with Word2Vec Embedding
  4. Extreme Gradient Boosting (XGBoost)

I was responsible for developing the CNN-LSTM model, while my teammates were responsible for the remaining models.

Model Architecture

[Image: CNN-LSTM model architecture]

  • The figure above shows the architecture of the proposed CNN-LSTM model. It includes an embedding layer that holds the Word2Vec embeddings fitted on the training set.
  • Dropout and batch normalization layers are used to reduce overfitting.
  • A convolutional layer extracts the key features of the text, which are then passed to the LSTM layer.
  • The LSTM is wrapped in a bidirectional layer so that the model can capture dependencies in both directions of the review text, which suits the nature of language.
  • Before the fully connected part (which consists of only 1 Dense layer), the LSTM outputs are reduced with max pooling so that the model stays efficient.
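
The layer order described above could look like the Keras sketch below. It is an approximation under stated assumptions, not the exact model in the figure: the kernel size, layer sizes, dropout placement, the extra softmax output layer, and the function name `build_cnn_lstm` are all my own choices, and `embedding_matrix` stands in for the Word2Vec vectors.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers


def build_cnn_lstm(embedding_matrix, max_len=200, filters=64, kernel_size=3,
                   lstm_units=64, hidden_units=64, dropout=0.3, num_classes=5):
    """Embedding -> Conv1D -> BiLSTM -> max pooling -> Dense -> softmax."""
    vocab_size, embed_dim = embedding_matrix.shape
    inputs = keras.Input(shape=(max_len,), dtype="int32")
    # Frozen embedding layer initialised from pre-trained vectors (assumed Word2Vec)
    x = layers.Embedding(
        vocab_size, embed_dim,
        embeddings_initializer=keras.initializers.Constant(embedding_matrix),
        trainable=False,
    )(inputs)
    x = layers.Conv1D(filters, kernel_size, padding="same", activation="relu")(x)
    x = layers.BatchNormalization()(x)
    x = layers.Dropout(dropout)(x)
    # return_sequences=True keeps one output per time step for the pooling layer
    x = layers.Bidirectional(layers.LSTM(lstm_units, return_sequences=True))(x)
    x = layers.GlobalMaxPooling1D()(x)
    x = layers.Dense(hidden_units, activation="relu")(x)
    x = layers.Dropout(dropout)(x)
    outputs = layers.Dense(num_classes, activation="softmax")(x)
    return keras.Model(inputs, outputs)
```

With integer labels 0 to 4, such a model would typically be compiled with the Adam optimizer and `sparse_categorical_crossentropy` loss.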

Training/Testing Splits

  • A sample of 400 thousand (400K) instances is drawn from the entire dataset due to machine constraints.
  • The sample is split into training and testing sets at a ratio of 8:2.
  • The training set is further partitioned 9:1, where 10% of the training set is used as the validation set while fitting the model.
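
With 400K instances, these ratios give 80K test instances, 32K validation instances, and 288K training instances. The index-based sketch below shows the arithmetic; the shuffling, the seed, and the function name are assumptions, since the text does not say how the split was randomised or whether it was stratified.

```python
import random


def split_indices(n, test_fraction=0.2, val_fraction=0.1, seed=42):
    """Shuffle indices, hold out a test set, then carve validation out of training."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_test = round(n * test_fraction)
    test = indices[:n_test]
    train_full = indices[n_test:]
    # The validation share is taken from the remaining training portion
    n_val = round(len(train_full) * val_fraction)
    val = train_full[:n_val]
    train = train_full[n_val:]
    return train, val, test
```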

Baseline Performance

[Image: baseline accuracy per rating class]

  • The figure above shows the accuracy for each rating class (1 to 5).
  • As predicted, the model correctly classifies most positive reviews (rating of 5) but performs badly on the other classes. This may be due to the ambiguity between classes 1 and 2 and between classes 4 and 5, because each pair contains similar sentiments.
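
Per-class accuracy can be computed as the share of each true rating that the model predicts correctly. The sketch below assumes that definition (it is the same as per-class recall); the plot above may have been produced differently.

```python
from collections import defaultdict


def per_class_accuracy(y_true, y_pred):
    """Return {label: fraction of that label's instances predicted correctly}."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    return {label: correct[label] / total[label] for label in sorted(total)}
```

Reporting accuracy per class instead of one overall figure exposes the case described above, where a model scores well on rating 5 while doing poorly elsewhere.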

Hyperparameter Tuning

  • Hyperparameter tuning is performed so that the model runs with the best set of hyperparameters found.
  • Grid Search is used for my model because it scans all possibilities in the given parameter grid.
  • The only disadvantage is that Grid Search takes a lot of time.
  • An alternative approach is Random Search, but it may not find the global optimum.

[Image: parameter grid]

  • The above is the parameter grid that is passed to the tuner to tune the model.
  • CNN filters represents the filter size in the convolutional layer.
  • Hidden units is the number of neurons in the fully connected layer (Dense layer).
  • Dropout rate is the fraction of input units dropped by the Dropout layer.
  • Learning rate controls the step size used by the optimizer (Adam).
  • Batch size refers to the number of samples used in 1 iteration of the fitting process.
  • I used an EarlyStopping callback to control the number of epochs used to train the models, within a range of 1 to 20.
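
Grid Search evaluates every combination in the grid, which is why its cost grows quickly. The sketch below shows the idea in plain Python; the grid values are hypothetical placeholders (the real grid is only shown in the image), and `evaluate` is an assumed callback that would train a model and return its validation score.

```python
from itertools import product

# Hypothetical values for illustration only; not the grid used in the project.
PARAM_GRID = {
    "cnn_filters": [32, 64],
    "hidden_units": [64, 128],
    "dropout_rate": [0.2, 0.5],
    "learning_rate": [1e-3, 1e-4],
    "batch_size": [64, 128],
}


def iter_grid(grid):
    """Yield one dict per combination of hyperparameter values."""
    keys = list(grid)
    for values in product(*(grid[key] for key in keys)):
        yield dict(zip(keys, values))


def grid_search(grid, evaluate):
    """Try every combination and keep the one with the highest score."""
    best_params, best_score = None, float("-inf")
    for params in iter_grid(grid):
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score
```

This placeholder grid has 2 × 2 × 2 × 2 × 2 = 32 combinations, so it would need 32 training runs, each of up to 20 epochs.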

Evaluation of Tuned Model

[Image: tuned model accuracy per rating class]

  • Based on the graph above, the accuracy for most classes increased drastically compared to the baseline model.
  • The extreme classes, ratings of 1 and 5, reach around 80% accuracy, which is expected because they are less ambiguous and therefore easier to classify.
  • For ratings of 2 and 4, the ambiguity increases because their review text resembles that of the neighbouring extreme ratings (2 resembles 1, and 4 resembles 5), so the model finds it difficult to tell them apart.
  • A rating of 3 is hard to classify for the same reason as ratings of 2 and 4.
  • Class imbalance may not be the main problem, because the rating of 1 achieved 80% accuracy despite being the minority class.

Conclusion & Recommendations

  • In text classification, ambiguity in the text is the main challenge to tackle, as it has a large impact on the model's performance.
  • As discussed in the analysis, the model's performance suffers because it cannot distinguish a rating of 2 from a rating of 1, or a rating of 4 from a rating of 5.
  • Therefore, the ambiguity can be reduced by keeping the features that are unique to each class. For instance, we can filter out the least common words in each rating that also appear in other ratings, so that the difference between classes becomes larger.
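
One possible reading of this recommendation is sketched below: keep, for each rating, only the words that appear in at most a given number of rating classes. This is an untested idea rather than something evaluated in the project, and the function name, input format, and threshold are assumptions.

```python
from collections import Counter


def distinctive_vocab(tokens_by_rating, max_shared_classes=1):
    """Keep the words of each rating that occur in at most max_shared_classes ratings."""
    class_count = Counter()
    for tokens in tokens_by_rating.values():
        class_count.update(set(tokens))  # count each word once per rating class
    return {
        rating: {t for t in set(tokens) if class_count[t] <= max_shared_classes}
        for rating, tokens in tokens_by_rating.items()
    }
```

A strict threshold of 1 would discard most of the vocabulary on real data, so a looser threshold or a frequency-weighted score may be more practical.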

Credits

My teammates:

  1. @bipolarpineapple (Cheryl)
  2. Lau Joe Ying
  3. Cheong Jun Xian
