Approach -

We found that the dataset was imbalanced, so we used SMOTE (Synthetic Minority Over-sampling Technique) to balance it.
Then we created some hand-crafted features from the reviews, such as num_of_words, num_unique_words, and mean_word_length.
Then we cleaned the text: we removed punctuation and stop words, and applied lemmatization and stemming.
We used a TF-IDF vectorizer to generate numeric features from the cleaned text.
For modeling, we used RandomForestClassifier. Accuracy could likely be improved further with other ensemble techniques, such as boosting or stacking.
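
The hand-crafted features and the first cleaning steps can be sketched roughly as follows. This is a minimal illustration and not the project's actual code: the function names `manual_features` and `clean_text`, the whitespace tokenization, and the tiny `STOP_WORDS` set are all assumptions made for the example. Lemmatization and stemming are left out because they need an external NLP library.

```python
import string

# Tiny illustrative stop-word list (an assumption; the real list is not given).
STOP_WORDS = {"the", "a", "an", "and", "is", "it", "was", "to", "of"}


def manual_features(review: str) -> dict:
    """Compute simple count-based features from a raw review string."""
    words = review.split()  # naive whitespace tokenization (assumption)
    num_words = len(words)
    return {
        "num_of_words": num_words,
        "num_unique_words": len({w.lower() for w in words}),
        # Guard against empty reviews to avoid division by zero.
        "mean_word_length": (
            sum(len(w) for w in words) / num_words if num_words else 0.0
        ),
    }


def clean_text(review: str) -> str:
    """Lowercase the text, strip ASCII punctuation, and drop stop words."""
    no_punct = review.translate(str.maketrans("", "", string.punctuation))
    tokens = [t for t in no_punct.lower().split() if t not in STOP_WORDS]
    return " ".join(tokens)
```

In this sketch the features are computed on the raw review, before cleaning, so punctuation attached to a word counts toward its length. The cleaned string is what would then be passed to the TF-IDF vectorizer.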
