Problem | Data | Methods | Libs | Link |
---|---|---|---|---|
NLP | Text | Naive Bayes, SVM, Random Forest Classifier, Deep Learning (LSTM), Word2Vec | Sklearn, Keras, Gensim, Pandas, Seaborn | https://github.com/erdiolmezogullari/ml-spam-sms-classification |

For more ML projects, see my main repo: https://github.com/erdiolmezogullari/ml-projects
In this project, we applied supervised learning (classification) algorithms and deep learning (LSTM) to SMS spam filtering.
We used a public SMS Spam dataset, which is not entirely clean. The data consists of two columns (features): context and class. The context column holds the SMS text. The class column takes one of two values, `spam` or `ham`, labeling the corresponding SMS.
Before applying any supervised learning method, we ran several data cleansing operations, since some of the SMS text in the dataset is broken or messy.
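
As a rough illustration of this kind of cleanup, the sketch below normalizes a few raw rows. The specific rules (trimming whitespace, lowercasing labels, dropping rows with an empty message or an unknown label), the sample rows, and the name `clean_rows` are assumptions made for illustration, not the exact operations used in this project.

```python
import re

VALID_LABELS = {"ham", "spam"}


def clean_rows(rows):
    """Return (label, text) pairs with normalized whitespace and valid labels only."""
    cleaned = []
    for label, text in rows:
        label = (label or "").strip().lower()
        # Collapse runs of whitespace (tabs, newlines, repeated spaces) into one space.
        text = re.sub(r"\s+", " ", text or "").strip()
        if label in VALID_LABELS and text:
            cleaned.append((label, text))
    return cleaned


# Made-up rows in (class, context) order.
raw = [
    (" Ham ", "Ok,  see you\tat 5"),
    ("spam", "WINNER!! Claim your   prize now"),
    ("spam", "   "),        # empty message: dropped
    ("???", "broken row"),  # unknown label: dropped
]

print(clean_rows(raw))
```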
After obtaining the cleaned dataset, we created tokens and lemmas of the SMS corpus separately using spaCy, and then generated bag-of-words and TF-IDF representations of the corpus. In addition to these transformations, we also applied SVD, SVC, and PCA to reduce the dimensionality of the dataset.
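
The transformation chain can be sketched on a toy corpus as follows. This is a minimal example under stated assumptions: the messages are made up, tokenization uses `CountVectorizer`'s default tokenizer rather than spaCy lemmas, and `n_components=2` is an arbitrary choice for the tiny vocabulary.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Toy stand-in for the preprocessed SMS corpus (made-up messages).
corpus = [
    "free prize claim now",
    "see you at lunch",
    "claim your free ticket now",
    "are you coming to lunch",
]

# Bag-of-words: one row per message, one column per vocabulary term.
bow = CountVectorizer().fit_transform(corpus)

# TF-IDF: reweight the raw counts so terms shared by many messages count less.
tfidf = TfidfTransformer().fit_transform(bow)

# Truncated SVD accepts the sparse TF-IDF matrix directly, without densifying it.
svd = TruncatedSVD(n_components=2, random_state=0)
reduced = svd.fit_transform(tfidf)

print(bow.shape, tfidf.shape, reduced.shape)
```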
To manage the data transformations consistently across the training and testing phases and to avoid data leakage, we used Sklearn's `Pipeline` class. We added each data transformation step (e.g. `bag-of-words`, `TF-IDF`, `SVC`) and a classifier (e.g. `Naive Bayes`, `SVM`, `Random Forest Classifier`) to an instance of the `Pipeline` class.
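
A minimal sketch of such a pipeline is shown below. The messages, labels, step names, and the choice of `MultinomialNB` as the classifier are assumptions for illustration; the project's actual steps and hyperparameters may differ.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up messages and labels, standing in for the cleaned SMS dataset.
texts = [
    "free prize claim now",
    "win cash call now",
    "claim your free ticket",
    "urgent call to win prize",
    "see you at lunch",
    "are you coming home",
    "call me after lunch",
    "meeting moved to monday",
]
labels = ["spam", "spam", "spam", "spam", "ham", "ham", "ham", "ham"]

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=0, stratify=labels
)

pipeline = Pipeline([
    ("bow", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", MultinomialNB()),
])

# fit() learns the vocabulary and IDF weights from the training split only,
# so nothing about the test messages leaks into the transformations.
pipeline.fit(X_train, y_train)

# predict() reuses the fitted transformations before calling the classifier.
predictions = pipeline.predict(X_test)
print(list(zip(X_test, predictions)))
```

Because the vectorizer and the classifier live in one object, swapping the final step for another classifier leaves the preprocessing untouched.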
After applying those supervised learning methods, we also tried deep learning, with an architecture based on LSTM. To use an LSTM in Keras (TensorFlow), we needed an embedding matrix for our corpus, so we used Gensim's Word2Vec to obtain it rather than TF-IDF.
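
Building the embedding matrix can be sketched as below. To keep the example self-contained, a plain dictionary of random vectors stands in for the trained Word2Vec vectors (with Gensim these would come from the model's `wv` attribute); the vocabulary, `word_index`, and `EMBEDDING_DIM` are hypothetical.

```python
import numpy as np

EMBEDDING_DIM = 4  # hypothetical; real Word2Vec vectors are usually much larger

# Stand-in for trained Word2Vec vectors (word -> vector); values are random.
rng = np.random.default_rng(0)
word_vectors = {
    word: rng.normal(size=EMBEDDING_DIM)
    for word in ["free", "prize", "lunch", "call"]
}

# Word index as a tokenizer might produce it; index 0 is reserved for padding.
word_index = {"free": 1, "prize": 2, "lunch": 3, "call": 4, "unseenword": 5}

# Row i holds the vector of the word with index i. Rows for padding and for
# words missing from the Word2Vec vocabulary stay all-zero.
embedding_matrix = np.zeros((len(word_index) + 1, EMBEDDING_DIM))
for word, idx in word_index.items():
    vector = word_vectors.get(word)
    if vector is not None:
        embedding_matrix[idx] = vector

print(embedding_matrix.shape)
```

A matrix of this shape would typically be supplied as the initial weights of a Keras `Embedding` layer placed in front of the LSTM layer.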
After running each classifier, we plotted its confusion matrix to compare the classifiers and determine which one is best at filtering spam SMS.
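
For reference, a confusion matrix for one classifier can be computed as in the sketch below. The true and predicted labels are made up, not results from this project, and `spam` is treated as the positive class.

```python
from sklearn.metrics import confusion_matrix

# Made-up true and predicted labels for eight messages.
y_true = ["ham", "ham", "ham", "spam", "spam", "ham", "spam", "ham"]
y_pred = ["ham", "ham", "spam", "spam", "ham", "ham", "spam", "ham"]

# Rows are true classes, columns are predicted classes, in the order of `labels`.
cm = confusion_matrix(y_true, y_pred, labels=["ham", "spam"])

# With "spam" as the positive class, the flattened matrix reads tn, fp, fn, tp.
tn, fp, fn, tp = cm.ravel()
precision = tp / (tp + fp)
recall = tp / (tp + fn)

print(cm)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

The resulting array can then be drawn as a heatmap, for example with Seaborn's `heatmap`, to compare classifiers side by side.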