SPILTER

In this project I build a machine learning model that classifies email and SMS messages as spam or not spam.

What It Does:

Preview:

demo.mp4

How It Works:

The text and the target class are extracted from the dataset. A TF-IDF vectorizer turns the text into input features, and a standard MultinomialNB classifier then labels each message as spam or not spam.
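The steps above can be sketched end to end on a handful of messages. This is a minimal illustration, not the project's actual training code: the four messages and their labels are invented, and encoding spam as 1 is an assumption.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy messages invented for illustration; not taken from the real dataset.
messages = [
    "Win a free prize now, call this number",
    "Free entry to win cash, text now",
    "Are we still meeting for lunch today?",
    "Can you send me the report tonight?",
]
labels = [1, 1, 0, 0]  # assumed encoding: 1 = spam, 0 = not spam

# TF-IDF turns each message into a vector of weighted term frequencies;
# MultinomialNB then learns per-class term distributions from those vectors.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(messages, labels)

predictions = model.predict(
    ["Call now to win a free prize", "See you at lunch today"]
)
print(predictions)
```

Wrapping both steps in one pipeline means new messages go through the same vectorizer that was fitted on the training text.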

Prerequisites:

I highly recommend having a toolchain and development environment installed and ready before the hack night. If you have no idea where to start, try a combination like:

Python
scikit-learn / sklearn
Pandas
NumPy
matplotlib
An environment to work in - something like Jupyter or Spyder

For Linux users, your package manager should be able to handle all of this. If it can't, install at least Python and pip, then use pip to install the packages above.
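One way to confirm the packages are installed is to query their versions from Python. The sketch below uses only the standard library; the helper name check_packages is made up for this example.

```python
from importlib import metadata

# Distribution names as published on PyPI (note: "scikit-learn", not "sklearn").
REQUIRED = ["scikit-learn", "pandas", "numpy", "matplotlib"]


def check_packages(names):
    """Map each package name to its installed version, or None if it is missing."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


status = check_packages(REQUIRED)
for name, version in status.items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```

Any package reported as missing can then be installed with pip.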

Dataset:

The SMS/Email Spam Collection is a set of tagged SMS messages collected for SMS/Email spam research. It contains one set of 5,567 SMS messages in English, each tagged as either ham (legitimate) or spam.

You can download the raw dataset from here.

The files contain one message per line. Each line has two columns:

Class - contains the label (ham or spam)
Message - contains the raw text.
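A file in this two-column layout can be read with pandas roughly as follows. This is a sketch under assumptions: the tab separator, the absence of a header row, and the ham=0 / spam=1 encoding of the target column are not stated above, and the two sample lines are invented.

```python
import csv
import io

import pandas as pd

# Two invented lines standing in for the real file.
raw = "ham\tSee you at lunch today\nspam\tWin a free prize now\n"

df = pd.read_csv(
    io.StringIO(raw),           # replace with the path to the downloaded file
    sep="\t",                   # assumption: columns are tab-separated
    header=None,                # assumption: the file has no header row
    names=["Class", "Message"],
    quoting=csv.QUOTE_NONE,     # keep quote characters inside messages as plain text
)

# Assumed numeric encoding of the label for use as a training target.
df["target"] = df["Class"].map({"ham": 0, "spam": 1})
print(df)
```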

Model Pipeline:

Components:

Use TF-IDF to extract features from the message text.
Use splits suited to skewed data (there are far more ham messages than spam messages, so the data is not balanced).
Use different standard classifiers to classify the SMS/Emails.
Compare the classifiers using standard classification metrics.
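The components above can be sketched together on a small imbalanced corpus. Everything here is illustrative: the 40 ham and 10 spam messages are generated from invented templates, the choice of the three classifiers is an assumption, and stratify=y is one common way to split skewed data rather than necessarily the method used in this project.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

# Invented, imbalanced toy corpus: 40 ham and 10 spam messages.
ham = [f"meeting at {h} about project {h % 7}" for h in range(40)]
spam = [f"win free cash prize {s} call now" for s in range(10)]
texts = ham + spam
y = [0] * len(ham) + [1] * len(spam)  # assumed encoding: 1 = spam

X = TfidfVectorizer().fit_transform(texts)

# stratify=y keeps the ham/spam ratio the same in the training and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=2
)

classifiers = {
    "MultinomialNB": MultinomialNB(),
    "BernoulliNB": BernoulliNB(),
    "LogisticRegression": LogisticRegression(max_iter=1000),
}

results = {}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    results[name] = (
        accuracy_score(y_test, y_pred),
        precision_score(y_test, y_pred, zero_division=0),
    )
    print(f"{name}: accuracy={results[name][0]:.3f}, precision={results[name][1]:.3f}")
```

On skewed data, accuracy alone can look good even when spam is rarely caught, which is why precision is reported alongside it.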

Accuracy Result:

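from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, precision_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# df is assumed to be the preprocessed DataFrame with
# "transformed_text" (cleaned message) and "target" (label) columns.

# Keep the 3000 most frequent terms and convert each message to a TF-IDF vector
tfidf = TfidfVectorizer(max_features=3000)
x = tfidf.fit_transform(df["transformed_text"]).toarray()
y = df["target"].values

# Hold out 20% of the messages for testing
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=2)

# Train Multinomial Naive Bayes and evaluate it on the held-out messages
mnb = MultinomialNB()
mnb.fit(x_train, y_train)
y_pred = mnb.predict(x_test)
print("MultinomialNB TfidfVectorizer with max_features=3000")
print(f"accuracy: {accuracy_score(y_test, y_pred)}")
print(f"precision: {precision_score(y_test, y_pred)}")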

Multinomial Naive Bayes with TfidfVectorizer having max_features=3000

accuracy: 0.9709864603481625
precision: 1.0
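To see how precision can reach 1.0 while accuracy stays below 1.0, it helps to work through the definitions on a small example. The labels below are invented and assume spam is the positive class (label 1): precision = TP / (TP + FP), so it is 1.0 whenever no ham message is flagged as spam, even if some spam is missed.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score

# Invented labels: 6 ham (0) and 4 spam (1) messages.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
# Two spam messages slip through as ham; no ham is flagged as spam.
y_pred = [0, 0, 0, 0, 0, 0, 1, 1, 0, 0]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)                   # 2 / (2 + 0) = 1.0
accuracy = (tp + tn) / (tp + tn + fp + fn)   # (2 + 6) / 10 = 0.8

print(f"precision: {precision}, accuracy: {accuracy}")
```

For a spam filter this trade-off is often acceptable: a missed spam message is an annoyance, while a legitimate message sent to the spam folder may never be read.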



arjunan-k/Spilter
