CSCI544-Natural_Language_Processing

Code developed for the CSCI544 course taught by Prof. Mark Core.

Language Used - Python 3.7

1. Spam Filtering using Naive Bayes Classifier

Basic Functionalities : nbLearn, nbClassify, nbEvaluate

Accuracy : Spam F1 Score - 98%, Ham F1 Score - 95%
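As a rough illustration of the learn / classify / evaluate split, here is a minimal sketch of a multinomial Naive Bayes classifier with a per-class F1 score. It is not the code in this repository: the function names `nb_learn`, `nb_classify` and `f1_score`, the add-one smoothing, and the choice to skip tokens never seen in training are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict


def nb_learn(docs):
    """Train on (tokens, label) pairs; return priors, per-class token counts, vocabulary."""
    class_docs = Counter()
    token_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in docs:
        class_docs[label] += 1
        token_counts[label].update(tokens)
        vocab.update(tokens)
    total = sum(class_docs.values())
    priors = {label: n / total for label, n in class_docs.items()}
    return priors, token_counts, vocab


def nb_classify(tokens, model):
    """Return the label with the highest log-probability (add-one smoothing assumed)."""
    priors, token_counts, vocab = model
    best_label, best_score = None, float("-inf")
    for label, prior in priors.items():
        denom = sum(token_counts[label].values()) + len(vocab)
        score = math.log(prior)
        for tok in tokens:
            if tok in vocab:  # assumption: tokens unseen in training are ignored
                score += math.log((token_counts[label][tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def f1_score(gold, predicted, label):
    """Harmonic mean of precision and recall for one class."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == label and p == label)
    fp = sum(1 for g, p in pairs if g != label and p == label)
    fn = sum(1 for g, p in pairs if g == label and p != label)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy data, made up for the example
train = [
    (["free", "money", "now"], "spam"),
    (["win", "free", "prize"], "spam"),
    (["meeting", "at", "noon"], "ham"),
    (["lunch", "meeting", "today"], "ham"),
]
model = nb_learn(train)
print(nb_classify(["free", "prize"], model))
```

Working in log space avoids numeric underflow when many small token probabilities are multiplied together.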

Enhanced Functionalities : modification.py, m_classify.py

Replaced numbers with a single placeholder token, "NUMBER"
Added a stopword filter using common stopwords from the NLTK corpus
Added a stopword filter using handpicked tokens such as "Subject:", ":", "", "the", "and"...

Accuracy : Spam F1 Score - 99%, Ham F1 Score - 97%
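The enhancements listed above can be sketched as a single token-cleaning step. This is only an illustration, not the contents of modification.py: the function name `normalize_tokens`, the regular expression used to detect numbers, and the case-insensitive stopword match are assumptions, and the stopword set contains only the handpicked tokens quoted above.

```python
import re

# Only the handpicked tokens named in this README; the full list is not reproduced here
HANDPICKED_STOPWORDS = {"Subject:", ":", "", "the", "and"}

# Assumption: a "number" is a run of digits, optionally with . or , separators
NUMBER_PATTERN = re.compile(r"^\d+([.,]\d+)*$")


def normalize_tokens(tokens, stopwords=HANDPICKED_STOPWORDS):
    """Drop stopwords and replace numeric tokens with the placeholder NUMBER."""
    cleaned = []
    for tok in tokens:
        if tok in stopwords or tok.lower() in stopwords:
            continue
        cleaned.append("NUMBER" if NUMBER_PATTERN.match(tok) else tok)
    return cleaned


print(normalize_tokens(["Subject:", "win", "1000", "dollars", "and", "The", "3.5", "", "prize"]))
```

To include the NLTK list as well, the set returned by `nltk.corpus.stopwords.words("english")` could be unioned with `HANDPICKED_STOPWORDS` and passed as the `stopwords` argument; that requires the NLTK stopwords corpus to be downloaded first.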

For more details, please refer to Report.txt.

2. Sequence Labelling

Library : pycrfsuite
Dataset : The Switchboard Corpus (SWBD) with DAMSL dialog act tag annotations. More: https://web.stanford.edu/~jurafsky/ws97/manual.august1.html

Baseline Features :

Accuracy : 62 %

Advanced Features :

Last Utterance in a Dialogue
First Token in an Utterance
Last Token in an Utterance
First POS in an Utterance
Last POS in an Utterance
Bigrams of Tokens
Bigrams of POS
Bigram of last token in the previous utterance and first token in the current utterance
Individual words in the text field (the Text column in the CSV, with noise)

Accuracy : 67 %
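A minimal sketch of how the advanced features listed above might be assembled for one utterance is shown below. It is not the repository's implementation: the function name `utterance_features`, its parameters, and the feature-name prefixes are hypothetical, and splitting the raw text field on whitespace is an assumption. Features are returned as a list of strings, which is one of the item formats pycrfsuite accepts.

```python
def utterance_features(tokens, pos_tags, prev_tokens=None, is_last=False, raw_text=None):
    """Build a list of string features for one utterance in a dialogue."""
    feats = []
    if is_last:
        feats.append("LAST_UTTERANCE")
    if tokens:
        feats.append("FIRST_TOKEN_" + tokens[0])
        feats.append("LAST_TOKEN_" + tokens[-1])
    if pos_tags:
        feats.append("FIRST_POS_" + pos_tags[0])
        feats.append("LAST_POS_" + pos_tags[-1])
    # Bigrams over adjacent tokens and adjacent POS tags
    feats.extend("TOKEN_BIGRAM_" + a + "_" + b for a, b in zip(tokens, tokens[1:]))
    feats.extend("POS_BIGRAM_" + a + "_" + b for a, b in zip(pos_tags, pos_tags[1:]))
    # Bigram spanning the utterance boundary
    if prev_tokens and tokens:
        feats.append("CROSS_BIGRAM_" + prev_tokens[-1] + "_" + tokens[0])
    # Assumption: words of the raw text field are taken by whitespace splitting
    if raw_text:
        feats.extend("TEXT_WORD_" + w for w in raw_text.split())
    return feats


# Made-up utterance for illustration
feats = utterance_features(
    tokens=["yeah", "i", "agree"],
    pos_tags=["UH", "PRP", "VBP"],
    prev_tokens=["do", "you", "think", "so"],
    is_last=True,
    raw_text="Yeah, {F uh, } I agree.",
)
print(feats)
```

A dialogue then becomes a list of such feature lists, paired with a list of dialog act labels, which can be passed to `pycrfsuite.Trainer.append(xseq, yseq)` for training.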

For more details, please refer to Report.txt.
