Code developed for CSCI544 Course taught by Prof. Mark Core.
Language Used - Python 3.7
Basic Functionalities : nbLearn, nbClassify, nbEvaluate
Accuracy : Spam F1 Score - 98%, Ham F1 Score - 95%
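The F1 scores above are the harmonic mean of precision and recall for each class. A minimal sketch of that calculation follows; the function name and the count-based signature are illustrative assumptions, not the actual interface of nbEvaluate.

```python
def f1_score(tp, fp, fn):
    """Return F1 for one class from true-positive, false-positive and
    false-negative counts. Hypothetical helper, not taken from nbEvaluate."""
    # Precision: fraction of predicted positives that are correct.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: fraction of actual positives that were found.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    # Harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)
```

The score is computed separately for Spam and Ham by treating each label in turn as the positive class.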
Enhanced Functionalities : modification.py, m_classify.py
- Replaced numbers with a single default token - "NUMBER"
- Added a stopword filter with common stopwords from the NLTK corpus
- Added a stopword filter with handpicked tokens such as "Subject:", ":", "", "the", "and"...
Accuracy : Spam F1 Score - 99%, Ham F1 Score - 97%
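The enhancements listed above can be sketched roughly as follows. This is an illustration under assumptions, not the code in modification.py: the function name, the number pattern, and the small stopword set are hypothetical, and the NLTK stopword list is left out so the snippet runs without extra downloads.

```python
import re

# Hypothetical handpicked stopwords, taken from the examples in this README.
# The real filter also uses the NLTK English stopword list.
HANDPICKED_STOPWORDS = {"Subject:", ":", "", "the", "and"}

# Assumed definition of a "number": digits with optional . or , groups.
NUMBER_RE = re.compile(r"^\d+(?:[.,]\d+)*$")


def preprocess(tokens, stopwords=HANDPICKED_STOPWORDS):
    """Drop stopwords and map numeric tokens to the single token NUMBER."""
    cleaned = []
    for tok in tokens:
        if tok in stopwords:
            continue  # filtered out (case-sensitive match)
        if NUMBER_RE.match(tok):
            tok = "NUMBER"  # collapse all numbers into one token
        cleaned.append(tok)
    return cleaned
```

For example, preprocess("Subject: win 1000 dollars and the 3.5 prize".split()) keeps only "win", "NUMBER", "dollars", "NUMBER", "prize".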
For more details, please refer to Report.txt
Library : pycrfsuite
Dataset : The Switchboard Corpus (SWBD) Dialog Tags Annotations (DAMSL). More => https://web.stanford.edu/~jurafsky/ws97/manual.august1.html
Baseline Features :
- Speaker changed from previous Utterance
- First Utterance
- Token in an Utterance
- Part-of-Speech tag in an Utterance
Accuracy : 62%
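A minimal sketch of how the baseline features could be extracted is shown here. The utterance layout (dicts with "speaker", "tokens" and "pos" keys), the feature names, and the function name are assumptions for illustration, not the format used in this repository. Each utterance becomes a list of feature strings, which is a form pycrfsuite accepts for training sequences.

```python
def baseline_features(dialogue):
    """Build one list of feature strings per utterance in a dialogue.

    `dialogue` is assumed to be a list of dicts with keys
    "speaker", "tokens" and "pos" (hypothetical layout).
    """
    features = []
    prev_speaker = None
    for i, utt in enumerate(dialogue):
        feats = []
        if i == 0:
            feats.append("FIRST_UTTERANCE")
        elif utt["speaker"] != prev_speaker:
            feats.append("SPEAKER_CHANGED")
        # One feature per token and per part-of-speech tag.
        feats.extend("TOKEN_" + t for t in utt["tokens"])
        feats.extend("POS_" + p for p in utt["pos"])
        features.append(feats)
        prev_speaker = utt["speaker"]
    return features
```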
Advanced Features :
- Last Utterance in a Dialogue
- First Token in an Utterance
- Last Token in an Utterance
- First POS in an Utterance
- Last POS in an Utterance
- Bigrams of Tokens
- Bigrams of POS
- Bigram of last token in the previous utterance and first token in the current utterance
- Individual word in the text field (Text Column in the CSV with Noise)
Accuracy : 67%
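Most of the advanced features above can be sketched as below. As before, the utterance layout (dicts with "tokens" and "pos" lists), the feature names, and the helper names are hypothetical, and the raw text-field words are omitted. Empty utterances are guarded against on the assumption that noise-only rows may have no tokens.

```python
def bigrams(items):
    """Return adjacent pairs joined with an underscore."""
    return [a + "_" + b for a, b in zip(items, items[1:])]


def advanced_features(dialogue):
    """Build the additional feature strings for each utterance.

    `dialogue` is assumed to be a list of dicts with "tokens" and "pos".
    """
    features = []
    prev_last_token = None
    for i, utt in enumerate(dialogue):
        tokens, pos = utt["tokens"], utt["pos"]
        feats = []
        if i == len(dialogue) - 1:
            feats.append("LAST_UTTERANCE")
        if tokens:
            feats.append("FIRST_TOKEN_" + tokens[0])
            feats.append("LAST_TOKEN_" + tokens[-1])
        if pos:
            feats.append("FIRST_POS_" + pos[0])
            feats.append("LAST_POS_" + pos[-1])
        feats.extend("TOKEN_BIGRAM_" + b for b in bigrams(tokens))
        feats.extend("POS_BIGRAM_" + b for b in bigrams(pos))
        # Bigram spanning the utterance boundary.
        if prev_last_token is not None and tokens:
            feats.append("CROSS_BIGRAM_" + prev_last_token + "_" + tokens[0])
        features.append(feats)
        prev_last_token = tokens[-1] if tokens else None
    return features
```

In practice these lists would be concatenated with the baseline features for the same utterance before training.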
For more details, please refer to Report.txt