There is a wealth of fancy NLP algorithms available today, particularly those using transformers, and they have overshadowed many of the basics of NLP such as clustering and classification. However, simple algorithms are much easier to scale and often provide an excellent baseline before building more complicated models.
I'm going to work through the Twitter Disasters dataset originally made available by Crowdflower. I found it currently available here.
In this notebook, I use a simple logistic regression classifier on a dataset of 10,000 tweets to predict whether a tweet refers to a true "disaster" event or is irrelevant. I focus on the interpretability of simple classification models and what that means for text data. I look at a few methods of creating text embeddings for NLP tasks, and I demonstrate the significant advantages of incorporating semantic meaning into those tasks using models such as Word2Vec.
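As a rough illustration of the kind of baseline described above, here is a minimal sketch of a TF-IDF bag-of-words representation fed into a logistic regression classifier with scikit-learn. The tweets and labels are hypothetical toy examples made up for this sketch, not rows from the actual dataset, and the default hyperparameters are assumptions rather than the settings used in the notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy tweets (NOT from the real dataset): 1 = disaster, 0 = irrelevant.
tweets = [
    "Massive earthquake hits the city, buildings collapsed",
    "Wildfire spreading fast, residents evacuated tonight",
    "Flood waters rising, emergency crews rescue families",
    "This new album is a total disaster, do not buy it",
    "My exam today was an absolute train wreck lol",
    "Traffic was on fire this morning, so slow",
]
labels = [1, 1, 1, 0, 0, 0]

# TF-IDF turns each tweet into a sparse weighted word-count vector;
# logistic regression then learns one weight per vocabulary word.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(tweets, labels)

vectorizer = model.named_steps["tfidfvectorizer"]
classifier = model.named_steps["logisticregression"]

# Recover the vocabulary in column order so words line up with coefficients.
vocab = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
weights = classifier.coef_[0]

# Words with the largest positive weights push a tweet toward "disaster".
ranked = sorted(zip(weights, vocab), reverse=True)
for weight, word in ranked[:5]:
    print(f"{word:12s} {weight:+.3f}")
```

Because the classifier is linear, each learned weight maps directly to one word, which is what makes this kind of baseline easy to inspect before moving on to dense embeddings.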
I extend the discussion of interpretability by using LIME to understand predictions made on Word2Vec embeddings. Finally, I attempt to incorporate the syntactic structure of tweets into a model's predictions by building a 1D CNN for text classification on top of Word2Vec embeddings.
This notebook is meant as an introduction to NLP that gives direction on where and how to improve the performance of text clustering and classification algorithms, whether through further dataset processing, which is commonly productive, or through a more complicated model.
The steps I've taken here constitute a good baseline to get started on an NLP project, though by no means are they comprehensive.
- Intro
- The Dataset
- Text Embeddings
- Fitting a Classifier for Baseline Performance
- TF-IDF Bag of Words
- Understanding Semantic Meaning
- Interpretability with LIME
- Using CNNs for Text Classification
- Take-Away
If you find this work useful in your own research, please cite as follows:
@misc{Zenkov-NLP-basics,
author = {Zenkov, Ilia},
title = {NLP-basics-keras-nltk-lime},
year = {2020},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/IliaZenkov/NLP-basics-keras-nltk-lime}},
}