Classification of tweets pertinent to disaster events. NLP basics with a focus on text embedding, interpretability and the use of LIME, and Keras to build a 1D CNN

IliaZenkov/NLP-keras-nltk-lime

NLP Quick Start

There is a wealth of sophisticated NLP algorithms available today, particularly those built on transformers, and they have overshadowed many of the basics of NLP such as clustering and classification. However, simple algorithms are much easier to scale and often provide an excellent baseline before building more complicated models.

I'm going to work through the Twitter Disasters dataset originally made available by Crowdflower. I found it currently available here.

Abstract

In this notebook, I use a simple logistic regression classifier on a dataset of 10,000 tweets to predict whether a tweet refers to a true "disaster" event or is irrelevant. I focus on the interpretability of simple classification models and what that means for text data. I look at a few methods of creating text embeddings for NLP tasks, and I explain and demonstrate the significant advantages of incorporating semantic meaning using models such as Word2Vec.
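To make the idea of an interpretable linear classifier concrete, here is a minimal sketch in plain Python. It is not the notebook's code: the toy tweets, the helper names (`tokenize`, `build_vocab`, `vectorize`, `train`), and the hyperparameters are all assumptions made for illustration, and a bag-of-words count vector stands in for the embeddings discussed above. The point is that each learned weight maps directly to one word, so the model can be read off by sorting the weights.

```python
import math
from collections import Counter

# Hypothetical toy tweets (1 = disaster, 0 = irrelevant); NOT from the real dataset.
TWEETS = [
    ("forest fire spreading near homes", 1),
    ("earthquake damage reported downtown", 1),
    ("flood waters rising evacuate now", 1),
    ("this new album is great", 0),
    ("lovely sunny day at the beach", 0),
    ("watching a movie with friends", 0),
]

def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()

def build_vocab(texts):
    """Map every distinct token to a column index."""
    tokens = sorted({tok for text in texts for tok in tokenize(text)})
    return {tok: i for i, tok in enumerate(tokens)}

def vectorize(text, vocab):
    """Bag-of-words count vector; out-of-vocabulary tokens are ignored."""
    vec = [0.0] * len(vocab)
    for tok, count in Counter(tokenize(text)).items():
        if tok in vocab:
            vec[vocab[tok]] = float(count)
    return vec

def predict_proba(x, weights, bias):
    """Logistic regression: sigmoid of a weighted sum of word counts."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))

def train(X, y, lr=0.5, epochs=200):
    """Plain stochastic gradient descent on the log loss."""
    weights, bias = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for x, label in zip(X, y):
            error = predict_proba(x, weights, bias) - label
            weights = [w - lr * error * xi for w, xi in zip(weights, x)]
            bias -= lr * error
    return weights, bias

texts = [text for text, _ in TWEETS]
labels = [label for _, label in TWEETS]
vocab = build_vocab(texts)
X = [vectorize(text, vocab) for text in texts]
weights, bias = train(X, labels)
predictions = [int(predict_proba(x, weights, bias) >= 0.5) for x in X]

# Interpretability: one weight per word, so sorting the weights ranks the words.
ranked = sorted(vocab, key=lambda tok: weights[vocab[tok]], reverse=True)
print("most disaster-like:", ranked[:3])
print("least disaster-like:", ranked[-3:])
```

On real data one would normally reach for a library implementation and a held-out test set; the hand-rolled training loop here is only meant to keep the example dependency-free.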

I extend the interpretability analysis by using LIME to understand predictions made on Word2Vec embeddings. Finally, I attempt to incorporate the syntactic structure of tweets into a model's predictions by building a 1D CNN for text classification on top of Word2Vec embeddings.
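As a rough illustration of why a 1D convolution can pick up word order where an averaged embedding cannot, the sketch below slides a single width-2 filter over a sequence of word vectors, applies a ReLU, and takes a global max. Everything in it is assumed for the example: the 2-dimensional `EMBEDDINGS` table, the hand-set `KERNEL`, and the function names are hypothetical and are not taken from the notebook, which builds its model in Keras on real Word2Vec vectors.

```python
# Hypothetical 2-d embeddings for illustration only; real Word2Vec vectors
# are learned from a corpus and typically have hundreds of dimensions.
EMBEDDINGS = {
    "forest": [1.0, 0.0],
    "fire": [0.0, 1.0],
    "is": [0.0, 0.0],
    "spreading": [0.5, 0.5],
}

def embed(tokens):
    """Look up one vector per token, preserving word order."""
    return [EMBEDDINGS[tok] for tok in tokens]

def conv1d_relu(sequence, kernel, bias=0.0):
    """Slide a (width x dim) kernel along the sequence, then apply ReLU."""
    width = len(kernel)
    activations = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        z = bias + sum(
            w * x
            for kernel_row, word_vec in zip(kernel, window)
            for w, x in zip(kernel_row, word_vec)
        )
        activations.append(max(0.0, z))
    return activations

def global_max_pool(activations):
    """Keep only the strongest response of the filter over the whole tweet."""
    return max(activations)

# Hand-set filter (an assumption): responds most to "forest" followed by "fire".
KERNEL = [[1.0, 0.0], [0.0, 1.0]]

original = conv1d_relu(embed("forest fire is spreading".split()), KERNEL)
shuffled = conv1d_relu(embed("fire forest is spreading".split()), KERNEL)
print(original, global_max_pool(original))
print(shuffled, global_max_pool(shuffled))
```

Both inputs contain exactly the same words, so an averaged embedding would be identical for the two, yet the filter responds differently because it sees adjacent words together. In a trained CNN the filter values are learned rather than set by hand, and many filters of several widths are usually used in parallel.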

This notebook is meant to serve as an intro to NLP and to give direction on where and how to proceed in improving the performance of text clustering and classification algorithms, whether that means further dataset processing (a commonly productive endeavour) or employing a more complicated model.

The steps I've taken here constitute a good baseline to get started on an NLP project, though by no means are they comprehensive.

Cite

If you find this work useful in your own research, please cite as follows:

@misc{Zenkov-NLP-basics,
  author = {Zenkov, Ilia},
  title = {NLP-basics-keras-nltk-lime},
  year = {2020},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/IliaZenkov/NLP-basics-keras-nltk-lime}},
}

License

License: MIT
