In this two-part blog post I go over the classic problem of Twitter sentiment analysis. I found labeled Twitter data with 1.6 million tweets on the Kaggle website here. Through this analysis I'll touch on a few different topics related to natural language processing and, more generally, big data. While 1.6 million tweets is not a substantial amount of data and does not require working with Spark, I wanted to use Spark for both ETL and machine learning, since I haven't seen many examples of how to do so in the context of sentiment analysis.
In the first part I go over Extract-Transform-Load (ETL) operations on text data using PySpark and MongoDB, expanding on some details of Spark along the way. I then show how one can explore the data in the Mongo database using Compass and PyMongo. Spark is a great platform for performing batch ETL work on both structured and unstructured data. MongoDB is a document-based NoSQL database that is fast, easy to use, allows for flexible schemas, and is perfect for working with text data. PySpark and MongoDB work well together, allowing for fast, flexible ETL pipelines on large semi-structured data like tweets from Twitter. While Part 1 is presented as a Jupyter notebook, the ETL job was submitted as the script BasicETL.py in the ETL directory.
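To make that flow concrete, here is a minimal sketch of the kind of ETL step Part 1 performs: read the raw tweet CSV with PySpark, keep just the fields needed downstream, and write the result to MongoDB through the MongoDB Spark connector. The file path, column names, and database/collection names below are placeholder assumptions, and the connector itself has to be supplied separately (for example via spark-submit --packages); see BasicETL.py in the ETL directory for the actual job.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Placeholder URI -- point this at your own MongoDB instance.
MONGO_URI = "mongodb://localhost:27017/twitter.tweets"

spark = (SparkSession.builder
         .appName("BasicETL-sketch")
         .config("spark.mongodb.output.uri", MONGO_URI)
         .getOrCreate())

# The Sentiment140 CSV has no header row; these column names are assumptions.
raw = (spark.read
       .csv("training.1600000.processed.noemoticon.csv", inferSchema=True)
       .toDF("target", "id", "date", "flag", "user", "text"))

# Keep only the fields needed later and normalize the label to 0/1.
tweets = (raw.select("id", "text", "target")
             .withColumn("sentiment", (col("target") == 4).cast("integer"))
             .drop("target"))

# Write to MongoDB; requires the mongo-spark connector on the classpath.
(tweets.write
       .format("mongo")
       .mode("overwrite")
       .save())
```

Once the data is in MongoDB, it can be browsed visually with Compass or queried from Python with PyMongo, e.g. MongoClient()["twitter"]["tweets"].count_documents({"sentiment": 1}) (again, the database and collection names are assumptions).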
In the second part I go over the actual machine learning aspect of sentiment analysis, using Spark ML and ML Pipelines to build a basic linear classifier. After building a basic model for sentiment analysis, I'll introduce techniques to improve performance, such as removing stop words and using N-grams. I also introduce a custom Spark Transformer class that uses NLTK to perform stemming. Lastly, I'll review hyper-parameter tuning with cross-validation to optimize the model. Using PySpark on this dataset was a little too much for my personal laptop, so I used Spark on a Hadoop cluster with Google Cloud's Dataproc and Datalab. I'll touch on a few of the details of working with Hadoop and Google Cloud as well.
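As a preview of the modeling approach, the sketch below wires Spark ML feature transformers and a logistic regression into a Pipeline and tunes the regularization strength with cross-validation. The column names, the HashingTF/IDF featurization, and the parameter grid are illustrative assumptions rather than the exact settings used in the notebooks.

```python
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.feature import HashingTF, IDF, StopWordsRemover, Tokenizer
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

# Assumes a DataFrame `tweets` with a string column "text" and an
# integer label column "sentiment" (0 = negative, 1 = positive).
tokenizer = Tokenizer(inputCol="text", outputCol="words")
remover = StopWordsRemover(inputCol="words", outputCol="filtered")
tf = HashingTF(inputCol="filtered", outputCol="tf", numFeatures=2**16)
idf = IDF(inputCol="tf", outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="sentiment", maxIter=50)

pipeline = Pipeline(stages=[tokenizer, remover, tf, idf, lr])

# Small illustrative grid; cross-validation picks the best regularization.
grid = (ParamGridBuilder()
        .addGrid(lr.regParam, [0.01, 0.1, 1.0])
        .build())

evaluator = BinaryClassificationEvaluator(labelCol="sentiment")
cv = CrossValidator(estimator=pipeline,
                    estimatorParamMaps=grid,
                    evaluator=evaluator,
                    numFolds=3)

train, test = tweets.randomSplit([0.8, 0.2], seed=42)
model = cv.fit(train)
print("Test AUC:", evaluator.evaluate(model.transform(test)))
```

N-gram features and the custom NLTK stemming Transformer mentioned above would slot in as additional Pipeline stages in the same way.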
Part 1 was completed on my laptop and therefore all the dependencies were installed using miniconda. The required dependencies can be installed using the command,
conda env create -n sparketl -f environment.yml
Part 2 was completed on Google Cloud using the Dataproc 1.3 image; the commands to recreate this environment are in the GCP directory, and the Python dependencies to be loaded onto the Hadoop cluster are in the requirements.txt file.