Xflow

A real-time tweet streaming pipeline with a deep learning sentiment analysis model for instant scoring.

The real-time tweet streaming pipeline is built with Apache Flume, Apache Kafka, and Spark Streaming.
The LSTM-based sentiment analysis model is built with Keras on the TensorFlow backend and uses word embeddings.
The model is exposed as a RESTful service, so it can be called from any HTTP client.

Usage

Clone this repo. Make sure Maven is installed, since it is used for the build. Then go to the root directory of the project and run:

mvn clean install

This trains the LSTM-based deep learning sentiment analysis model and exposes it as a RESTful service.
The training dataset is 'Sentiment Analysis Dataset.csv', downloaded from https://bit.ly/1TVSjsF .
The RESTful service is hosted at http://localhost:5003/
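
This README does not show how the service itself is implemented. As a rough sketch only, using just the Python standard library, a handler that accepts a JSON body of the form {"data": "..."} on port 5003 could look like the following. The score function is a placeholder standing in for the trained model, and the "sentiment" and "error" response keys are hypothetical; the real service may respond differently.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def score(text):
    """Placeholder for the trained LSTM model (hypothetical): fixed neutral score."""
    return 0.5


def handle_payload(raw):
    """Turn a raw request body (bytes) into (status_code, response_dict)."""
    try:
        text = json.loads(raw.decode("utf-8"))["data"]
    except (ValueError, KeyError, TypeError):
        # Malformed JSON, non-object JSON, or a missing "data" field.
        return 400, {"error": "expected a JSON object with a 'data' field"}
    return 200, {"sentiment": score(text)}


class SentimentHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read exactly the number of bytes the client announced.
        length = int(self.headers.get("Content-Length", 0))
        status, payload = handle_payload(self.rfile.read(length))
        body = json.dumps(payload).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


if __name__ == "__main__":
    # Port 5003 matches the address given in this README.
    HTTPServer(("localhost", 5003), SentimentHandler).serve_forever()
```

Keeping the request parsing in handle_payload, separate from the HTTP handler, makes it possible to check the parsing without starting a server.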

The sentiment prediction for any tweet or sentence can be obtained by sending a POST request as follows:

curl --header "Content-Type: application/json" --request POST --data '{"data":"YOUR TWEET HERE"}' http://localhost:5003/
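
The same request can be sent from Python. The following is a minimal sketch using only the standard library, assuming the service is running at the address above. The helper names build_request and predict are illustrative, not part of this project, and because the response format is not documented here, the body is returned as raw text.

```python
import json
import urllib.request

SERVICE_URL = "http://localhost:5003/"


def build_request(tweet, url=SERVICE_URL):
    """Build the POST request carrying the tweet as {"data": ...}."""
    body = json.dumps({"data": tweet}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(tweet, url=SERVICE_URL, timeout=5.0):
    """Send the tweet to the service and return the raw response body."""
    with urllib.request.urlopen(build_request(tweet, url), timeout=timeout) as resp:
        return resp.read().decode("utf-8")


if __name__ == "__main__":
    # Requires the service to be up; prints whatever the service returns.
    print(predict("YOUR TWEET HERE"))
```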

Download Confluent Open Source from https://www.confluent.io/download/ (tested on v5.0). Extract it and, inside the extracted directory, run the following command:

bin/confluent start

Download and extract the Flume binary from https://flume.apache.org/download.html
Clone Cloudera's cdh-twitter-example repo from https://github.com/cloudera/cdh-twitter-example

The flume-sources directory contains a Maven project with Cloudera's custom Flume source, which connects to the Twitter Streaming API and ingests tweets in raw JSON format.

$ cd flume-sources  
$ mvn package
$ cd ..

Add the JAR to the Flume classpath by copying flume-sources-1.0-SNAPSHOT.jar to apache-flume-latest-version-bin/plugins.d/twitter-streaming/lib/
Tweets are ingested in raw JSON format and pushed to a Kafka sink. The Flume configuration is in FlumeConfig.conf, and the agent is named TwitterAgent. Start the agent with:

bin/flume-ng agent --conf conf --conf-file FlumeConfig.conf --name TwitterAgent -Dflume.root.logger=INFO,console

Spark Streaming consumes the tweets from the Kafka queue. The sentiment prediction for each tweet is obtained by sending a POST request to the RESTful service as described above.
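
The per-record work on the consumer side can be sketched in plain Python. This is an illustration under assumptions, not this project's Spark job: it assumes each Kafka record is one raw tweet JSON string with the tweet body in a "text" field, and that a predict callable (hypothetical) wraps the POST request to the service. Records that are not valid JSON or carry no text, such as delete notices in the stream, are skipped.

```python
import json


def extract_text(raw_tweet):
    """Return the tweet text from one raw JSON record, or None if unusable."""
    try:
        tweet = json.loads(raw_tweet)
    except (ValueError, TypeError):
        return None
    if not isinstance(tweet, dict):
        return None
    text = tweet.get("text")
    # Ignore records without a non-empty string in "text".
    return text if isinstance(text, str) and text.strip() else None


def score_partition(records, predict):
    """Yield (text, prediction) pairs for the records of one partition.

    `predict` is any callable mapping tweet text to a prediction,
    for example a function that POSTs to the sentiment service.
    """
    for raw in records:
        text = extract_text(raw)
        if text is not None:
            yield text, predict(text)
```

If the job were written in PySpark, a generator of this shape could be applied per partition, for example with rdd.mapPartitions(lambda it: score_partition(it, predict)), so that each partition issues its own requests to the service.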