Spark Twitter Sentiment Analysis

This repository implements a machine learning model that analyzes tweets and predicts whether they are positive or negative. The programming language is Scala.

Dataset Description

This repository uses a Twitter dataset from Kaggle. The training set contains 100k examples and the test set contains 300k examples. The data is provided in CSV format. The data is very irregular and requires preprocessing.

+--------+-----------+----------------------------------+
| ItemID | Sentiment | SentimentText                    |
+--------+-----------+----------------------------------+
|   1    |     0     | is so sad for my APL friend..... |
|   2    |     0     | I missed the New Moon trailer... |
|   3    |     1     | omg its already 7:30 :O          |
|  ...   |    ...    | ...                              |
+--------+-----------+----------------------------------+
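
The sample rows show the kind of noise that preprocessing has to handle: mixed case, runs of punctuation, and emoticons. The repository's actual cleaning rules are not shown here, so the following is only a minimal plain-Scala sketch under assumed rules (lowercasing, dropping URLs and @mentions, stripping punctuation). The object name `TweetCleaner` and every pattern in it are hypothetical.

```scala
object TweetCleaner {
  // Assumed noise patterns, chosen for illustration only.
  private val Url     = """https?://\S+|www\.\S+""".r
  private val Mention = """@\w+""".r
  private val NonWord = """[^a-z0-9\s']""".r
  private val Spaces  = """\s+""".r

  /** Lowercases a tweet, drops URLs and @mentions, strips punctuation, collapses whitespace. */
  def clean(text: String): String = {
    val lower      = text.toLowerCase
    val noUrls     = Url.replaceAllIn(lower, " ")
    val noMentions = Mention.replaceAllIn(noUrls, " ")
    val wordsOnly  = NonWord.replaceAllIn(noMentions, " ")
    Spaces.replaceAllIn(wordsOnly, " ").trim
  }
}
```

For example, `TweetCleaner.clean("is so sad for my APL friend.....")` yields `is so sad for my apl friend`. In a Spark job the same function could be wrapped in a UDF and applied to the SentimentText column.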

Training process structure

The overall model development flow is outlined in the pipeline diagram (pipeline.png).

The flow is implemented as a pipeline in file.scala.
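
The vectorizer parameters tuned later in this README (a vocabSize for tokens and one for n-grams) suggest that the pipeline turns each tweet into term-count features. As a rough illustration of that idea, and not the repository's Spark code, the plain-Scala sketch below tokenizes text, builds n-grams, keeps only the most frequent terms, and counts them per document. The object name `BagOfWords` and its method names are hypothetical.

```scala
object BagOfWords {
  /** Splits text on whitespace, ignoring empty pieces. */
  def tokens(text: String): Seq[String] =
    text.split("\\s+").filter(_.nonEmpty).toVector

  /** Joins every run of n consecutive tokens into one n-gram string. */
  def ngrams(toks: Seq[String], n: Int): Seq[String] =
    if (toks.length < n) Vector.empty
    else toks.sliding(n).map(_.mkString(" ")).toVector

  /** Keeps the vocabSize most frequent terms; ties are broken alphabetically. */
  def fitVocabulary(docs: Seq[Seq[String]], vocabSize: Int): Vector[String] =
    docs.flatten
      .groupBy(identity)
      .map { case (term, hits) => (term, hits.size) }
      .toVector
      .sortBy { case (term, count) => (-count, term) }
      .take(vocabSize)
      .map(_._1)

  /** Counts how often each vocabulary term occurs in one document. */
  def vectorize(doc: Seq[String], vocab: Vector[String]): Vector[Int] =
    vocab.map(term => doc.count(_ == term))
}
```

A smaller vocabulary size drops rare terms and shortens the feature vector, which is why it is worth tuning alongside the classifier settings.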

ML Model

The classifier is a Logistic Regression model. Using a Spark parameter grid and CrossValidator, the model is tuned and cross-validated in parallel.
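
To make the classifier concrete: logistic regression scores a feature vector with a weighted sum, squashes that score into a probability with the sigmoid function, and thresholds it into a label. The plain-Scala sketch below shows only this prediction step for dense vectors. It is an illustration rather than Spark's implementation, and the object name `LogisticModel` and the example weights are hypothetical.

```scala
object LogisticModel {
  /** Logistic (sigmoid) function: maps any real score into the interval (0, 1). */
  def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

  /** Probability that a feature vector belongs to the positive class. */
  def probability(weights: Array[Double], bias: Double, features: Array[Double]): Double = {
    require(weights.length == features.length, "weights and features must have the same length")
    val score = weights.zip(features).map { case (w, x) => w * x }.sum + bias
    sigmoid(score)
  }

  /** Hard label: 1 (positive) when the probability exceeds the threshold, else 0 (negative). */
  def predict(weights: Array[Double], bias: Double, features: Array[Double],
              threshold: Double = 0.5): Int =
    if (probability(weights, bias, features) > threshold) 1 else 0
}
```

Training searches for the weights and bias that fit the labelled tweets; the tol and maxIter parameters tuned in the next section control when that search stops.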

Distributed Hyperparameter Search

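A single parameter map fits the pipeline with one fixed set of values, while the parameter grid lists the values to search over:

import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.tuning.ParamGridBuilder

// One fixed combination of parameters, applied at fit time.
val paramMap = new ParamMap()
      .put(tokenVectorizer.vocabSize, 10000)
      .put(ngramVectorizer.vocabSize, 10000)
      .put(classifier.tol, 1e-20)
      .put(classifier.maxIter, 100)

val model = pipe.fit(twitterData, paramMap)

// Every combination of these values is a candidate: 2 * 2 * 3 * 3 = 36 settings.
val paramGrid = new ParamGridBuilder()
      .addGrid(tokenVectorizer.vocabSize, Array(10000, 20000))
      .addGrid(ngramVectorizer.vocabSize, Array(10000, 15000))
      .addGrid(classifier.tol, Array(1e-20, 1e-10, 1e-5))
      .addGrid(classifier.maxIter, Array(100, 200, 300))
      .build()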
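
The grid above is a Cartesian product of the listed values, so its size grows multiplicatively with each added parameter. The plain-Scala sketch below expands the same values into explicit combinations to show where the candidate count comes from. It does not use Spark, and the object name `GridExpansion` and the short parameter labels are hypothetical.

```scala
object GridExpansion {
  /** Expands named value lists into every combination (a Cartesian product). */
  def expand(grid: Seq[(String, Seq[Any])]): Seq[Map[String, Any]] =
    grid.foldLeft(Seq(Map.empty[String, Any])) { case (combos, (name, values)) =>
      for (combo <- combos; value <- values) yield combo + (name -> value)
    }

  // The same values as the parameter grid in this section, under shortened labels.
  val grid: Seq[(String, Seq[Any])] = Seq(
    "tokenVocabSize" -> Seq(10000, 20000),
    "ngramVocabSize" -> Seq(10000, 15000),
    "tol"            -> Seq(1e-20, 1e-10, 1e-5),
    "maxIter"        -> Seq(100, 200, 300)
  )
}
```

`GridExpansion.expand(GridExpansion.grid).size` is 36. With the 5-fold cross-validation used in this README, that means 36 * 5 = 180 pipeline fits during the search, which is why running candidates in parallel matters.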

Cross Validation

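Each candidate in the parameter grid is scored with k-fold cross-validation (k = 5 here), and up to two candidates are fitted in parallel.

import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.CrossValidator

// Note: "prediction" holds hard 0/1 labels, so the ROC curve is built from labels, not scores.
val cv = new CrossValidator()
      .setEstimator(pipe)
      .setEvaluator(new BinaryClassificationEvaluator()
        .setRawPredictionCol("prediction")
        .setLabelCol("Sentiment"))
      .setEstimatorParamMaps(paramGrid)
      .setNumFolds(5)      // 5-fold cross-validation
      .setParallelism(2)   // fit up to 2 candidate models at once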
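
For readers new to k-fold cross-validation: the rows are split into k folds, and each fold takes one turn as the validation set while the remaining folds form the training set. The plain-Scala sketch below illustrates that splitting on row indices. It is not how Spark partitions the data internally, and the object name `KFold` and its method names are hypothetical.

```scala
import scala.util.Random

object KFold {
  /** Shuffles row indices once and deals them into k folds of near-equal size. */
  def folds(numRows: Int, k: Int, seed: Long): Vector[Vector[Int]] = {
    require(k >= 2 && k <= numRows, "need 2 <= k <= numRows")
    val shuffled = new Random(seed).shuffle((0 until numRows).toVector)
    val indexed  = shuffled.zipWithIndex
    Vector.tabulate(k) { fold =>
      indexed.collect { case (row, pos) if pos % k == fold => row }
    }
  }

  /** For each fold, returns (training rows, validation rows). */
  def splits(numRows: Int, k: Int, seed: Long): Vector[(Vector[Int], Vector[Int])] = {
    val parts = folds(numRows, k, seed)
    parts.indices.toVector.map { i =>
      val validation = parts(i)
      val training   = parts.indices.filter(_ != i).flatMap(j => parts(j)).toVector
      (training, validation)
    }
  }
}
```

With 10 rows and k = 5, every split trains on 8 rows and validates on the other 2, and each row is used for validation exactly once.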

Performance Measurement

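Model performance is measured by the area under the receiver operating characteristic (ROC) curve.

import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator

val eval = new BinaryClassificationEvaluator()
      .setLabelCol("Sentiment")
      .setRawPredictionCol("prediction")

// tr must contain the Sentiment and prediction columns; areaUnderROC is the default metric.
val roc = eval.evaluate(tr)
println(s"Area under ROC: $roc")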
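
The area under the ROC curve can be read as the probability that a randomly chosen positive tweet receives a higher score than a randomly chosen negative one, with ties counting as half. The plain-Scala sketch below computes it directly from that definition by comparing every positive/negative pair. It is meant for small illustrative inputs and is not the algorithm Spark uses; the object name `RocAuc` is hypothetical.

```scala
object RocAuc {
  /** Area under the ROC curve for (score, label) pairs, where label is 1 or 0. */
  def areaUnderRoc(scored: Seq[(Double, Int)]): Double = {
    val positives = scored.collect { case (score, 1) => score }
    val negatives = scored.collect { case (score, 0) => score }
    require(positives.nonEmpty && negatives.nonEmpty, "need at least one example of each class")
    // 1 when the positive outranks the negative, 0.5 for a tie, 0 otherwise.
    val pairScores = positives.flatMap { p =>
      negatives.map(n => if (p > n) 1.0 else if (p == n) 0.5 else 0.0)
    }
    pairScores.sum / (positives.size.toDouble * negatives.size)
  }
}
```

A value of 1.0 means every positive is ranked above every negative, 0.5 corresponds to random ranking, and values below 0.5 mean the ranking is worse than chance.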

Configuring and Running on a Cluster

With IntelliJ IDEA and the build.sbt file, a .jar file can be compiled and then deployed on a cluster with the following command:

spark-submit --master yarn --deploy-mode client path/to/jar hdfs://twitter/
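
Anything that follows the jar path on the spark-submit command line is passed to the application's main method as arguments, so `hdfs://twitter/` arrives as the first argument. How the repository's entry point handles it is not shown here; the sketch below is one hypothetical way to validate it in plain Scala. The names `SentimentJobArgs` and `JobConfig` are assumptions made for illustration.

```scala
object SentimentJobArgs {
  /** Hypothetical settings holder; only the data location is needed here. */
  final case class JobConfig(dataPath: String)

  /** Accepts exactly one non-empty argument: the data location. */
  def parse(args: Array[String]): Either[String, JobConfig] =
    args.toList match {
      case path :: Nil if path.trim.nonEmpty => Right(JobConfig(path.trim))
      case _ => Left("expected exactly one argument: <data-path>")
    }
}
```

A main method could call `SentimentJobArgs.parse(args)` first and exit with the error message on `Left`, so that a missing path fails fast before any cluster resources are used.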

Gci04/spark-twitter-sentiment-analysis
