This repository implements a machine learning model that analyzes tweets and predicts whether they are positive or negative. The programming language is Scala.
This repository uses a Twitter dataset from Kaggle. The training set contains 100k examples and the test set has 300k examples. The data is provided in CSV format. It is very irregular and requires preprocessing.
+--------+-----------+----------------------------------+
| ItemID | Sentiment | SentimentText                    |
+--------+-----------+----------------------------------+
| 1      | 0         | is so sad for my APL friend..... |
| 2      | 0         | I missed the New Moon trailer... |
| 3      | 1         | omg its already 7:30 :O          |
| ...    | ...       | ...                              |
+--------+-----------+----------------------------------+
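To give an idea of what preprocessing such irregular text can involve, here is a minimal sketch in plain Scala. The object name `TweetCleaner` and the cleaning rules (lowercasing, dropping URLs and @mentions, keeping letters only) are assumptions chosen for illustration; they are not necessarily the rules this repository applies.

```scala
object TweetCleaner {
  // Illustrative patterns (assumptions, not the repository's actual rules).
  private val Url       = "https?://\\S+".r
  private val Mention   = "@\\w+".r
  private val NonLetter = "[^a-z\\s]".r
  private val Spaces    = "\\s+".r

  /** Lowercases a tweet, drops URLs and @mentions, and keeps letters only. */
  def clean(text: String): String = {
    val lower       = text.toLowerCase
    val noUrls      = Url.replaceAllIn(lower, " ")
    val noMentions  = Mention.replaceAllIn(noUrls, " ")
    val lettersOnly = NonLetter.replaceAllIn(noMentions, " ")
    // Collapse the whitespace runs left behind by the removals.
    Spaces.replaceAllIn(lettersOnly, " ").trim
  }
}
```

Note that stripping punctuation also removes emoticons such as `:O`, which may carry sentiment, so a real cleaner might map them to tokens instead.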
The flow of the whole model development is outlined by the scheme below.
The scheme is implemented as a pipeline in file.scala.
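As a rough illustration of how such a pipeline could be assembled with Spark ML, consider the sketch below. The stage layout (tokenizer, bigrams, two count vectorizers, a vector assembler) and the intermediate column names are assumptions; only the names `tokenVectorizer`, `ngramVectorizer`, `classifier` and `pipe` and the `Sentiment` and `SentimentText` columns come from this README. The actual file.scala may differ.

```scala
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{CountVectorizer, NGram, Tokenizer, VectorAssembler}

object SentimentPipeline {
  // Split the raw tweet text into lowercase, whitespace-separated tokens.
  val tokenizer = new Tokenizer()
    .setInputCol("SentimentText")
    .setOutputCol("tokens")

  // Bag-of-words counts over single tokens.
  val tokenVectorizer = new CountVectorizer()
    .setInputCol("tokens")
    .setOutputCol("tokenFeatures")

  // Bigrams built from the same tokens (n = 2 is an assumption).
  val ngram = new NGram()
    .setN(2)
    .setInputCol("tokens")
    .setOutputCol("bigrams")

  // Bag-of-words counts over bigrams.
  val ngramVectorizer = new CountVectorizer()
    .setInputCol("bigrams")
    .setOutputCol("ngramFeatures")

  // Concatenate both count vectors into one feature vector.
  val assembler = new VectorAssembler()
    .setInputCols(Array("tokenFeatures", "ngramFeatures"))
    .setOutputCol("features")

  // Spark expects a numeric label column; cast Sentiment if it was read as a string.
  val classifier = new LogisticRegression()
    .setFeaturesCol("features")
    .setLabelCol("Sentiment")

  val pipe = new Pipeline().setStages(Array[PipelineStage](
    tokenizer, tokenVectorizer, ngram, ngramVectorizer, assembler, classifier))
}
```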
The classifier is a logistic regression model. Using Spark's parameter grid and CrossValidator, the model is tuned and cross-validated in parallel.
import org.apache.spark.ml.param.ParamMap

// Fit the pipeline once with a fixed set of hyperparameters.
val paramMap = new ParamMap()
  .put(tokenVectorizer.vocabSize, 10000)
  .put(ngramVectorizer.vocabSize, 10000)
  .put(classifier.tol, 1e-20)
  .put(classifier.maxIter, 100)
val model = pipe.fit(twitterData, paramMap)
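import org.apache.spark.ml.tuning.ParamGridBuilder

// Candidate hyperparameter values to search over (2 * 2 * 3 * 3 = 36 combinations).
val paramGrid = new ParamGridBuilder()
  .addGrid(tokenVectorizer.vocabSize, Array(10000, 20000))
  .addGrid(ngramVectorizer.vocabSize, Array(10000, 15000))
  .addGrid(classifier.tol, Array(1e-20, 1e-10, 1e-5))
  .addGrid(classifier.maxIter, Array(100, 200, 300))
  .build()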
The k-fold cross-validation procedure (5 folds):
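import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.CrossValidator

val cv = new CrossValidator()
  .setEstimator(pipe)
  .setEvaluator(new BinaryClassificationEvaluator()
    .setRawPredictionCol("prediction")
    .setLabelCol("Sentiment"))
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(5)
  .setParallelism(2) // fit up to two candidate models concurrently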
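To make the cost of this search concrete, the following plain-Scala sketch shows how k-fold splitting partitions the data and how many model fits the grid implies. The object `KFold` and its modulo-based fold assignment are illustrative assumptions; Spark's CrossValidator assigns rows to folds randomly rather than by index.

```scala
object KFold {
  /** Splits indices 0 until n into k (train, validation) pairs.
    * Fold i validates on the indices whose remainder modulo k equals i. */
  def splits(n: Int, k: Int): Seq[(Seq[Int], Seq[Int])] = {
    require(k >= 2 && k <= n, "need 2 <= k <= n")
    (0 until k).map { fold =>
      val (validation, train) = (0 until n).partition(_ % k == fold)
      (train, validation)
    }
  }

  // Number of candidate values per hyperparameter, copied from the grid in this README.
  val gridSizes: Seq[Int] = Seq(2, 2, 3, 3)
  val numFolds: Int = 5

  // Every parameter combination is fitted once per fold.
  val fitsDuringSearch: Int = gridSizes.product * numFolds
}
```

With 36 combinations and 5 folds, the search fits 180 models before the best setting is refitted on the full training data, which is why running fits in parallel matters here.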
The area under the receiver operating characteristic (ROC) curve is used to measure model performance.
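val eval = new BinaryClassificationEvaluator()
  .setLabelCol("Sentiment")
  .setRawPredictionCol("prediction") // scores the hard 0/1 predictions
// tr is assumed to be the DataFrame of predictions, e.g. the output of model.transform(...)
val roc = eval.evaluate(tr) // areaUnderROC is the evaluator's default metric
println(s"ROC: $roc")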
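For intuition about the metric, the sketch below computes the area under the ROC curve in plain Scala as the fraction of (positive, negative) pairs that the scores rank correctly, counting ties as half. The object `RocAuc` is a hypothetical helper for illustration and is not how Spark computes the value internally.

```scala
object RocAuc {
  /** Area under the ROC curve: the probability that a random positive example
    * scores higher than a random negative one (ties count as 0.5).
    * Labels are expected to be 0 or 1. Quadratic in the input size, so this is
    * only suitable for small illustrations. */
  def auc(scores: Seq[Double], labels: Seq[Int]): Double = {
    require(scores.length == labels.length, "scores and labels must align")
    val pos = scores.zip(labels).collect { case (s, 1) => s }
    val neg = scores.zip(labels).collect { case (s, 0) => s }
    require(pos.nonEmpty && neg.nonEmpty, "need at least one example of each class")
    val wins = for (p <- pos; n <- neg) yield
      if (p > n) 1.0 else if (p == n) 0.5 else 0.0
    wins.sum / (pos.size.toLong * neg.size)
  }
}
```

A value of 1.0 means every positive tweet is scored above every negative one, and 0.5 corresponds to random ranking. When the scores are hard 0/1 predictions, many pairs tie, so the metric is coarser than one computed from continuous classifier scores.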
With IntelliJ IDEA and a build.sbt file, a .jar file can be easily compiled and deployed on a cluster using the following command:
spark-submit --master yarn --deploy-mode client path/to/jar hdfs://twitter/
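For reference, a build.sbt for a project like this might look roughly as follows. The project name and all version numbers are assumptions and should be matched to the Spark and Scala versions installed on the cluster; `setParallelism` on CrossValidator requires Spark 2.3 or later.

```scala
// build.sbt (sketch; name and versions are assumptions, not taken from this repository)
name := "twitter-sentiment"
version := "0.1.0"
scalaVersion := "2.12.18"

// "provided" keeps Spark out of the jar, since the cluster supplies it at runtime.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql"   % "3.3.2" % "provided",
  "org.apache.spark" %% "spark-mllib" % "3.3.2" % "provided"
)
```

Running `sbt package` with such a file produces a jar under `target/scala-2.12/`, which can then be passed to spark-submit.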