Real-world Spark Streaming and predictive data modelling.

Email Spam Classification

  1. Each record consists of 3 fields: the subject, the email content, and the label

  2. Each email is one of 2 classes, spam or ham

  3. 30k examples in the training set and 3k in the test set
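
For illustration, a record with these three fields could be modelled as below. This is only a sketch: the field names `subject`, `content`, and `label`, the `Email` class, and the spam = 1 / ham = 0 encoding are assumptions, not taken from the dataset itself.

```python
from dataclasses import dataclass

# Hypothetical label encoding (an assumption, not defined by the dataset).
LABELS = {"spam": 1, "ham": 0}


@dataclass(frozen=True)
class Email:
    """One record: subject, email content, and class label."""
    subject: str
    content: str
    label: str  # expected to be "spam" or "ham"

    def target(self) -> int:
        """Return the numeric class; raises KeyError on an unknown label."""
        return LABELS[self.label.strip().lower()]


example = Email(subject="You won!", content="Claim your prize now", label="spam")
print(example.target())
```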

Dataset Link: Email spam

How to run

Run the Python file, which sends the data over a TCP connection:

python3 stream.py -f <dataset name> -b <batch size>
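
The sketch below shows one way a sender like this might group records into batches and write them to a TCP socket. It is not the actual `stream.py`: the function names, the one-JSON-line-per-batch wire format, and port 6100 are all assumptions made for illustration.

```python
import json
import socket
from typing import Iterable, Iterator, List


def make_batches(records: Iterable[dict], batch_size: int) -> Iterator[List[dict]]:
    """Yield consecutive lists of at most batch_size records."""
    batch: List[dict] = []
    for record in records:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # final, possibly smaller, batch
        yield batch


def encode_batch(batch: List[dict]) -> bytes:
    """Serialize one batch as a single newline-terminated JSON line."""
    return (json.dumps(batch) + "\n").encode("utf-8")


def serve(records: Iterable[dict], batch_size: int,
          host: str = "localhost", port: int = 6100) -> None:
    """Wait for one client, then send every batch to it (port is hypothetical)."""
    with socket.create_server((host, port)) as server:
        conn, _addr = server.accept()
        with conn:
            for batch in make_batches(records, batch_size):
                conn.sendall(encode_batch(batch))
```

With this batching, 30k training examples and a batch size of 1000 would produce 30 batches.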

Run the Spark fetch job with spark-submit, redirecting stderr (where Spark writes its logs by default) to log.txt:

$SPARK_HOME/bin/spark-submit spark_fetch.py 2>log.txt
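
As a rough idea of what the receiving side could look like, here is a minimal sketch using PySpark's DStream API (`StreamingContext.socketTextStream`). It is not the actual `spark_fetch.py`: the host, port 6100, the 5-second batch interval, the app name, and the JSON-line format are assumptions.

```python
import json
from typing import List


def parse_line(line: str) -> List[dict]:
    """Decode one JSON line into a list of records; return [] if malformed."""
    try:
        payload = json.loads(line)
    except json.JSONDecodeError:
        return []
    if isinstance(payload, list):
        return payload
    if isinstance(payload, dict):
        return [payload]
    return []


def main(host: str = "localhost", port: int = 6100) -> None:
    # Imported here so parse_line can be used without PySpark installed.
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="spam-stream")  # hypothetical app name
    ssc = StreamingContext(sc, 5)  # 5-second micro-batches (assumed)
    records = ssc.socketTextStream(host, port).flatMap(parse_line)
    records.count().pprint()  # print the number of records per micro-batch
    ssc.start()
    ssc.awaitTermination()
```

Calling `main()` from a script passed to spark-submit would start the stream; the sender must already be listening on the same host and port.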

Demo to run

The batch size still needs experimentation (try values > 1000).

Image: running the stream.py file

Image: running the spark_fetch file