
Streaming retail analysis with Apache Spark

Requirements

The steps below assume:

  • Apache Spark 2.0.0 (a standalone installation is shown under Setup)
  • sbt, to build the self-contained package
  • A running Apache Kafka broker, used by the streaming pipeline and the purchases simulator

Setup

Example installation of Spark in standalone mode:

# Download and unpack the prebuilt Spark 2.0.0 distribution
wget https://archive.apache.org/dist/spark/spark-2.0.0/spark-2.0.0-bin-hadoop2.7.tgz
tar xvf spark-2.0.0-bin-hadoop2.7.tgz
rm spark-2.0.0-bin-hadoop2.7.tgz
mv spark-2.0.0-bin-hadoop2.7/ /opt/spark-2.0.0-bin-hadoop2.7
# Add the Spark binaries to the PATH
echo 'export PATH=$PATH:/opt/spark-2.0.0-bin-hadoop2.7/bin' >> ~/.bashrc
source ~/.bashrc
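
To confirm the installation is on the PATH, Spark's version banner can be printed:

spark-submit --version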

To build the project and generate the self-contained package:

sbt assembly
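
If the build succeeds, sbt-assembly writes a fat jar under target/. The exact Scala-version directory and jar name depend on build.sbt, so the listing below is only illustrative:

ls target/scala-*/*.jar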

Execution

Model training

First, the KMeans and Bisecting KMeans models must be fitted.

chmod +x start_training.sh
./start_training.sh

Once the training is over, the following folders and files should have been created (a quick check is shown after the list):

  • clustering/
  • clustering_bisect/
  • threshold
  • threshold_bisect
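
To confirm all four artifacts were written, list them from the directory where the training script was run (relative paths assumed):

ls -d clustering clustering_bisect threshold threshold_bisect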

Streaming run

To run the streaming pipeline application:

chmod +x start_pipeline.sh
./start_pipeline.sh
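
Once the application is up, one way to confirm the broker is reachable is to list the Kafka topics. For Kafka releases contemporary with Spark 2.0 the listing goes through ZooKeeper; newer releases use --bootstrap-server localhost:9092 instead (local addresses assumed):

kafka-topics.sh --zookeeper localhost:2181 --list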

Once the streaming analysis application is running, the purchases simulator can be started:

chmod +x productiondata.sh
./productiondata.sh ../resources/retail.csv purchases
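
To check that simulated purchases are actually reaching Kafka, the input topic can be tailed with the console consumer (local broker assumed):

kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic purchases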

Monitoring

The information created/extracted by the streaming pipeline is written to four Kafka topics, as the architecture diagram shows. The created topics are named:

  • cancelaciones (cancellations)
  • facturas_erroneas (erroneous invoices)
  • anomalias_kmeans (anomalies detected by KMeans)
  • anomalias_bisect_kmeans (anomalies detected by Bisecting KMeans)
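
Each of these topics can be inspected from a terminal with the Kafka console consumer; for example, to read the KMeans anomalies topic from the beginning (local broker assumed):

kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic anomalias_kmeans --from-beginning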