We assume you have a version of Spark (including Hadoop), Anaconda, Docker, and Kubernetes (i.e. kubectl) installed on your computer.
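As a quick sanity check, the following standard commands should all succeed before you start (exact versions are up to you):

kubectl version --client
docker --version
conda --version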
Set up the environment
conda create -n hogwild-spark python=3.7 scipy numpy pyspark
source activate hogwild-spark
You first need to set the environment variable SPARK_HOME and update the spark_home argument in the Dockerfile in order to be able to build the image. If you want to change the name of the image, this can be done in docker-image-tool.sh.
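For example (the path below is a placeholder, and the exact spelling of the Dockerfile build argument may differ in your copy):

export SPARK_HOME=/path/to/spark-with-hadoop   # placeholder path, use your local Spark installation
# then point the spark_home argument in the Dockerfile at the same location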
Then, to build and push the image:
bash docker-image-tool.sh -t tag build
bash docker-image-tool.sh -t tag push
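For instance, with a concrete (arbitrary) tag in place of tag:

bash docker-image-tool.sh -t v1.0 build
bash docker-image-tool.sh -t v1.0 push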
Note: numpy and scipy take a long time to build the first time you create the image.
You first need to update GROUP_NAME and IMAGE in run.sh.
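For reference, these variables might look like the following (hypothetical values; use your own group/namespace and the registry/image name you pushed above):

GROUP_NAME=my-group                      # placeholder group name
IMAGE=my-registry/hogwild-spark:v1.0     # placeholder registry and image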
Then, to run the app (-w sets the number of executors):
bash run.sh -w 4
Alternatively, you can combine the two steps (build and run) by passing additional arguments:
bash run.sh -w 4 -t tag -n AppName -b
The simplest way to access the logs stored in the container is to create another pod and then use kubectl cp:
kubectl create -f helper/shell.yml
kubectl cp shell-pod:/data/logs path/to/your/logs
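If you are not sure which log files exist, you can list them first (assuming the pod mounts the same /data/logs directory):

kubectl exec shell-pod -- ls /data/logs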
To open a bash shell and inspect the /data folder, you can use the above pod by calling kubectl attach -t -i shell-pod (this also allows you to delete old logs if necessary).
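For example, once attached to the shell pod (the file name below is a placeholder):

rm /data/logs/old-run.log   # "old-run.log" is a placeholder, remove whichever logs you no longer need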