Simple setup to test luigi and spark integration.
export INSTALL_DIR=$HOME
cd $INSTALL_DIR
# go to https://spark.apache.org/downloads.html to select a different version/mirror
curl -OL https://ftp.nluug.nl/internet/apache/spark/spark-3.1.2/spark-3.1.2-bin-hadoop3.2.tgz
tar xvf spark-3.1.2-bin-hadoop3.2.tgz
cd spark-3.1.2-bin-hadoop3.2
echo SPARK_MASTER_HOST=localhost > conf/spark-env.sh
./sbin/start-master.sh
./sbin/start-worker.sh spark://localhost:7077
# logs should be in $PWD/logs
# dashboard should be at http://localhost:8080
Stopping the spark cluster:
./sbin/stop-worker.sh
./sbin/stop-master.sh
Ensure that spark-3.1.2-bin-hadoop3.2/bin is on your PATH:
export OLD_PATH=$PATH
export PATH=$OLD_PATH:$INSTALL_DIR/spark-3.1.2-bin-hadoop3.2/bin
SparkScriptTask is a pretty thin wrapper around spark-submit, adding luigi monitoring and output control.
PYTHONPATH=$PWD luigi --module spark_tasks SparkScriptTask --local-scheduler --partitions 10
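For reference, a minimal sketch of what such a task could look like, built on luigi.contrib.spark.SparkSubmitTask (the contrib base class that shells out to spark-submit). The script name, output path, and option handling below are illustrative assumptions, not necessarily what spark_tasks actually does:

import luigi
from luigi.contrib.spark import SparkSubmitTask

class SparkScriptTask(SparkSubmitTask):
    # where spark-submit sends the job
    master = 'spark://localhost:7077'
    # hypothetical standalone script submitted to the cluster
    app = 'count_words.py'
    partitions = luigi.IntParameter(default=10)

    def app_options(self):
        # forwarded to the script as command-line arguments
        return [str(self.partitions), self.output().path]

    def output(self):
        return luigi.LocalTarget('counts.csv')

luigi reruns the task only when output() is missing, which is the output control mentioned above.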
SparkModuleTask runs code directly on the cluster (as opposed to submitting a standalone script). This probably means more work to ensure that identical python versions are running on both the luigi node and the spark workers.
PYTHONPATH=$PWD luigi --module spark_tasks SparkModuleTask --local-scheduler --partitions 10
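Again for reference, a minimal sketch built on luigi.contrib.spark.PySparkTask, which pickles the task, ships it to the cluster via spark-submit, and calls main() there with a ready SparkContext. The computation and output path are made up for illustration:

import luigi
from luigi.contrib.spark import PySparkTask

class SparkModuleTask(PySparkTask):
    master = 'spark://localhost:7077'
    partitions = luigi.IntParameter(default=10)

    def output(self):
        return luigi.LocalTarget('range_sum.txt')

    def main(self, sc, *args):
        # runs on the cluster, not on the luigi node; hence the
        # need for matching python versions on both sides
        total = sc.parallelize(range(1000), self.partitions).sum()
        with self.output().open('w') as f:
            f.write(str(total))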