This setup was tested on Ubuntu 20.04.
- docker (Version 20.10.14)

  Install the Docker Engine Community Edition by following the steps in the official documentation here.
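  A minimal sketch of one common route on Ubuntu, using Docker's official convenience script (the apt-based install from the same docs works equally well):

  ```shell
  # Download and run Docker's official convenience script,
  # then confirm the installed version.
  curl -fsSL https://get.docker.com -o get-docker.sh
  sudo sh get-docker.sh
  docker --version   # should report 20.10.x
  ```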
- docker-compose (Version 1.29.2)

  Install docker-compose, which relies on the Docker Engine, by following the steps here.
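  A minimal sketch following the Compose v1 release-binary instructions, pinned to the version used here:

  ```shell
  # Fetch the 1.29.2 binary for this platform and make it executable.
  sudo curl -L "https://github.com/docker/compose/releases/download/1.29.2/docker-compose-$(uname -s)-$(uname -m)" \
    -o /usr/local/bin/docker-compose
  sudo chmod +x /usr/local/bin/docker-compose
  docker-compose --version   # should report 1.29.2
  ```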
- Create directories

  ```shell
  mkdir -p ./dags ./logs ./plugins
  ```

  These directories are mounted into the Airflow containers.
- Set Airflow UID

  ```shell
  echo -e "AIRFLOW_UID=$(id -u)" > .env
  ```

  Update the `AIRFLOW_UID` variable in the docker-compose file with the result of `id -u`. In our case it is set to `1000`.
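  A quick sanity check, assuming the commands are run from the directory holding the docker-compose file:

  ```shell
  # The UID written to .env should match your own user ID.
  id -u      # e.g. 1000
  cat .env   # should print AIRFLOW_UID=1000
  ```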
- Initialize the database

  ```shell
  docker-compose up airflow-init
  ```

  For more information, check Airflow's documentation.
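  Once the run finishes, the init service should have exited cleanly; a minimal check, assuming the service is named `airflow-init` as in Airflow's reference docker-compose.yaml:

  ```shell
  # State should read "Exit 0" once initialization is done.
  docker-compose ps airflow-init
  ```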
You can start all the services by running `docker-compose up -d`.
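To confirm everything came up, you can check the container status:

```shell
# All services should eventually show "Up" (and "healthy" where a healthcheck is defined).
docker-compose ps
```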
Access the Airflow web interface at http://localhost:8080. The default account has the login `airflow` and the password `airflow`.
Go to the Airflow web interface under Admin -> Connections -> Add a new connection and add a new Spark connection as shown below:
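If you prefer the CLI, the same connection can be created from inside the webserver container. The service name `airflow-webserver`, the connection id `spark_default`, and the Spark master host/port below are assumptions; adjust them to your docker-compose file and Spark setup:

```shell
# Create the Spark connection via the Airflow CLI instead of the UI.
docker-compose exec airflow-webserver \
  airflow connections add 'spark_default' \
    --conn-type 'spark' \
    --conn-host 'spark://spark-master' \
    --conn-port '7077'
```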
Go to the Airflow web interface under DAGs and search for the `spark_word_count` DAG (you can filter by the tag `spark`). You can now trigger it as shown below:
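The same can be done from the CLI; again, `airflow-webserver` is the assumed service name:

```shell
# Unpause the DAG (new DAGs start paused by default) and trigger a run.
docker-compose exec airflow-webserver airflow dags unpause spark_word_count
docker-compose exec airflow-webserver airflow dags trigger spark_word_count
```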