This project is designed as a starting point for understanding and learning about the ETL process. ETL stands for Extract, Transform and Load, and the main idea of this project is to use the open data available about New York City taxis to build an ETL: you extract the data from "parquet" files, load it into memory, and transform it into a new dataset.
Data dictionaries and datasets can be found on the NYC page: NYC Portal
Scenarios considered in the project (a PySpark sketch for the first scenario follows the list):
- Using the columns:
tpep_pickup_datetime
tpep_dropoff_datetime
Trip_distance
PULocationID
DOLocationID
- Using the columns:
PULocationID
DOLocationID
Trip_distance
Tolls_amount
Total_amount
- Using the columns:
tpep_pickup_datetime
tpep_dropoff_datetime
Passenger_count
Payment_type
Total_amount
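As an illustration of the first scenario, here is a minimal PySpark sketch that derives a trip duration from the pickup and dropoff timestamps. This is a sketch, not the project's actual code; the file path is illustrative, and the column casing follows the parquet files, which may differ from the data dictionary.

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("TripDurationSketch").getOrCreate()

# Illustrative local path; see the download variables below.
df = spark.read.parquet("downloads/yellow_tripdata_2021-01.parquet")

# Trip duration in minutes from the pickup/dropoff timestamps.
trips = df.select(
    "PULocationID",
    "DOLocationID",
    "trip_distance",
    ((F.unix_timestamp("tpep_dropoff_datetime")
      - F.unix_timestamp("tpep_pickup_datetime")) / 60).alias("duration_min"),
)

# Average duration and distance per pickup/dropoff pair.
trips.groupBy("PULocationID", "DOLocationID").avg("duration_min", "trip_distance").show()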
To run this project you need a properly configured environment. The instructions below contain the required steps to run the entire project.
Install sdkman and activate the environment
curl -s "https://get.sdkman.io" | bash
# Inside the project folder install java, spark and hadoop (required once)
sdk env install
# Then you can simply enable the environment with
sdk env
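For reference, sdk env install reads a .sdkmanrc file at the project root. A hypothetical example is shown below; the exact versions are assumptions, so check the file committed in this repository.

java=11.0.21-tem
spark=3.3.4
hadoop=3.3.5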
We recommend using a Python virtual environment
python -m venv venv
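Activate it before installing the dependencies

source venv/bin/activate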
Then install the dependencies; the -e flag installs the project in editable mode, reading the pyproject.toml in the current folder
pip install -e .
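For context, a minimal pyproject.toml that pip install -e . can read might look like the snippet below. The name matches the src folder, but the contents are illustrative, not the project's actual file.

[project]
name = "taxis_etl_prft"
version = "0.1.0"
dependencies = ["pyspark"]

[build-system]
requires = ["setuptools"]
build-backend = "setuptools.build_meta"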
Set up the following environment variables
TAXIS_ETL_FILES, TAXIS_ETL_DOWNLOAD_FOLDER, TAXIS_ETL_BASE_URL, TAXIS_ETL_DOWNLOAD_FILES
Files from S3
export TAXIS_ETL_FILES=yellow_tripdata_2020-01.parquet,yellow_tripdata_2021-01.parquet,yellow_tripdata_2022-01.parquet
export TAXIS_ETL_DOWNLOAD_FOLDER=downloads
export TAXIS_ETL_BUCKET_NAME=s3-bucket
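A sketch of how the ETL code might consume these variables; the parsing below is an assumption for illustration, not the project's actual implementation.

import os

# Comma-separated list of parquet files, as exported above.
files = os.environ["TAXIS_ETL_FILES"].split(",")
download_folder = os.environ.get("TAXIS_ETL_DOWNLOAD_FOLDER", "downloads")
bucket = os.environ["TAXIS_ETL_BUCKET_NAME"]

# Build one s3a URI per file.
paths = [f"s3a://{bucket}/{name}" for name in files]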
Set AWS credentials
Set the AWS credentials for PySpark by running the commands
export AWS_ACCESS_KEY_ID=$(aws configure get aws_access_key_id)
export AWS_SECRET_ACCESS_KEY=$(aws configure get aws_secret_access_key)
If you're running from a devcontainer, check that the AWS credentials were passed through by printing one of them with the command
echo $AWS_ACCESS_KEY_ID
The ETL can be executed with the spark-submit command.
When you have a simple script that gets its data through Spark alone, you can run it by passing the Python script
spark-submit src/taxis_etl_prft/SimpleApp.py
But if your script depends on something external, for example s3a from hadoop-aws, you will need to load those packages into the Spark instance. In that scenario, provide the packages with the --packages option, for example
spark-submit --packages org.apache.hadoop:hadoop-aws:3.3.4,com.amazonaws:aws-java-sdk-bundle:1.12.262 src/taxis_etl_prft/DemoS3.py
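For context, a script like DemoS3.py would typically read straight from S3 through the s3a scheme once the packages and credentials are in place. A hedged sketch follows; the bucket name reuses the export above and the object key is a placeholder.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("DemoS3Sketch").getOrCreate()

# With hadoop-aws on the classpath and AWS_ACCESS_KEY_ID /
# AWS_SECRET_ACCESS_KEY exported, the environment variable
# credentials provider picks up the credentials automatically.
df = spark.read.parquet("s3a://s3-bucket/yellow_tripdata_2021-01.parquet")
df.printSchema()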
- Santiago Ruiz
- Diego Peña
If you see this exception in the logs when you run the spark-submit command
java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
Please ensure that you're providing the packages in the spark-submit command.
If you see this exception in the logs when you run the spark-submit command
Caused by: org.apache.hadoop.fs.s3a.auth.NoAuthWithAWSException:
No AWS Credentials provided by TemporaryAWSCredentialsProvider SimpleAWSCredentialsProvider EnvironmentVariableCredentialsProvider IAMInstanceCredentialsProvider
: com.amazonaws.SdkClientException: Unable to load AWS credentials from environment variables (AWS_ACCESS_KEY_ID (or AWS_ACCESS_KEY) and AWS_SECRET_KEY (or AWS_SECRET_ACCESS_KEY))
Configure the AWS credentials with the AWS CLI.
IMPORTANT: for some reason, running aws configure alone is not enough; you are also required to export the variables to the environment.
Export the AWS credentials (this can be done after the AWS configuration)
export AWS_ACCESS_KEY_ID=$(aws configure get aws_access_key_id)
export AWS_SECRET_ACCESS_KEY=$(aws configure get aws_secret_access_key)
If you see this exception in the logs
py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: java.lang.NoSuchFieldError: JAVA_9
Enable Java 11 from sdkman.
This error also appears when your environment does not have the Python dependencies installed; see the section "Install python dependencies".